Michael Bozada
bozada.2@wright.edu
Quinn Hirt
hirt.14@wright.edu
Cranfield dataset files downloadable from Prasad's website.
corpus/cran.all has been slightly modified to make processing the dataset for Lucene easier.
Index created by src/Main.java using Lucene.
Must be deleted before each run with createIndex = true.
Input files required by Main.java.
- corpus_data.txt: corpus/cran.all processed using scripts/split_corpus.py
- myQuery.txt: Test queries manually extracted from corpus/query.text
- myQueryRels.txt: corpus/qrels.text processed using scripts/format_relevance.py; each line follows the format "query_id relevant_doc_id ...".
- stop_words_english.txt: Downloaded from countwordsfree.com
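As a rough illustration of how corpus_data.txt can be derived from corpus/cran.all (the actual scripts/split_corpus.py may differ): Cranfield files mark each document with a ".I <id>" line, so splitting on that marker yields one record per document. The field markers (.T, .A, .B, .W) and the output layout below are assumptions, not the script's exact behavior.

```python
# Hypothetical sketch of splitting corpus/cran.all into per-document records.
# Assumes each document starts with a ".I <id>" line; the remaining field
# markers (.T title, .A author, .B bibliography, .W body) are left in place.
def split_corpus(text):
    docs = {}
    doc_id, lines = None, []
    for line in text.splitlines():
        if line.startswith(".I "):
            if doc_id is not None:
                docs[doc_id] = "\n".join(lines)
            doc_id = line.split()[1]
            lines = []
        else:
            lines.append(line)
    if doc_id is not None:
        docs[doc_id] = "\n".join(lines)
    return docs

sample = ".I 1\n.T\nsample title\n.W\nsample body\n.I 2\n.W\nanother body"
print(split_corpus(sample)["1"])
```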
Required Lucene 9.0.0 .jar files.
Output of running the 20 selected test queries with different Search Engine configurations.
- 1_standard_resultx.txt: StandardAnalyzer, Single Field Query Parser
- 2_multiIndex_resultx.txt: StandardAnalyzer, Boosted MultiField Query Parser
- 3_stopWords_resultx.txt: StopAnalyzer, Single Field Query Parser
- 4_multiIndex_stopWords_resultx.txt: StopAnalyzer, Boosted MultiField Query Parser
Python Scripts used to prepare the corpus for Lucene.
- format_relevance.py: Creates input/myQueryRels.txt using corpus/qrels.text
- split_corpus.py: Creates input/corpus_data.txt using corpus/cran.all
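As an illustration of the kind of grouping format_relevance.py performs (the real script may differ): each qrels.text line is assumed to be a "query_id doc_id relevance" triple, and collecting the doc ids per query produces the "query_id relevant_doc_id ..." lines used in input/myQueryRels.txt.

```python
# Hypothetical sketch of the qrels grouping done by scripts/format_relevance.py.
# Assumes each qrels.text line is "query_id doc_id relevance_level".
def format_relevance(qrels_text):
    groups = {}
    for line in qrels_text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank/malformed lines
        query_id, doc_id = parts[0], parts[1]
        groups.setdefault(query_id, []).append(doc_id)
    # Emit one "query_id doc1 doc2 ..." line per query.
    return "\n".join(f"{q} {' '.join(docs)}" for q, docs in groups.items())

print(format_relevance("1 184 2\n1 29 2\n2 12 3"))
```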
Main.java is the single source file; run it to use the Search Engine.
createIndex() on line 129
Straightforward index creation; can be run with the StopAnalyzer instead of the default StandardAnalyzer if needed.
searchIndex() on line 164
Uses either StandardAnalyzer or StopAnalyzer depending on the set flags.
If multiIndex is set to true, it will use the MultiFieldQueryParser with boosted fields.
We used Total Hits, Recall, Precision, and MAP to evaluate different configurations.
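For reference, a minimal sketch of how per-query Precision, Recall, and Average Precision (averaged into MAP across queries) can be computed from a ranked result list and a relevant-document set. The function and variable names are illustrative; this is not the evaluation code in Main.java.

```python
# Hypothetical per-query evaluation metrics; not the actual code in Main.java.
def precision_recall(retrieved, relevant):
    hits = sum(1 for d in retrieved if d in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def average_precision(retrieved, relevant):
    # Mean of precision@k over the ranks k where a relevant doc appears.
    hits, total = 0, 0.0
    for k, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

# MAP is the mean of average_precision over all test queries.
queries = {"q1": (["d1", "d3", "d2"], {"d1", "d2"})}
ap_scores = [average_precision(r, rel) for r, rel in queries.values()]
print(sum(ap_scores) / len(ap_scores))
```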
Due to inexperience, it's difficult to say whether the different configurations "enhanced" search results.
Surprisingly, the standard configuration had the highest MAP at 0.011253, though that is still a very low score.
With just the multiIndex option enabled, hits blew up across the board. That drove Recall to 1.000 for almost all queries, but Recall isn't a meaningful metric when you're retrieving nearly every document...
Enabling only stop words resulted in the lowest MAP and mixed performance on individual queries: it brought hits to their lowest numbers, but Recall and Precision were frequently 0. A different stop-word list might improve results.
Finally, using both the boosted MultiField parser and the StopAnalyzer gave somewhat confusing results. It isn't clearly the best or worst of both worlds; performance falls somewhere between the extremes of MultiField-only and stop-words-only.
In conclusion, working with Lucene was interesting but frustrating. It's difficult to assess performance and even more difficult to concretely improve search results.