mbozada/SearchEngine


Group Members

Michael Bozada
bozada.2@wright.edu

Quinn Hirt
hirt.14@wright.edu

Directories and Files

corpus/

Cranfield dataset files downloadable from Prasad's website.
corpus/cran.all has been slightly modified to make processing the dataset for Lucene easier.

index/

Index created by src/Main.java using Lucene.
Must be deleted any time createIndex = true so the index can be rebuilt from scratch.

input/

Input files required by Main.java.

  • corpus_data.txt: corpus/cran.all processed using scripts/split_corpus.py
  • myQuery.txt: Test queries manually extracted from corpus/query.text
  • myQueryRels.txt: corpus/qrels.text processed using scripts/format_relevance.py. Each line follows the format "query_id relevant_doc_id...".
  • stop_words_english.txt: Downloaded from countwordsfree.com

lib/

Lucene 9.0.0 required .jar files.

output/

Output of running the 20 selected test queries with different Search Engine configurations.

  • 1_standard_resultx.txt: StandardAnalyzer, Single Field Query Parser
  • 2_multiIndex_resultx.txt: StandardAnalyzer, Boosted MultiField Query Parser
  • 3_stopWords_resultx.txt: StopAnalyzer, Single Field Query Parser
  • 4_multiIndex_stopWords_resultx.txt: StopAnalyzer, Boosted MultiField Query Parser

scripts/

Python Scripts used to prepare the corpus for Lucene.

  • format_relevance.py: Creates input/myQueryRels.txt using corpus/qrels.text
  • split_corpus.py: Creates input/corpus_data.txt using corpus/cran.all
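Reading the resulting input/myQueryRels.txt back into memory is straightforward; a minimal sketch, assuming only the "query_id relevant_doc_id..." line format described above (the class name and the sample ids are illustrative):

```java
import java.util.*;

public class QrelsParser {
    // Parse lines of the form "query_id doc_id doc_id ..." into a map
    // from query id to the set of relevant document ids.
    static Map<String, Set<String>> parseQrels(List<String> lines) {
        Map<String, Set<String>> qrels = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length < 2) continue;  // skip blank/malformed lines
            Set<String> docs = new LinkedHashSet<>(
                    Arrays.asList(parts).subList(1, parts.length));
            qrels.put(parts[0], docs);
        }
        return qrels;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> qrels =
                parseQrels(List.of("1 184 29 31", "2 12 15"));
        System.out.println(qrels.get("1"));  // prints [184, 29, 31]
    }
}
```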

src/

Main.java is the single source file. Run it to use the Search Engine.

Report

Indexer

createIndex() on line 129
Builds the Lucene index over the corpus; pretty straightforward. Can be run with the StopAnalyzer instead of the default StandardAnalyzer if needed.
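In outline, the indexing step looks roughly like the following. This is a minimal sketch, not the project's actual createIndex(): the field names "title" and "body", the sample text, and the hard-coded "index" path are all assumptions.

```java
import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class IndexSketch {
    public static void main(String[] args) throws IOException {
        // StandardAnalyzer by default; a StopAnalyzer built from
        // input/stop_words_english.txt could be swapped in here.
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig cfg = new IndexWriterConfig(analyzer);
        try (IndexWriter writer =
                new IndexWriter(FSDirectory.open(Paths.get("index")), cfg)) {
            // One Document per Cranfield abstract; fields are illustrative.
            Document doc = new Document();
            doc.add(new TextField("title", "boundary layer flow", Field.Store.YES));
            doc.add(new TextField("body", "an experimental study of ...", Field.Store.YES));
            writer.addDocument(doc);
        }
    }
}
```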

Index Searcher

searchIndex() on line 164
Uses either StandardAnalyzer or StopAnalyzer depending on the set flags.
If multiIndex is set to true, it will use the MultiFieldQueryParser with boosted fields.
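The boosted multi-field path (Lucene's MultiFieldQueryParser) presumably looks something like this sketch; the field names, boost values, sample query, and result count are assumptions, not the report's actual settings:

```java
import java.io.IOException;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class SearchSketch {
    public static void main(String[] args) throws IOException, ParseException {
        Map<String, Float> boosts = new HashMap<>();
        boosts.put("title", 2.0f);  // weight title matches more heavily
        boosts.put("body", 1.0f);
        MultiFieldQueryParser parser = new MultiFieldQueryParser(
                new String[] {"title", "body"}, new StandardAnalyzer(), boosts);
        Query query = parser.parse("experimental investigation of boundary layer");
        try (DirectoryReader reader =
                DirectoryReader.open(FSDirectory.open(Paths.get("index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs top = searcher.search(query, 50);
            for (ScoreDoc sd : top.scoreDocs) {
                System.out.println(sd.doc + "\t" + sd.score);
            }
        }
    }
}
```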

Lucene Experience

We used Total Hits, Recall, Precision, and MAP to evaluate different configurations.
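For reference, average precision per query (whose mean over all test queries is MAP) can be computed from a ranked result list and the qrels as below. This is a self-contained sketch; the document ids in main are illustrative, not from the Cranfield data.

```java
import java.util.List;
import java.util.Set;

public class Metrics {
    // Average precision: sum of precision@k over the ranks k where a
    // relevant document appears, divided by the number of relevant docs.
    static double averagePrecision(List<String> ranked, Set<String> relevant) {
        if (relevant.isEmpty()) return 0.0;
        double found = 0.0, sum = 0.0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                found += 1.0;
                sum += found / (i + 1);  // precision at rank i+1
            }
        }
        return sum / relevant.size();
    }

    public static void main(String[] args) {
        List<String> ranked = List.of("d3", "d1", "d7", "d2");
        Set<String> relevant = Set.of("d1", "d2");
        // relevant hits at ranks 2 and 4: (1/2 + 2/4) / 2 = 0.5
        System.out.println(averagePrecision(ranked, relevant));  // prints 0.5
    }
}
```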

Due to inexperience, it's difficult to say whether the different configurations "enhanced" search results.

Surprisingly, the standard configuration had the highest MAP at 0.011253, though that is still a very low score in absolute terms.

With just the multiIndexer enabled, hit counts blew up across the board. That pushed Recall to 1.000 for almost all queries, but Recall isn't a meaningful metric when you're retrieving every document...

Enabling only stop words resulted in the lowest MAP and mixed performance on individual queries. It brought hit counts to their lowest, but Recall and Precision were frequently 0. Maybe a different stop words list would improve results.
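Swapping in a different stop list would be a small change if the analyzer is built from a file. A hedged sketch, assuming Lucene's StopAnalyzer(Path) constructor from the 9.x analysis-common module and a word-per-line list like the existing input/stop_words_english.txt:

```java
import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.core.StopAnalyzer;

public class StopListSketch {
    public static void main(String[] args) throws IOException {
        // Build a StopAnalyzer from a word-per-line stop list file;
        // point this at any alternative list to experiment.
        StopAnalyzer analyzer =
                new StopAnalyzer(Paths.get("input/stop_words_english.txt"));
        // ... pass `analyzer` to IndexWriterConfig / the query parser as usual
        analyzer.close();
    }
}
```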

Finally, using both the boosted multiIndex and the stopWords analyzer gave somewhat confusing results. It's hard to say whether this is the best or worst of both worlds, but performance seems to fall between the extremes of multiIndex-only and stopWords-only.

In conclusion, working with Lucene was interesting but frustrating. It's difficult to assess performance and even more difficult to concretely improve search results.

About

Search Engine using Lucene created for CS7800 Information Retrieval.
