Michael Bozada
bozada.2@wright.edu
Quinn Hirt
hirt.14@wright.edu
Cranfield dataset files downloadable from Prasad's website.
corpus/cran.all has been slightly modified to make processing the dataset for Lucene easier.
Index created by src/Main.java using Lucene.
Must be deleted before each run with createIndex = true.
Input files required by Main.java.
- corpus_data.txt: corpus/cran.all processed using scripts/split_corpus.py
- myQuery.txt: Test queries manually extracted from corpus/query.text
- myQueryRels.txt: corpus/qrels.text processed using scripts/format_relevance.py; each line follows the format "query_id relevant_doc_id ...".
- stop_words_english.txt: Downloaded from countwordsfree.com
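As a rough illustration of how corpus_data.txt can be derived from corpus/cran.all (the actual scripts/split_corpus.py may differ): Cranfield files mark each document with a ".I <id>" line, so splitting on that marker yields one record per document. The field markers (.T, .A, .B, .W) and the output layout below are assumptions, not the script's exact behavior.

```python
# Hypothetical sketch of splitting corpus/cran.all into per-document records.
# Assumes each document starts with a ".I <id>" line; the remaining field
# markers (.T title, .A author, .B bibliography, .W body) are left in place.
def split_corpus(text):
    docs = {}
    doc_id, lines = None, []
    for line in text.splitlines():
        if line.startswith(".I "):
            if doc_id is not None:
                docs[doc_id] = "\n".join(lines)
            doc_id = line.split()[1]
            lines = []
        else:
            lines.append(line)
    if doc_id is not None:
        docs[doc_id] = "\n".join(lines)
    return docs

sample = ".I 1\n.T\nsample title\n.W\nsample body\n.I 2\n.W\nanother body"
print(split_corpus(sample)["1"])
```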
Required Lucene 9.0.0 .jar files.
Output of running the 20 selected test queries with different Search Engine configurations.
- 1_standard_resultx.txt: StandardAnalyzer, Single Field Query Parser
- 2_multiIndex_resultx.txt: StandardAnalyzer, Boosted MultiField Query Parser
- 3_stopWords_resultx.txt: StopAnalyzer, Single Field Query Parser
- 4_multiIndex_stopWords_resultx.txt: StopAnalyzer, Boosted MultiField Query Parser
Python Scripts used to prepare the corpus for Lucene.
- format_relevance.py: Creates input/myQueryRels.txt using corpus/qrels.text
- split_corpus.py: Creates input/corpus_data.txt using corpus/cran.all
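As an illustration of the kind of grouping format_relevance.py performs (the real script may differ): each qrels.text line is assumed to be a "query_id doc_id relevance" triple, and collecting the doc ids per query produces the "query_id relevant_doc_id ..." lines used in input/myQueryRels.txt.

```python
# Hypothetical sketch of the qrels grouping done by scripts/format_relevance.py.
# Assumes each qrels.text line is "query_id doc_id relevance_level".
def format_relevance(qrels_text):
    groups = {}
    for line in qrels_text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank/malformed lines
        query_id, doc_id = parts[0], parts[1]
        groups.setdefault(query_id, []).append(doc_id)
    # Emit one "query_id doc1 doc2 ..." line per query.
    return "\n".join(f"{q} {' '.join(docs)}" for q, docs in groups.items())

print(format_relevance("1 184 2\n1 29 2\n2 12 3"))
```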
Main.java is the single source file; run it to use the Search Engine.
createIndex() on line 129
Straightforward index creation; can be run with the StopAnalyzer instead of the default StandardAnalyzer if needed.
searchIndex() on line 164
Uses either StandardAnalyzer or StopAnalyzer depending on the set flags.
If multiIndex is set to true, it will use the MultiFieldQueryParser with boosted fields.
We used Total Hits, Recall, Precision, and MAP to evaluate different configurations.
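For reference, a minimal sketch of how per-query Precision, Recall, and Average Precision (averaged into MAP across queries) can be computed from a ranked result list and a relevant-document set. The function and variable names are illustrative; this is not the evaluation code in Main.java.

```python
# Hypothetical per-query evaluation metrics; not the actual code in Main.java.
def precision_recall(retrieved, relevant):
    hits = sum(1 for d in retrieved if d in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def average_precision(retrieved, relevant):
    # Mean of precision@k over the ranks k where a relevant doc appears.
    hits, total = 0, 0.0
    for k, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

# MAP is the mean of average_precision over all test queries.
queries = {"q1": (["d1", "d3", "d2"], {"d1", "d2"})}
ap_scores = [average_precision(r, rel) for r, rel in queries.values()]
print(sum(ap_scores) / len(ap_scores))
```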
Due to inexperience, it's difficult to say whether the different configurations "enhanced" search results.
Surprisingly, the standard configuration had the highest MAP at 0.011253, though that is still a very low score.
With just the multiIndex option enabled, hits blew up across the board. That drove Recall to 1.000 for almost all queries, but Recall isn't a meaningful metric when you're retrieving nearly every document...
Enabling only stop words resulted in the lowest MAP and mixed performance on individual queries: it brought hits to their lowest numbers, but Recall and Precision were frequently 0. A different stop-word list might improve results.
Finally, using both the boosted MultiField parser and the StopAnalyzer gave somewhat confusing results. It isn't clearly the best or worst of both worlds; performance falls somewhere between the extremes of MultiField-only and stop-words-only.
In conclusion, working with Lucene was interesting but frustrating. It's difficult to assess performance and even more difficult to concretely improve search results.