Homework 7
Due: Oct. 30, 2013 (before class)
October 21, 2013

Problem 1: Query Expansion (40pt)

In this problem, you are asked to implement query expansion using the Lucene library. You will use the same document collection and queries that were used in Homework 5, which can also be found at http://www.cse.msu.edu/~cse484/hw/hw4.zip. The assignment consists of two parts. In the first part, you are asked to implement a simple query expansion algorithm that expands a query with the most frequent words appearing in the top-ranked documents. In the second part, you are asked to identify the limitation of the approach from the first part and devise your own solution to address it.

Part I (20pt)

In this part, you are asked to implement a simple heuristic to expand the queries. To facilitate your development, you are provided with simple Java template code that can be downloaded from http://www.cse.msu.edu/~cse484/hw/hw7_code.zip. In the downloaded file, you will find two directories: IndexTREC and BatchSearch. Files under IndexTREC are used for document indexing, and files under BatchSearch are used for query expansion. You need to create a Java project for each directory and compile it into class files. Your implementation of query expansion goes in the file BatchSearch.java under the directory BatchSearch, which contains brief instructions for implementing query expansion. You need to accomplish the following tasks in this part of the homework:

• Index the collection of documents using IndexTREC. Note that you should not re-use the index generated in the previous assignments.
• For each query in /query/query.sgml, find the most related words as follows. First retrieve the top 100 ranked documents and count, for each non-query word, the number of top-ranked documents in which it appears. Then return the 20 most frequent non-query words appearing in the top 100 ranked documents.
• Submit your code and the expanded query words for each query. Note that none of the expanded query words may appear in the original query. Discuss your observations about the expanded query words.

Part II (20pt)

Based on your observations from Part I, devise a better strategy for query expansion that alleviates the limitation of the approach presented in Part I. You need to submit (1) a short description of your query expansion strategy, (2) an implementation of your algorithm, and (3) the expanded query words, none of which may appear in the original query.
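
As a rough illustration of the counting step in Part I, the following Java sketch ranks non-query words by the number of top-ranked documents that contain them. It deliberately leaves out Lucene: it assumes the top-ranked documents have already been retrieved and tokenized into lists of words. The class name QueryExpansionSketch, the method expand, and the sample data are hypothetical and are not part of the provided template. Ties are broken alphabetically, which is an assumption the assignment does not specify.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of the BatchSearch template.
class QueryExpansionSketch {

    // Returns up to k non-query words, ordered by the number of top-ranked
    // documents containing them (ties broken alphabetically, an assumption).
    public static List<String> expand(Set<String> queryTerms,
                                      List<List<String>> topDocs,
                                      int k) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : topDocs) {
            // De-duplicate so each word counts at most once per document.
            Set<String> unique = new HashSet<>(doc);
            for (String word : unique) {
                if (!queryTerms.contains(word)) {
                    docFreq.merge(word, 1, Integer::sum);
                }
            }
        }

        // Sort by descending document count, then alphabetically.
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(docFreq.entrySet());
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        List<String> expansion = new ArrayList<>();
        for (int i = 0; i < ranked.size() && i < k; i++) {
            expansion.add(ranked.get(i).getKey());
        }
        return expansion;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for a query and its top-ranked documents.
        Set<String> query = new HashSet<>(Arrays.asList("oil", "spill"));
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("oil", "spill", "tanker", "coast"),
                Arrays.asList("tanker", "oil", "cleanup"),
                Arrays.asList("coast", "tanker", "spill", "spill"));
        System.out.println(expand(query, docs, 2)); // prints [tanker, coast]
    }
}
```

For the assignment itself, topDocs would hold the words of the 100 top-ranked documents for a query and k would be 20. How the words are obtained from the Lucene index is left to the instructions in BatchSearch.java.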