Hash Tables Lab Assignment 4 Hugo Serrano June 7, 2013 1 Introduction In this lab, you will have to implement a hash table and test how different hash function techniques impact the performance of such data structure. Topics covered in this activity: • Hash tables; • Linked lists; • Hash functions; • Collision Resolution with Chaining 1.1 The Problem The Postal Service is developing a new technology that aims to inform its users at the time of posting, the estimated time for delivery of a mail. This technology will be embedded in mailboxes. When a mail is inserted into the mailbox, a scanner reads the origin and destination on the envelope, and the system calculates the total distance that the letter will travel. 1.2 The System For this task, you will have to implement the following components of the sys- tem: Storage system - for performance, all the data must be stored in a hash table. Search engine - the search engine will receive as input a ZIP code (key) and will retrieve from the hash table the latitude and longitude (value) of the region corresponding to the ZIP code. Distance calculator - based on the coordinates of two ZIP codes (origin and destination), the calculator must provide the distance between the two points in miles or kilometers. 1 1.3 The Dataset The dataset you will use for this task is a CSV file (zipcodes.csv) with three columns, namely zip, latitude and longitude like in the example below. 1. 00501,40.81,-73.04 00544,40.81,-73.04 00601,18.16,-66.72 00602,18.38,-67.18 00603,18.43,-67.15 00604,18.43,-67.15 00605,18.43,-67.15 00606,18.18,-66.98 00610,18.28,-67.14 2 The Task Your code must provide the following functions: 1. Insert new entries in the table; 2. Get the latitude and longitude values from the hash table for a given ZIP code. 3. Read the CSV file and insert all the entries into the hash table using the ZIP codes as keys; 4. Keys are strings, thus they must be prehashed so they can be represented as nonnegative integers. For prehashing, you can use the hashCode method for strings; 5. Hash the (already prehashed) keys using either division or multiplication method; 6. Collisions must be resolved by separate chaining; 7. Print out the load factor α; 8. Measure and plot the search times for 1000 to 30000 ZIP codes in incre- ments of 1000. 9. Print out the average length of the linked lists µl = M∑ i=0 length(ti) M 1The file is available on http://my.fit.edu/~hbarbosafilh2011/teaching/zipcodes.csv 2 The distance d must be calculated using the haversine formula. d = R× 2 arcsin (√ sin2 ( φ2 − φ1 2 ) + cos(φ1) cos(φ2) sin 2 ( λ2 − λ1 2 )) where R is the radius of the Earth (3961 miles or 6373 km), φ1 and φ2 are the latitudes of the points 1 and 2 respectively, and λ1 and λ2 are the longitudes of points 1 and 2 respectively. 2 2.1 Extra marks You will receive extra marks for the following functionalities: • Calculate the standard deviation σl of linked lists’ length (5 points); • Plot the histogram with the frequencies of the lists’ lengths using some plotting library for Java (e.g. JFree Chart) (5 points) ; 3 Grading You will be graded based on the following criteria: 1. Correct Implementation (5 points for each functionality in the tasks list). 2. Complete understanding of the data structures - both the hash table and the linked list -, their operations and respective implementations (20 points); 3. Quality of the hash function based on its hashing uniformity (10 points); 4. Execution (10 points) 5. Originality (5 points) 6. Comments (5 points) 7. Additional functionalities (5 points) Additionally, the best unique hash function will receive 5 more extra marks. The function must show good uniformity and must be unique. If two or more students end up with the same best function, the marks will go to the second best, and so on. 2http://en.wikipedia.org/wiki/Haversine_formula 3