Java程序辅导

C C++ Java Python Processing编程在线培训程序编写软件开发视频讲解

QQ：2653320439 微信：ittutor Email：itutor@qq.com

CS 142: Program design and methodology Spring Term, 2016 Lab 8 In this lab, you will use some of the data structures we’ve been learning about to analyze text. Text analysis Begin by creating a new project with a class called TextAnalysis. This class should have an attribute words that is a Set and the constructor should initialize the words reference with a TreeSet. (You can assign an TreeSet to a Set reference because TreeSet is a subclass of Set.) Then write a method addWord that takes a string and adds it to the set. To run this, create a main method that creates a TextAnalysis object and reads the contents of a file into it: public static void main(String[] args) throws FileNotFoundException { Scanner input = new Scanner(new File("names.txt")); TextAnalysis ta = new TextAnalysis(); while(input.hasNext()) { ta.addWord(input.next()); } } You’ll need some import statements, which Eclipse will helpfully suggest. The throws clause is needed because the File constructor will throw a FileNotFoundException if the file name is invalid and this exception is of a type that must be either caught (in a try-catch block) or declared to be thrown. For this to work, you need a file called names.txt for the program to open. Create an “Untitled text file” using this option in the File→New menu. Put a couple of words in this file and then save it with the name names.txt. Once you’ve done that, check that you can compile and run the program. It has no output; to fix this, add a method size that takes no arguments and returns the size of words (which has a method of the same name...). Then have main call this method and print the result. Test your program by adding more words to the input file and verifying that the output changes appropriately. (It should count the number of distinct words since a Set won’t add duplicates.) Use an iterator to write a method printWords that takes no arguments and prints each word in the set. Recall the general structure of a loop that uses an iterator: Iterator it = words.iterator(); while(it.hasNext()) { String word = it.next(); ... } Call this method from main to verify that the set contents are as you expect. One problem with using this program for textual analysis is that it treats punctuation adjacent to a word as part of the word itself. Thus, “Hi”, “Hi!”, and “Hi.” are stored as separate words. It also treats capitalization as important so “hi” and “Hi” are stored as distinct words. (Verify these claims by changing your input.) To fix these issues, we will convert all the words to lower case and remove punctuation at the beginning or end of the word before storing it. To convert to lower case, use the String method toLowerCase, which returns a version of the string with all the characters converted to lower case. More sophisticated is to remove punctuation. The next method of Scanner will read in until the next whitespace (space, tab, end of line, etc). Thus, words will include punctuation. Fix this by removing punctuation characters from the beginning and end of each word before trying to add it to word. To identify punctuation, use the function Character.isLetterOrDigit, which takes a char and returns whether it is a letter or digit. Use this function in loops to find the first and last letter/digit in the word and then use the substring method to extract the part of the word between these characters. (Be sure to handle the case where the entire word is punctuation, in which case nothing should be added to the Set.) Once you’ve thoroughly tested the new functionality for handling punctuation and capitalization differences, it’s time to convert your program into one that counts the number of occurrences of each word. Do this by converting words into a TreeMap from String to Integer. The Integer should be the number of times that word has appeared. (When you try to add a word, use the Map method containsKey to see if it’s already been seen. If not, add it with value 1. Otherwise, get its current value and increment it by one. You can find the functions supported by Map by googling “Java API Map”.) Once you’re confident that you can count the occurrences in a text, use a larger data set. I suggest down- loading a book from Project Gutenberg (http://www.gutenberg.org); one I often use is the complete works of Shakespeare (https://www.gutenberg.org/ebooks/100 (select “More files” and take the .txt version)). To see if this is working, you’ll want to print just some of the words found. Some text analyses rely on the relative frequency of stop words, words whose purpose is syntactic rather than related to content. For example, how often do I use “also” or “moreover”? This can be used to hypothesize about document authorship. (Ask Prof. Dooley about this; he and a student performed a more sophisticated version of this sort of analysis on some disputed texts.) Another direction is to look at the frequency of short phrases. For example, we could consider the frequency of bigrams, pairs of consecutive words. For example, the sentence “hi there class” has bigrams “hi there” and “there class”. Bigrams are used in cryptography and spelling correction. Write a class that reads words from a Scanner and gives the text’s bigrams one at a time; replace the program’s Scanner with an object of this type. (Your class should have a constructor that takes a Scanner. It will also need an attribute to store the last word it read. Its next method should read from the Scanner and use the result plus its stored word to create a bigram to return. (Return the bigram as a String with the two words separated by a space; do this after converting them to lowercase and removing punctuation from the beginning and end of each word.) You’ll also need a method hasNext that tells if there are any more bigrams.) If you have more time, see if you can convert your class into a general n-gram counter; an n-gram is composed of n consecutive words.