CS 142: Program design and methodology
Spring Term, 2016
Lab 8
In this lab, you will use some of the data structures we’ve been learning about to analyze text.
Text analysis
Begin by creating a new project with a class called TextAnalysis. This class should have an attribute words
that is a Set and the constructor should initialize the words reference with a TreeSet.
(You can assign an TreeSet to a Set reference because TreeSet is a subclass of Set.) Then write a method
addWord that takes a string and adds it to the set.
To run this, create a main method that creates a TextAnalysis object and reads the contents of a file into
it:
public static void main(String[] args) throws FileNotFoundException {
Scanner input = new Scanner(new File("names.txt"));
TextAnalysis ta = new TextAnalysis();
while(input.hasNext()) {
ta.addWord(input.next());
}
}
You’ll need some import statements, which Eclipse will helpfully suggest. The throws clause is needed
because the File constructor will throw a FileNotFoundException if the file name is invalid and this
exception is of a type that must be either caught (in a try-catch block) or declared to be thrown.
For this to work, you need a file called names.txt for the program to open. Create an “Untitled text file”
using this option in the File→New menu. Put a couple of words in this file and then save it with the name
names.txt.
Once you’ve done that, check that you can compile and run the program. It has no output; to fix this, add
a method size that takes no arguments and returns the size of words (which has a method of the same
name...). Then have main call this method and print the result. Test your program by adding more words to
the input file and verifying that the output changes appropriately. (It should count the number of distinct
words since a Set won’t add duplicates.)
Use an iterator to write a method printWords that takes no arguments and prints each word in the set.
Recall the general structure of a loop that uses an iterator:
Iterator it = words.iterator();
while(it.hasNext()) {
String word = it.next();
...
}
Call this method from main to verify that the set contents are as you expect.
One problem with using this program for textual analysis is that it treats punctuation adjacent to a word as
part of the word itself. Thus, “Hi”, “Hi!”, and “Hi.” are stored as separate words. It also treats capitalization
as important so “hi” and “Hi” are stored as distinct words. (Verify these claims by changing your input.)
To fix these issues, we will convert all the words to lower case and remove punctuation at the beginning or
end of the word before storing it. To convert to lower case, use the String method toLowerCase, which
returns a version of the string with all the characters converted to lower case.
More sophisticated is to remove punctuation. The next method of Scanner will read in until the next
whitespace (space, tab, end of line, etc). Thus, words will include punctuation. Fix this by removing
punctuation characters from the beginning and end of each word before trying to add it to word. To identify
punctuation, use the function Character.isLetterOrDigit, which takes a char and returns whether it is a
letter or digit. Use this function in loops to find the first and last letter/digit in the word and then use the
substring method to extract the part of the word between these characters. (Be sure to handle the case
where the entire word is punctuation, in which case nothing should be added to the Set.)
Once you’ve thoroughly tested the new functionality for handling punctuation and capitalization differences,
it’s time to convert your program into one that counts the number of occurrences of each word. Do this by
converting words into a TreeMap from String to Integer. The Integer should be the number of times that
word has appeared. (When you try to add a word, use the Map method containsKey to see if it’s already
been seen. If not, add it with value 1. Otherwise, get its current value and increment it by one. You can
find the functions supported by Map by googling “Java API Map”.)
Once you’re confident that you can count the occurrences in a text, use a larger data set. I suggest down-
loading a book from Project Gutenberg (http://www.gutenberg.org); one I often use is the complete works
of Shakespeare (https://www.gutenberg.org/ebooks/100 (select “More files” and take the .txt version)).
To see if this is working, you’ll want to print just some of the words found.
Some text analyses rely on the relative frequency of stop words, words whose purpose is syntactic rather than
related to content. For example, how often do I use “also” or “moreover”? This can be used to hypothesize
about document authorship. (Ask Prof. Dooley about this; he and a student performed a more sophisticated
version of this sort of analysis on some disputed texts.)
Another direction is to look at the frequency of short phrases. For example, we could consider the frequency
of bigrams, pairs of consecutive words. For example, the sentence “hi there class” has bigrams “hi there”
and “there class”. Bigrams are used in cryptography and spelling correction. Write a class that reads words
from a Scanner and gives the text’s bigrams one at a time; replace the program’s Scanner with an object of
this type. (Your class should have a constructor that takes a Scanner. It will also need an attribute to store
the last word it read. Its next method should read from the Scanner and use the result plus its stored word
to create a bigram to return. (Return the bigram as a String with the two words separated by a space;
do this after converting them to lowercase and removing punctuation from the beginning and end of each
word.) You’ll also need a method hasNext that tells if there are any more bigrams.)
If you have more time, see if you can convert your class into a general n-gram counter; an n-gram is composed
of n consecutive words.