Lab 1: Text Processing in Linux; N-grams
Massimo Poesio
30 September 2003
Linux machines come with a lot of software that can be used to manipulate
text. The first goal of this lab is to get yourselves familiarized with some of
this software. In the second part of the lab, we will look at how n-grams can be
counted, and how these counts can be used to estimate probabilities.
1 Piping the output of one command into the
input of a second one
One example of the useful software that comes with Unix is tr. tr is a
TRanslation program: it takes two arguments, a set of characters and a replacement set,
and replaces every character of the input that belongs to the first set with the
corresponding character of the second. Get the file HLT_data_mining.txt from the web
page, in the folder data, and type:
cat HLT_data_mining.txt | tr "A-Z" "a-z"
As you can see, tr used in this way converts all uppercase characters in its input
into lowercase ones.
An important concept used in this first example is that of a pipeline - a series
of processes each of which takes some text for input, does some work with it,
and outputs something that can be in turn used by another process. What you
typed in is actually two programs: cat HLT data mining.txt, which reads a
file and prints it out; and tr, which reads its STANDARD INPUT and outputs
a modified version of what it reads. A lot of UNIX software can be used in
this way; UNIX makes this easy to do because all shells - the programs that
read your input from the terminal windows - understand commands like the one
above, using the special symbol |:
program1 | program2 | program3
What this means is: take the output of program1, and feed it to program2 as
input; the output of program2 will in turn serve as the input to program3. A
lot of text processing can be done in UNIX simply by creating a pipeline which
connects two or more commands.
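As a concrete instance of this pattern (assuming you still have HLT_data_mining.txt in the
current directory, and using the wc command described further below to count lines), you
might try:
cat HLT_data_mining.txt | tr "A-Z" "a-z" | wc -l
which converts the file to lowercase and then counts how many lines the result contains.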
More generally, tr can be used to perform all sorts of transformations on
its input. Another popular use of tr is to find the words contained in a text,
especially in combination with uniq (see below). This is done using the following
magic incantation:
cat HLT_data_mining.txt | tr -sc 'A-Za-z' '\012'
Exercise: Try this, and then try to understand the result you get by looking
at the tr online manual page, which, as you know, can be found by typing man
tr in a terminal window.
A third example of pipeline is the following, in which the output of the
previous pipeline is fed to the perl interpreter. (Make sure that everything is
entered as part of a single line, else the pipeline will be 'broken'!)
cat HLT_data_mining.txt | tr "A-Z" "a-z" | perl -ne '$_ =~ s/\s+/\n/g; print $_;'
Perl, discussed below, is an extremely powerful language for text manipulation,
which allows you to specify regular expressions. In this example, perl is used to
perform a tokenization of its input: the -ne option means “perform the specified
command on every line of input”. What the command in quotes says is: print
the current line ($_) after replacing every sequence of white space (recall that the
metacharacter \s means ‘any white space’, and + means ‘1 or more repetitions’)
with a newline character. The result of this pipeline should be a file with a
separate word on each line. (Notice how many of these ‘tokens’ are not really
words you’d find in a lexicon ... )
Exercise: The ‘tokenizer’ above is slightly less crude than what we did with
tr before, but still pretty basic. For example, in the output of this pipeline
punctuation is still attached to the preceding word (check out for example the
line with ‘machines’). Can you think of a way of fixing this? (Hint: just add
one more step to the pipeline ... )
Another useful UNIX program is sort, which can sort its input according
to different criteria, and output the result of this sort. Try to add sort at the
‘end’ of the pipelines above - first after just tr, then after the ‘tokenizer’. E.g.,
type:
cat HLT_data_mining.txt | tr "A-Z" "a-z" | perl -ne '$_ =~ s/\s+/\n/g; print $_;' | sort
(Make sure you also try to add sort after the ‘punctuation remover’ that you
developed as part of the exercise above.)
And finally, we can remove duplicate lines, using another Unix standard
command, uniq:
cat HLT_data_mining.txt | tr "A-Z" "a-z" | perl -ne '$_ =~ s/\s+/\n/g; print $_;' |
perl -ne '$_ =~ s/([\.:\;\,\?\!\"])/\n$1/g; print $_;' | sort | uniq
uniq can also be used to count the number of instances of each token that we
found - try to use uniq -c instead of uniq at the end of the pipeline above.
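In case it helps, here is what the complete counting pipeline might look like; the final
sort -rn is an optional extra (not part of the exercise) that sorts the counts numerically
in decreasing order, so that the most frequent tokens appear first:
cat HLT_data_mining.txt | tr "A-Z" "a-z" | perl -ne '$_ =~ s/\s+/\n/g; print $_;' |
sort | uniq -c | sort -rn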
Another command that can be used to count things is wc. Suppose you
want to count how many files you have in the current directory. You could
then run the ls -l command, which outputs a list of all the files in the current
directory, one per line (ls is similar to, but more powerful than, the dir program
available under DOS), and then count how many lines there are in the output
of the program using the command wc -l:
ls -l | wc -l
A slightly more sophisticated version of this is a program that counts all the
JAVA files in the current directory - i.e., all the files with suffix .java. This can
be done by adding a FILTER program in between ls and wc in the pipeline
we just saw - i.e., a program that selects some of the lines output by ls. One
program that can be used to do this is egrep. egrep, further discussed below,
takes as input a REGULAR EXPRESSION like those discussed in class, and
outputs the lines of input that contain words that match that expression:
ls -l | egrep java | wc -l
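Note that the pattern java used here is deliberately loose: it would also match, say, a file
called java_notes.txt. If you want a stricter filter, you can require the suffix at the end of
the line, escaping the dot so that it is matched literally:
ls -l | egrep '\.java$' | wc -l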
Here is a short list of some of the Unix commands that are most useful for
text processing:
cat This command simply outputs the contents of the file, and is useful to ‘start’
pipelines, as seen in the example above - i.e., to feed input to commands
like tr that do not read their input from files, but from the standard
input.
egrep, grep For search. See below.
perl A very general text processor. See below.
sort Sorts lines in alphabetical order.
tr Performs various transformations on its input.
uniq Removes duplicate lines (if they occur one after the other).
wc Word Count. It outputs 3 figures: the number of lines, words, and bytes in
a file (or the standard input).
To learn more about a command in UNIX, use the online manual, which can be
read with the man command - for example, to learn more about sort, type:
man sort
The perl interpreter has particularly extensive online documentation - better
than most manuals (see, for example, man perlrun for the command-line switches such as
-n and -e used above).
2 Search commands: grep and egrep
There are many tools to search for patterns in UNIX; the simplest and most
popular are grep and egrep. Both can search for patterns either in a list of
files, using the syntax:
egrep PATTERN FILE1 FILE2
where PATTERN is a regular expression like those discussed in class, or from
the standard input - useful if you want to use them in a pipeline. For example,
if you want to see how many files with suffix ‘.txt’ you have in the current
directory, you can use the following pipeline, which first of all lists the contents
of the present directory in ‘long’ format using the command ls (similar to dir
in Windows), and then counts how many lines there are in the output of egrep
using the command wc -l:
ls -l | egrep ’\.txt’ | wc -l
Notice that egrep is used here as a ‘filter’. For many more examples of use
of grep and related commands, check out the very useful online introduction
written by B. Dowling:
http://www-uxsup.csx.cam.ac.uk/courses/Text/chap0.ps
Exercise: Using egrep, search for lines containing either 'text mining' or
'data mining' in the file HLT_data_mining.txt.
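Hint, in case you get stuck: egrep patterns may contain the alternation operator |, which
matches either of the expressions around it. For instance, the following command (with a
made-up file name) prints the lines of somefile.txt containing either cat or dog; remember
to quote the pattern so that the shell does not treat the | as a pipe symbol:
egrep 'cat|dog' somefile.txt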
3 Perl
Perl, of which we saw an example of use earlier, has excellent online manuals:
check out
man perl
so we won’t say much about it here, except that there are three basic ways of
using it. If all you want is to execute a Perl command on all lines of a file, you
can do what we did in the examples above. If you want to do something more
complex, you can write a program in Perl, save it in a file (say, with suffix ‘.pl’
or ‘.perl’) and then invoke it as follows:
perl FILE.pl
where FILE is the program that you just wrote. For example, the following
highly complex Perl program prints out “Hello, world!”:
# hello.pl
# A complex Perl program
print "Hello, world!\n";
Save it in a file - say, 'hello.pl' - and then try to execute it as described above. Alternatively,
you can put the following in the first line:
#!/usr/local/bin/perl
This is an instruction to the shell to treat this file as an executable program (it's
important that this is the FIRST line! Note also that on some systems the perl interpreter
lives at /usr/bin/perl rather than /usr/local/bin/perl; adjust the path if necessary).
Once you have made this change AND made the file executable by yourself and others, as follows:
chmod ugo+x hello.pl
You can then execute the program by simply typing its name on the command
line, i.e.,:
hello.pl
Exercise: Write a perl program that does what the simple 'tokenizer' did in
a previous example, and can be executed from the command line. Put this command
in a pipeline in place of the calls to perl in the examples above. Hint:
in order to execute Perl commands on each line of input, include them in the
following while loop (one possible solution is sketched after the loop):
while (<>) {
    # ... PUT YOUR COMMANDS HERE; each input line is available in $_ ...
}
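One possible solution is the following sketch (saved here under the made-up name
tokenize.pl; it simply combines the two substitutions used in the pipelines above):
#!/usr/local/bin/perl
# tokenize.pl - a very simple tokenizer (sketch)
while (<>) {
    $_ =~ s/\s+/\n/g;                    # put each whitespace-separated token on its own line
    $_ =~ s/([\.:\;\,\?\!\"])/\n$1/g;    # split off punctuation onto a separate line
    print $_;
}
After the chmod step described earlier, it can be used in a pipeline, e.g.
cat HLT_data_mining.txt | tr "A-Z" "a-z" | ./tokenize.pl | sort | uniq.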
4 Shell scripts
Shell scripts are the UNIX equivalent of ‘batch programs’ in Windows. Shell
scripts are particularly useful to combine commands like the one we have seen
in this lab into a single command. Instead of typing every time the sequence of
commands we were trying in the examples above, we could write a shell script
that does that (the shell script below, as all the other examples below, can be
found in the web site, in the folder code):
#!/bin/sh
# extract_tokens.sh
# Extract tokens from the file HLT_data_mining.txt
cat HLT_data_mining.txt | \
tr "A-Z" "a-z" | \
perl -ne '$_ =~ s/\s+/\n/g; print $_;' | \
perl -ne '$_ =~ s/([\.:\;\,\?\!\"])/\n$1/g; print $_;' | \
sort | \
uniq
Notice a few key points about this program. First of all, the first line uses the
same method discussed above when talking about Perl to tell the shell that this
is an executable program - except that the interpreter this time is the Bourne
Shell, /bin/sh:
#!/bin/sh
(Sometimes this doesn’t work because of the way the shell is set up - in these
cases, just pass the script as an argument to the shell.) Secondly, notice that
in order to put each command of the pipeline on a separate line, I added a
backslash ('\') at the end of each line.
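Assuming you have saved the script in a file called extract_tokens.sh, in the same directory
as HLT_data_mining.txt, you could then run it as follows (the name of the output file is
just an example):
chmod u+x extract_tokens.sh
./extract_tokens.sh > HLT_dm_types.txt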
This shell script is not very useful: it can only be used to extract tokens
from the file HLT_data_mining.txt. A more useful script would take the name
of the file from the command line. The following script illustrates the use of the
for command of the Bourne shell to loop over all the arguments, as well as the
use of variables:
#!/bin/sh
# et_uniq.sh
# Extract tokens from files in the command line
# Usage: et_uniq.sh FILE FILE
for arg
do
cat $arg | \
tr "A-Z" "a-z" | \
perl -ne '$_ =~ s/\s+/\n/g; print $_;' | \
perl -ne '$_ =~ s/([\.:\;\,\?\!\"])/\n$1/g; print $_;' | \
sort | \
uniq
done
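One small caveat about this script: the for arg line itself loops over the command-line
arguments correctly, but the unquoted $arg in cat $arg would be split into several words by
the shell if a file name contained spaces. A slightly more defensive sketch of the loop,
otherwise identical to the one above, simply quotes the variable:
for arg
do
cat "$arg" | \
tr "A-Z" "a-z" | \
perl -ne '$_ =~ s/\s+/\n/g; print $_;' | \
perl -ne '$_ =~ s/([\.:\;\,\?\!\"])/\n$1/g; print $_;' | \
sort | \
uniq
done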
5 Counting words using standard UNIX tools
Now that we have looked at tokenization, let us go into what we are mostly interested
in: counting. Again, you can do quite a lot of counting using only standard
Unix.[1] In order to do this, you need to:
1. Tokenize the text,
2. Sort it,
3. Count duplicates.
[1] A lot of the ideas in the following sections come from Ken Church's Unix for Poets,
available on the Web.
We have seen before how to write a tokenizer in Perl. This can be done a
bit more simply (although not as precisely) with tr, identifying the strings
consisting only of alphabetic characters and putting each of them on a separate
line using the following idiom:
tr -sc 'A-Za-z' '\012'
This command TRanslates every character NOT in the set 'A-Za-z' into a
newline (the octal code 012), and Squeezes any resulting run of newlines into a single
one, so that no blank lines are left.
Exercise: Earlier on you should have copied the file HLT_data_mining.txt
into your own directory. (If you lost it, it is on the web page, under data/HLT_data_mining.txt;
copy it now.) Run tr with the options just seen on your file, and look at the
output.
You can then pipe the output of this call to tr to sort, and finally you can
use uniq -c to count the tokens.
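Putting the three steps together (the name of the output file is just a suggestion):
tr -sc 'A-Za-z' '\012' < HLT_data_mining.txt | sort | uniq -c > HLT_dm_counts.txt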
Exercise: Run the whole pipeline on HLT_data_mining.txt, save the output
in a file, and look at the counts. Which word has the highest count?
Exercise: You will have noticed that the pipeline we used here suffers from
having too basic a tokenizer. For example, you get separate counts for The and
the. Modify the pipeline to use the tokenizer we developed previously.
6 Counting bigrams using standard Unix tools
Bigram counts, as well, can be obtained using just standard Unix tools (including
paste and tail). The basic algorithm for doing this is:
1. tokenize (as above)
2. print word_i and word_{i+1} on the same line
3. count
We have now seen many ways of doing the first step. The second can be done in
two substeps: first create an 'offset' file using tail - a file which contains all but
the first word of the original file. Then use paste to create a third file whose
lines contain one word from the original file and one from the second one. In
summary (note that on some systems the second command below needs to be written
tail -n +2):
tr -sc 'A-Za-z' '\012' < HLT_data_mining.txt > HLT_dm_tokens.txt
tail +2 HLT_dm_tokens.txt > HLT_dm_nexttokens.txt
paste HLT_dm_tokens.txt HLT_dm_nexttokens.txt > HLT_dm_bigrams.txt
cat HLT_dm_bigrams.txt | sort | uniq -c
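The same trick extends to trigrams (a sketch, with made-up file names): create one more
offset file that skips the first two words, and paste the three token files together. If plain
tail +2 / tail +3 does not work on your system, use the tail -n +2 / tail -n +3 form instead.
tail +3 HLT_dm_tokens.txt > HLT_dm_nextnexttokens.txt
paste HLT_dm_tokens.txt HLT_dm_nexttokens.txt HLT_dm_nextnexttokens.txt | sort | uniq -c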
7 Estimating probabilities and smoothing
To get a simple idea of how these unigram and bigram counts are used to build language
models, look at the two perl scripts MP_unigrams.pl and MP_bigrams.pl
(in the code folder). These two scripts expect as input lists of unigram and
bigram tokens as produced by the methods above, and estimate unigram and
bigram probabilities using MLE. In order to exemplify smoothing, two additional
tokens are considered. (In the real world, of course, we would use a
dictionary to find missing n-grams.)
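If you would like to experiment before looking at those scripts, here is a rough sketch (it is
not the actual MP_unigrams.pl, and it does no smoothing) of how MLE unigram probabilities
can be computed directly from the uniq -c counts, dividing each count by the total number
of tokens:
sort HLT_dm_tokens.txt | uniq -c | \
perl -ane 'push @rows, [ $F[1], $F[0] ]; $N += $F[0];
           END { printf("%.6f\t%s\n", $$_[1] / $N, $$_[0]) for @rows }'
Conditional bigram probabilities P(word_{i+1} | word_i) can be estimated in the same spirit,
dividing each bigram count by the count of its first word.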
8 Using a more realistic corpus
Of course, the file you have been using so far cannot really be considered a
realistic corpus! These days, the corpora used in NLE applications are
at least 100M in size. Working with files of this size introduces lots of additional
problems, however. So, we'll do something simpler, but one that should still give
you a feeling of what a 'real' estimation exercise looks like.
Get the files training.txt and test.txt from the web page, directory
data. The two files contain the text of all the Jane Austen novels (from
Manning and Schuetze's companion web site to their textbook, Foundations of
Statistical NLP).
Exercise: Use the programs discussed earlier to estimate unigram and bigram
probabilities in this corpus.