Lab 1: Text Processing in Linux; N-grams
Massimo Poesio
30 September 2003
Linux machines come with a lot of software that can be used to manipulate
text. The first goal of this lab is to familiarize yourselves with some of
this software. In the second part of the lab, we will look at how n-grams can be
counted, and how these counts can be used to estimate probabilities.
1 Piping the output of one command into the
input of a second one
One example of the useful software that comes with Unix is tr. tr is a
TRanslation program - it takes two arguments, a set of characters to match and
a set of replacement characters, and replaces every character in its input that
occurs in the first set with the corresponding character from the second set. Get
the file HLT_data_mining.txt from the web page, in the folder data, and type:
cat HLT_data_mining.txt | tr "A-Z" "a-z"
As you can see, tr used in this way converts all uppercase characters in its input
into lowercase ones.
An important concept used in this first example is that of a pipeline - a series
of processes, each of which takes some text as input, does some work with it,
and outputs something that can in turn be used by another process. What you
typed in is actually two programs: cat HLT_data_mining.txt, which reads a
file and prints it out; and tr, which reads its STANDARD INPUT and outputs
a modified version of what it reads. A lot of UNIX software can be used in
this way; UNIX makes this easy to do because all shells - the programs that
read your input from the terminal windows - understand commands like the one
above, using the special symbol |:
program1 | program2 | program3
What this means is: take the output of program1, and feed it to program2 as
input; the output of program2 will in turn serve as the input to program3. A
lot of text processing can be done in UNIX simply by creating a pipeline which
connects two or more commands.
More generally, tr can be used to perform all sorts of transformations on
its input. Another popular use of tr is to find the words contained in a text,
especially in combination with uniq (see below). This is done using the following
magic incantation:
cat HLT_data_mining.txt | tr -sc 'A-Za-z' '\012'
Exercise: Try this, and then try to understand the result you get by looking
at the tr online manual page, which, as you know, can be found by typing man
tr in a terminal window.
A third example of a pipeline is the following, in which the output of the
lowercasing pipeline from the first example is fed to the perl interpreter. (Make
sure that everything is entered as part of a single line, or the pipeline will be
‘broken’!)
cat HLT_data_mining.txt | tr "A-Z" "a-z" | perl -ne'$_ =~ s/\s+/\n/g; print $_;'
Perl, discussed below, is an extremely powerful language for text manipulation,
which allows you to specify regular expressions. In this example, perl is used to
perform a tokenization of its input: the -ne option means “perform the specified
command on every line of input”. What the command in quotes says is: print
the current line ($_) after replacing every sequence of white space (recall that the
metacharacter \s means ‘any white space’, and + means ‘1 or more repetitions’)
with a newline character. The result of this pipeline should be a file with a
separate word on each line. (Notice how many of these ‘tokens’ are not really
words you’d find in a lexicon ... )
Exercise: The ‘tokenizer’ above is slightly less crude than what we did with
tr before, but still pretty basic. For example, in the output of this pipeline
punctuation is still attached to the preceding word (check out for example the
line with ‘machines’). Can you think of a way of fixing this? (Hint: just add
one more step to the pipeline ... )
Another useful UNIX program is sort, which can sort its input according
to different criteria, and output the result of this sort. Try to add sort at the
‘end’ of the pipelines above - first after just tr, then after the ‘tokenizer’. E.g.,
type:
cat HLT_data_mining.txt | tr "A-Z" "a-z" | perl -ne'$_ =~ s/\s+/\n/g; print $_;' | sort
(Make sure you also try to add sort after the ‘punctuation remover’ that you
developed as part of the exercise above.)
And finally, we can remove duplicate lines, using another Unix standard
command, uniq:
cat HLT_data_mining.txt | tr "A-Z" "a-z" | perl -ne'$_ =~ s/\s+/\n/g; print $_;' | perl -ne'$_ =~ s/([\.:\;\,\?\!\"])/\n$1/g; print $_;' | sort | uniq
uniq can also be used to count the number of instances of each token that we
found - try to use uniq -c instead of uniq at the end of the pipeline above.
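For example, adding counting to the pipeline we just built gives the following (it is the same pipeline as above, with the -c option added to uniq):
cat HLT_data_mining.txt | tr "A-Z" "a-z" | perl -ne'$_ =~ s/\s+/\n/g; print $_;' | perl -ne'$_ =~ s/([\.:\;\,\?\!\"])/\n$1/g; print $_;' | sort | uniq -c
Each line of the output now begins with the number of times the token on that line occurs in the file.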
Another command that can be used to count things is wc. Suppose you
want to count how many files you have in the current directory. You could
then run the ls -l command, which outputs a list of all the files in the current
directory, one per line (ls is similar to, but more powerful than, the dir program
available under DOS), and then count how many lines there are in the output
of the program using the command wc -l (note that ls -l also prints a ‘total’
header line, so the count will be one more than the number of files):
ls -l | wc -l
A slightly more sophisticated version of this is a program that counts all the
JAVA files in the current directory - i.e., all the files with suffix .java. This can
be done by adding a FILTER program in between ls and wc in the pipeline
we just saw - i.e., a program that selects some of the lines output by ls. One
program that can be used to do this is egrep. egrep, further discussed below,
takes as input a REGULAR EXPRESSION like those discussed in class, and
outputs the lines of its input that match that expression:
ls -l | egrep '\.java' | wc -l
Here is a short list of some of the Unix commands that are most useful for
text processing:
cat This command simply outputs the contents of the file, and is useful to ‘start’
pipelines, as seen in the example above - i.e., to feed input to commands
like tr that do not take their input from files, but from the standard
input.
egrep, grep For search. See below.
perl A very general text processor. See below.
sort Sorts lines in alphabetical order.
tr Performs various transformations on its input.
uniq Removes duplicate lines (if they occur one after the other).
wc Word Count. It outputs 3 figures: the number of lines, words, and bytes in
a file (or the standard input).
To learn more about a command in UNIX, use the online manual, which can be
read with the man command - for example, to learn more about sort, type:
man sort
The perl interpreter has particularly extensive online documentation - better
than most manuals.
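For example, if perldoc is installed (it usually comes with perl itself), the following commands display the top-level Perl manual, the documentation of the command-line switches such as -n and -e, and the regular expression documentation, respectively:
perldoc perl
perldoc perlrun
perldoc perlre
Where perldoc is not available, man perlrun and man perlre usually work as well.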
2 Search commands: grep and egrep
There are many tools to search for patterns in UNIX; the simplest and most
popular are grep and egrep. Both can search for patterns either in a list of
files, using the syntax:
egrep PATTERN FILE1 FILE2
where PATTERN is a regular expression like those discussed in class, or from
the standard input - useful if you want to use them in a pipeline. For example,
if you want to see how many files with suffix ‘.txt’ you have in the current
directory, you can use the following pipeline, which first of all lists the contents
of the present directory in ‘long’ format using the command ls (similar to dir
in Windows), and then counts how many lines there are in the output of egrep
using the command wc -l:
ls -l | egrep '\.txt' | wc -l
Notice that egrep is used here as a ‘filter’. For many more examples of use
of grep and related commands, check out the very useful online introduction
written by B. Dowling:
http://www-uxsup.csx.cam.ac.uk/courses/Text/chap0.ps
Exercise: Using egrep, search for lines containing either ‘text mining’ or
‘data mining’ in the file HLT_data_mining.txt.
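One possible solution, just as a sketch (note that the | inside the pattern is egrep's alternation operator, not the shell's pipe symbol):
egrep 'text mining|data mining' HLT_data_mining.txt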
3 Perl
Perl, which we already saw in use in an earlier example, has excellent online manuals:
check out
man perl
so we won’t say much about it here, except that there are three basic ways of
using it. If all you want is to execute a Perl command on all lines of a file, you
can do what we did in the examples above. If you want to do something more
complex, you can write a program in Perl, save it in a file (say, with suffix ‘.pl’
or ‘.perl’) and then invoke it as follows:
perl FILE.pl
where FILE is the program that you just wrote. For example, the following
highly complex Perl program prints out “Hello, world!”:
# hello.pl
# A complex Perl program
print "Hello, world!\n";
Save it in a file - say, ‘hello.pl’ - and then try to execute it as described above.
Alternatively, you can put the following in the first line:
#!/usr/local/bin/perl
This is an instruction to the shell to treat this file as an executable program (it’s
important this is the FIRST line!). Once you have made this change, AND given
yourself and others permission to execute the file, as follows:
chmod ugo+x hello.pl
you can then execute the program by simply typing its name on the command
line (you may need to type ./hello.pl if the current directory is not in your path):
hello.pl
Exercise: Write a perl program that does what the simple ‘tokenizer’ did in
a previous example, and can be executed from the command line. Put this
command in a pipeline in place of the calls to perl in the examples above. Hint:
in order to execute Perl commands on each line of input, include them in the
following while loop:
while (<>) {
.... PUT YOUR COMMANDS HERE
}
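One possible solution, as a sketch (the file name tokenize.pl and the exact set of punctuation characters are my own choices; your version may well differ):
#!/usr/local/bin/perl
# tokenize.pl - a simple tokenizer (sketch)
# Reads its standard input (or any files named on the command line),
# puts one token per line, and detaches trailing punctuation.
while (<>) {
    $_ =~ s/\s+/\n/g;                   # one token per line
    $_ =~ s/([\.:\;\,\?\!\"])/\n$1/g;   # detach punctuation
    print $_;
}
After saving the file and making it executable with chmod ugo+x tokenize.pl, you can use it in a pipeline, e.g. cat HLT_data_mining.txt | tr "A-Z" "a-z" | ./tokenize.pl | sort | uniq.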
4 Shell scripts
Shell scripts are the UNIX equivalent of ‘batch programs’ in Windows. Shell
scripts are particularly useful for combining commands like the ones we have seen
in this lab into a single command. Instead of typing the sequence of commands
from the examples above every time, we can write a shell script that does it for
us (the shell script below, like all the other examples below, can be found on the
web site, in the folder code):
#!/bin/sh
# extract_tokens.sh
# Extract tokens from the file HLT_data_mining.txt
cat HLT_data_mining.txt | \
tr "A-Z" "a-z" | \
perl -ne'$_ =~ s/\s+/\n/g; print $_;' | \
perl -ne'$_ =~ s/([\.:\;\,\?\!\"])/\n$1/g; print $_;' | \
sort | \
uniq
Notice a few key points about this program. First of all, the first line uses the
same method discussed above when talking about Perl to tell the shell that this
is an executable program - except that the interpreter this time is the Bourne
Shell, /bin/sh:
#!/bin/sh
(Sometimes this doesn’t work because of the way the shell is set up - in these
cases, just pass the script as an argument to the shell.) Secondly, notice that
in order to put each command in the pipeline on a separate line, I added a
backslash (‘\’) at the end of each line.
This shell script is not very useful: it can only be used to extract tokens
from the file HLT_data_mining.txt. A more useful script would take the name
of the file from the command line. The following script illustrates the use of the
for command of the Bourne shell to loop over all the arguments, as well as the
use of variables:
#!/bin/sh
# et_uniq.sh
# Extract tokens from files in the command line
# Usage: et_uniq.sh FILE FILE
for arg
do
cat $arg | \
tr "A-Z" "a-z" | \
perl -ne'$_ =~ s/\s+/\n/g; print $_;' | \
perl -ne'$_ =~ s/([\.:\;\,\?\!\"])/\n$1/g; print $_;' | \
sort | \
uniq
done
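Assuming you have saved this script as et_uniq.sh and made it executable in the same way as the Perl script earlier, you can then run it on one or more files (the second file name here is just a placeholder):
chmod ugo+x et_uniq.sh
./et_uniq.sh HLT_data_mining.txt some_other_file.txt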
5 Counting words using standard UNIX tools
Now that we have looked at tokenization, let us move on to what we are mostly
interested in: counting. Again, you can do quite a lot of counting using only
standard Unix.¹ In order to do this, you need to:
1. Tokenize the text,
2. Sort it,
3. Count duplicates.
¹ A lot of the ideas in the following sections come from Ken Church’s Unix for Poets,
available on the Web.
We have seen before how to write a tokenizer in Perl. This can be done a
bit more simply (although not as precisely) with tr, by identifying the strings
consisting only of alphabetic characters and putting each of them on a separate
line, using the following idiom:
tr -sc 'A-Za-z' '\012'
This command TRanslates every character NOT in the set ’A-Za-z’ into a
newline, and Squeezes extra blank lines.
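If you want to see what this does before running it on a whole file, you can feed it a short piece of text directly (the sentence here is just an illustration):
echo "This, too, is a test." | tr -sc 'A-Za-z' '\012'
Every non-alphabetic character (the commas, the spaces, the full stop and the final newline) is translated into a newline, and the -s option squeezes the resulting runs of newlines into one, so each word ends up on a line of its own.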
Exercise: Earlier on you should have copied the file HLT_data_mining.txt
into your own directory. (If you lost it, it’s on the web page, under
data/HLT_data_mining.txt; copy it now.) Run tr with the options just seen on
your file, and look at the
output.
You can then pipe the output of this call to tr to sort, and finally you can
use uniq -c to count the tokens.
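Putting the three steps together, the whole pipeline (typed on a single line) is:
cat HLT_data_mining.txt | tr -sc 'A-Za-z' '\012' | sort | uniq -c
The output is one line per token type, each preceded by its count.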
Exercise: Run the whole pipeline on HLT_data_mining.txt, save the output
in a file, and look at the counts. Which word has the highest count?
Exercise: You will have noticed that the pipeline we used here suffers from
having too basic a tokenizer. For example, you get separate counts for The and
the. Modify the pipeline to use the tokenizer we developed previously.
6 Counting bigrams using standard Unix tools
Bigram counts, as well, can be obtained using just standard Unix tools (including
paste and tail). The basic algorithm for doing this is:
1. tokenize (as above)
2. print word_i and word_i+1 on the same line
3. count
We have now seen many ways of doing the first step. The second can be done in
two substeps: first create an ‘offset’ file using tail - a file which contains all but
the first word of the original file. Then use paste to create a third file whose
lines contain one word from the original file and one from the second one. In
summary,
tr -sc 'A-Za-z' '\012' < HLT_data_mining.txt > HLT_dm_tokens.txt
tail +2 HLT_dm_tokens.txt > HLT_dm_nexttokens.txt
paste HLT_dm_tokens.txt HLT_dm_nexttokens.txt > HLT_dm_bigrams.txt
cat HLT_dm_bigrams.txt | sort | uniq -c
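If you want to see the most frequent bigrams first, you can add a reverse numeric sort (and head, to look only at the top of the list); this is a small extension of the recipe above:
cat HLT_dm_bigrams.txt | sort | uniq -c | sort -rn | head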
7 Estimating probabilities and smoothing
To get a simple idea of how these unigrams and bigrams are used to build
language models, look at the two perl scripts MP_unigrams.pl and MP_bigrams.pl
(in the code folder). These two scripts expect as input lists of unigram and
bigram tokens as produced by the methods above, and estimate unigram and
bigram probabilities using MLE. In order to exemplify smoothing, two
additional tokens are considered. (In the real world, of course, we would use a
dictionary to find missing n-grams.)
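To make the computation concrete, here is a minimal sketch of unigram MLE estimation as a pipeline with a Perl one-liner. It is only an illustration of the idea, not the MP_unigrams.pl script itself; it reads the token file produced in the bigram example above and prints each word together with its relative frequency c(w)/N:
sort HLT_dm_tokens.txt | uniq -c | perl -ane '
  $count{$F[1]} += $F[0]; $N += $F[0];   # $F[0] is the count, $F[1] the word
  END { printf("%s %.6f\n", $_, $count{$_} / $N) for sort keys %count }'
Bigram probabilities are estimated in the same spirit, except that each bigram count is divided by the count of its first word rather than by the total number of tokens.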
8 Using a more realistic corpus
Of course, the file you have been using so far cannot really be considered a
realistic corpus! These days, the corpora used in NLE applications are
at least 100M in size. Working with files of this size introduces lots of additional
problems, however. So we’ll do something simpler, but one that should still give
you a feeling for what a ’real’ estimation exercise looks like.
Get the files training.txt and test.txt from the web page, directory
data. The two files contain the text of all the Jane Austen novels (from
Manning and Schuetze’s companion web site to their textbook, Foundations of
Statistical NLP).
Exercise: Use the programs discussed earlier to estimate unigram and bigram
probabilities in this corpus.
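As a possible starting point (the output file names are just suggestions), you could produce the unigram and bigram token lists for the training data in the same way as in Section 6, and then feed them to the estimation scripts:
tr -sc 'A-Za-z' '\012' < training.txt > austen_tokens.txt
tail +2 austen_tokens.txt > austen_nexttokens.txt
paste austen_tokens.txt austen_nexttokens.txt > austen_bigrams.txt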