CC437 - Lab 2
NLE in Linux
Massimo Poesio
October 26th, 2004

1 Objectives of this Lab

Linux comes with a lot of software for manipulating text. The purpose of this lab is to get you acquainted with Linux, and to introduce the software that can be used for NL tasks.

2 A Quick Introduction to Linux

The notes in this section are meant as a quick introduction to the Linux environment - more precisely, to the KDE window interface to the Suse Linux operating system. You should check that you know how to do all that follows; if you do, continue to the next section. For a more extensive introduction, you should look at the manual pages (see below).

2.1 Logging In

In order to log in, you need to type in your username and password just as you would on a Windows machine - the only difference is that Linux doesn't print out anything when you type in your password (unlike Windows, which prints out asterisks). After you log in, your screen should look a bit like Figure 1.

Exercise 1: Make sure you can log in!!

[Figure 1: The Initial Screen]

Much as on a Windows machine, in the KDE window interface you have a panel which you use to start commands and where your windows go when you shrink them into icons. You should also see a terminal window, from which you can start the programs you'll need for the assignment - e.g., fsa, the Brill tagger, and the Porter stemmer (which are discussed in a separate class).

The symbols in the panel that you need to know about are:

• The 'Application Starter', indicated by a large 'K', which should be the leftmost symbol in the panel. This is a bit like the 'Start' button on a Windows machine - if you click on it, a menu should appear which contains a list of applications, with lots of submenus. The most important applications should already have icons on the panel (and you can put them there if they don't), but you can use the application starter for the rest.
• The 'File Manager', also called 'Konqueror'. The icon for this is a little house in front of a folder / directory (this is because it browses through your home folder). [1] You can use this to look at your directory, just as you would do with the File Browser under Windows.
• The Terminal Window. The icon for this is a little shell in front of a terminal.
• Help - the icon for this is a little red-and-white round floater. (NOT the 'Suse Help Center', which has a similar icon that is green inside the floater.)

[1] A bit of terminology: in UNIX, the term 'directory' is used for what in Windows is usually referred to as a 'folder'. We'll use the terms interchangeably.

Also as in Windows, you start applications by clicking on them. This will usually open a window, which you can then shrink into an icon, enlarge, or close by clicking on one of the three symbols in the top-right corner. (See Figure 2, where a Terminal Window is displayed.) After you shrink a window, you can enlarge it again by clicking on its icon in the bottom-right corner of the KDE screen, just as in Windows.

[Figure 2: Windows in KDE]

In the rest of this section, we'll look at some of the basic applications.

2.2 Permissions

One important difference between Windows and Linux is that in the latter, each file has access rights: reading, writing, and executing permission. Usually, you are allowed to access your folders and create files in them; read, modify, and execute your files; and read and execute some globally accessible files. On the other hand, you are not usually allowed to read or modify other people's files or the global files. When you try to perform an operation on a file which you are not allowed to perform, you'll get a message like 'Permission denied'.

2.3 The Folder Editor

You use the folder editor to browse files. You get it by clicking on its icon in the Panel, which looks as in Figure 3. When you click on this application, you will get a window showing the files and folders in your top directory. (See Figure 4.)
You can then open them by SINGLE clicking on them.

Exercise 2: Open a Folder editor. Create a new text file called Test.txt, and a new directory (i.e., folder) called TestDir. Move Test.txt inside TestDir.

[Figure 3: The Folder editor icon]

[Figure 4: The Folder editor window]

2.4 The Terminal Window

The terminal window is used to execute commands, such as java, wordnet (see subsequent labs) or the tagger. The terminal windows are all named 'Konsole' - Konsole 1, 2, 3, etc. If you don't have a terminal window, you get one by clicking on the shell icon in the Panel (Figure 5).

[Figure 5: The Terminal Window icon]

Once opened, the window should look like the one in Figure 6.

[Figure 6: The Terminal Window]

Things should have been set up so that you can use the software you'll need in the rest of the course - e.g., the commands javac, java, perl, wn and tagger - from a terminal window without doing anything else. To run one of these commands, just type it in the window. E.g., to use the java compiler javac to compile a file called HelloWorldApp.java, type

    javac HelloWorldApp.java

in a terminal window, as shown in Figure 7. Let us know quickly if some of these commands don't work - e.g., if you get the response 'Command not found'!

[Figure 7: Typing a command in the terminal window]

One thing that you have to remember about terminal windows is that at every moment you are in a particular folder, and unless you type an absolute pathname for a file (one which begins with a '/', like /usr/local/jdk/bin/java), the program executing in the terminal window (the 'shell') will look for files in the current folder only. To find out which folder you are in, type pwd in the terminal window:

    >pwd
    /usr/course/cc437

Exercise 3: Find out what your current folder / directory is.
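The difference between relative and absolute pathnames can be tried out safely in a scratch directory. The following is only a sketch: the directory names are made up for illustration and are not course directories, mktemp is assumed to be available (it is on most Linux systems), and cd is the command for changing the current folder.

```shell
#!/bin/sh
# Scratch tree with made-up names, so nothing in the course directory is touched.
base=$(mktemp -d)
mkdir -p "$base/code/java"
echo "// placeholder" > "$base/code/java/HelloWorldApp.java"

cd "$base/code"
# A relative name is resolved against the current folder, so the file
# is not visible from here:
if [ -r HelloWorldApp.java ]; then found_in_code=yes; else found_in_code=no; fi
echo "visible from code: $found_in_code"

cd java                      # relative: move into the subfolder
if [ -r HelloWorldApp.java ]; then found_in_java=yes; else found_in_java=no; fi
echo "visible from java: $found_in_java"

cd ..                        # back up to the containing folder
cd "$base/code/java"         # absolute: works from any current folder
pwd                          # shows where we ended up
```

The same file name gives different results depending on the current folder, whereas the absolute pathname on the last cd works from anywhere.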
For example, suppose that the result of Exercise 3 is that your current directory is /usr/course/cc437/code, and you created a file called HelloWorldApp.java in a subdirectory of your current directory - i.e., a directory called /usr/course/cc437/code/java. Before you can run the java compiler (see below) on your file, you have to make sure you are in the appropriate directory, else you'll get this result:

    >javac HelloWorldApp.java
    error: Can't read: HelloWorldApp.java
    1 error

If you want to change the current directory to the subdirectory java, you have to use the command cd followed by the name of the folder, just as under Windows:

    >cd java

You can also type cd .. to move to the folder which contains the current folder, or type an absolute pathname, as in

    >cd /usr/course/cc437/code/java

Important: You CANNOT compile a file in a directory unless you have writing permissions in that directory (see above). Therefore, if you cd to /usr/course/cc437/code/java before attempting to use javac, you will get an error. Copy the file into your own folders first.

2.5 Editing a File

Your assignment will involve writing a Java program or Perl script. You can do this with any of several editors, such as Emacs; but perhaps the simplest is KDE's own Text Editor, which you get by clicking its icon on the panel (Figure 8).

[Figure 8: The Text Editor icon]

Once opened, the window should look like the one in Figure 9.

[Figure 9: The KDE Text Editor]

Doing this will allow you to start a new file, which you can then save in one of your folders as usual. After you have created a file, you can then open it by clicking on its icon in the Konqueror application. The default text editor is quite easy to use, and very similar to Word or Notepad. However, if you know how to use them, there are much more powerful editors on Linux - in particular, the Emacs text editor, which you get by typing 'emacs' in the terminal window.
[Figure 10: The Emacs Text Editor]

Once Emacs has started, you can use the 'Files' item in the menu at the top of the window to open files, create new ones, or save them. You type in text just as in any other editor. However, you can also put the editor in a 'java-mode', 'perl-mode', 'c-mode' or 'prolog-mode', which makes it easier to edit programs written in those languages.

Exercise 4: Create a java program that prints out "Hello, world" using either the Text Editor or Emacs. Save the program in a file called HelloWorldApp.java.

2.6 Additional Documentation, Tutorials, etc.

The window interface to Linux we are using, the KDE, comes with extensive documentation. You get the documentation by clicking on the Help icon (Figure 11). Once opened, the window should look like the one in Figure 12.

[Figure 11: The Help icon]

[Figure 12: The KDE Text Editor]

Make sure you go at least through the 'Introduction to KDE'. You also get access to the Unix manual pages.

3 Running Java

Running Java under Linux is not very different from running Java under Windows. The default Java on the Linux machines is 1.4.2, which includes the Regular Expressions library. Java is in the directory /usr/local/jdk; look at the subdirectories docs and man for information. (In Linux, you use the command ls to check the contents of a directory - this is like dir under Windows.)

You should be able to compile and execute Java programs from a terminal window without doing anything special. We already saw that the java compiler is called javac, as on the Windows machines; the java interpreter is called java, and the Java debugger jdb. For example, suppose you want to compile the file HelloWorldApp.java containing the Java program that you created in Exercise 4. You do that by moving to the appropriate directory using cd, and then typing the following in your terminal window:

    javac HelloWorldApp.java

This creates a file called HelloWorldApp.class, just as under Windows.
Then you can run the program using the Java interpreter java, by typing the following in your terminal window:

    java HelloWorldApp

For a quick tutorial on using Java on Unix machines, see
http://java.sun.com/docs/books/tutorial/getStarted/cupojava/unix.html

4 Shell Scripts

Shell scripts are the Unix equivalent of 'batch programs' in Windows - a way of executing more than one command at a time. A shell script is just a series of shell commands, typically stored in a file. A very simple shell script looks like the one in Fig. 13. (Notice the magic incantation #!/bin/sh. Make sure you have that in the first line.)

    #!/bin/sh
    echo "$1"

Figure 13: First shell script

To run this script, save it in a file (say, 'script1.sh'), then make that file executable using chmod:

    chmod ugo+x script1.sh

and then pass the script to the shell sh to execute, as follows:

    sh script1.sh 'Hello, world'

If you made the script executable and added the #!/bin/sh line at the beginning, you can also execute the script by treating it as a command:

    ./script1.sh 'Hello, world'

Exercise 5: Create the shell script just discussed, make it executable, and run it from a terminal window.

This script illustrates a couple of important points. The first is that you can execute a shell script 'directly', provided that you include the magic incantation #!/bin/sh (or #!/bin/csh) at the beginning; this line tells the system which shell should interpret the script. The second is that you can access the command line arguments using the variables $1 ... $9.

The reason why shell scripts are useful is that you can use them to combine a number of commands together and write a program which uses executable commands as primitives, using the shell's control commands (for, if). (Hint: this would be a very simple way of implementing your programming assignment.) For example, you can write a program that appends the contents of all the files named as its arguments to the end of a file called 'tmp' by means of the script in Fig. 14.

    #!/bin/sh
    for arg
    do
        cat "$arg" >> tmp
    done

Figure 14: Second shell script

The following script puts the line "Dear X," in front of a form letter and prints the result, once for each person:

    #!/bin/sh
    for person in Joe Leslie Edie Allan
    do
        echo "Dear $person," | cat - form_letter | lpr
    done

Notice the | construct. This is a pipe, which means: use the output of the first command (echo) as the input of the second command (cat) - pipes are discussed at length in the next section.

The if construct is particularly useful in combination with the test command, which tests, for example, the existence of files:

    #!/bin/sh
    if test -r $HOME/.signature
    then
        .... Do whatever ....
    else
        echo "Can't read your '.signature.' Quitting." 1>&2
        exit 1
    fi

You can use the csh shell instead of the Bourne shell sh to execute shell scripts by replacing the first line with #!/bin/csh. For more details, look at the manual pages for csh and sh.

5 Tokenization and Pipes

We can now start looking at an actual NL application. We will begin with a few examples of code for one of the tasks discussed in class, text tokenization.

One example of the useful software that comes with Unix is tr. tr is a TRanslation program - it takes two arguments, a set of characters to match and a set of replacement characters, and replaces every character in its input that belongs to the first set with the corresponding character of the second. Get the file HLT_data_mining.txt from the course directory, at /usr/course/cc437/data, and type:

    cat HLT_data_mining.txt | tr "A-Z" "a-z"

As you can see, tr used in this way converts all uppercase characters in its input into lowercase ones.

An important concept used in this first example is that of a pipeline - a series of processes each of which takes some text as input, does some work with it, and outputs something that can in turn be used by another process. What you typed in is actually two programs: cat HLT_data_mining.txt, which reads a file and prints it out; and tr, which reads its STANDARD INPUT and outputs a modified version of what it reads.
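The shell constructs from Section 4 and the pipe just introduced can be combined in a few lines. The following is only a sketch: the function name lower, the sample sentence and the temporary file are made up for illustration and are not part of the course materials, and mktemp is assumed to be available.

```shell
#!/bin/sh
# lower: a filter in the sense used above - it reads standard input
# and writes a lowercased copy to standard output.
lower() {
    tr 'A-Z' 'a-z'
}

# Made-up sample text in a temporary file (not a course file).
sample=$(mktemp)
printf 'Text Mining and DATA Mining\n' > "$sample"

if test -r "$sample"
then
    # cat starts the pipeline; lower consumes cat's output as its input.
    result=$(cat "$sample" | lower)
    echo "$result"
else
    echo "Can't read $sample. Quitting." 1>&2
    exit 1
fi
```

Because lower only reads standard input, it can be placed anywhere in a longer pipeline, which is what makes small filters of this kind reusable.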
A lot of UNIX software can be used in this way; UNIX makes this easy to do because all shells - the programs that read your input from the terminal windows - understand commands like the one above, using the special 'pipe' symbol |:

    program1 | program2 | program3

What this means is: take the output of program1, and feed it to program2 as input; the output of program2 will in turn serve as the input to program3. A lot of text processing can be done in UNIX simply by creating a pipeline which connects two or more commands.

More generally, tr can be used to perform all sorts of transformations on its input. Another popular use of tr is to find the words contained in a text, especially in combination with uniq (see below). This is done using the following magic incantation:

    cat HLT_data_mining.txt | tr -sc 'A-Za-z' '\012'

Exercise 6: Try this, and then try to understand the result you get by looking at the tr online manual page, which, as you know, can be found by typing man tr in a terminal window.

A third example of a pipeline is the following, in which the output of the first pipeline is fed to the perl interpreter to obtain a crude tokenizer. (Make sure that everything is entered as part of a single line, else the pipeline will be 'broken'!)

    cat HLT_data_mining.txt | tr "A-Z" "a-z" | perl -ne '$_ =~ s/\s+/\n/g; print $_;'

In this example, perl is used from the command line. The -ne options mean "perform the specified command on every line of input" (-n wraps the command in a loop over the input lines, and -e introduces the command itself). What the command in quotes says is: print the current line ($_) after replacing (using the s/../../ construct) every sequence of white space (recall that the metacharacter \s means 'any white space', and + means '1 or more repetitions') with a newline character. The result of this pipeline should be output with a separate word on each line. (Notice how many of these 'tokens' are not really words you'd find in a lexicon ...)

Exercise 7: The 'tokenizer' above is slightly less crude than what we did with tr before, but still pretty basic. For example, in the output of this pipeline punctuation is still attached to the preceding word (check out for example the line with 'machines'). Can you think of a way of fixing this? (Hint: just add one more step to the pipeline ...)

Another useful UNIX program is sort, which can sort its input according to different criteria, and output the result of this sort. Try to add sort at the 'end' of the pipelines above - first after just tr, then after the 'tokenizer'. E.g., type:

    cat HLT_data_mining.txt | tr "A-Z" "a-z" | perl -ne '$_ =~ s/\s+/\n/g; print $_;' | sort

(Make sure you also try to add sort after the 'punctuation remover' that you developed as part of the exercise above.)

And finally, we can remove duplicate lines, using another standard Unix command, uniq:

    cat HLT_data_mining.txt | tr "A-Z" "a-z" | perl -ne '$_ =~ s/\s+/\n/g; print $_;' | perl -ne '$_ =~ s/([\.:\;\?])/\n$1/g; print $_;' | sort | uniq

uniq can also be used to count the number of instances of each token that we found - try to use uniq -c instead of uniq at the end of the pipeline above.

Another command that can be used to count things is wc. Suppose you want to count how many files you have in the current directory. You could run the ls -l command, which outputs a list of all the files in the current directory, one per line (ls is similar to, but more powerful than, the dir program available under DOS), and then count how many lines there are in the output of the program using the command wc -l:

    ls -l | wc -l

(Note that ls -l also prints a leading 'total' line, so the number you get is one more than the number of files.)

A slightly more sophisticated version of this is a program that counts all the JAVA files in the current directory - i.e., all the files with suffix .java. This can be done by adding a filter program in between ls and wc in the pipeline we just saw - i.e., a program that selects some of the lines output by ls. One program that can be used to do this is egrep.
egrep, further discussed below, takes as input a regular expression like those discussed in class, and outputs the lines of input that contain text matching that expression:

    ls -l | egrep java | wc -l

Here is a short list of some of the Unix commands that are most useful for text processing:

cat: This command simply outputs the contents of a file, and is useful to 'start' pipelines, as seen in the examples above - i.e., to feed input to commands like tr that do not take their input from files, but from the standard input.
egrep, grep: For search. See below.
perl: A very general text processor. See below.
sort: Sorts lines in alphabetical order.
tr: Performs various transformations on its input.
uniq: Removes duplicate lines (if they occur one after the other).
wc: Word Count. It outputs 3 figures: the number of lines, words, and bytes in a file (or the standard input).

To learn more about a command in UNIX, use the online manual, which can be read with the man command - for example, to learn more about sort, type:

    man sort

The perl interpreter has particularly extensive online documentation - better than most manuals.

6 Search commands: grep and egrep

There are many tools to search for patterns in UNIX; the simplest and most popular are grep and egrep. Both can search for patterns either in a list of files, using the syntax:

    egrep PATTERN FILE1 FILE2

where PATTERN is a regular expression like those discussed in class, or in the standard input - useful if you want to use them in a pipeline. For example, if you want to see how many files with suffix '.txt' you have in the current directory, you can use the following pipeline, which first lists the contents of the present directory in 'long' format using the command ls (similar to dir in Windows), then selects the lines containing '.txt' using egrep, and finally counts how many lines there are in the output of egrep using the command wc -l:

    ls -l | egrep '\.txt' | wc -l

Notice that egrep is used here as a 'filter'.
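The tokenize, sort, count and filter steps above can be put together on a small made-up text. This is a sketch under stated assumptions: the function name tokens and the two sample sentences are invented for illustration, and grep -E is used in place of egrep, since the two accept the same patterns and recent versions of GNU grep print a warning when egrep is called.

```shell
#!/bin/sh
# Made-up sample text in a temporary file (not a course file).
sample=$(mktemp)
printf 'Text mining and data mining.\nMining text is fun.\n' > "$sample"

# tokens: lowercase the input, then turn every run of non-letters
# into a single newline, giving one token per line.
tokens() {
    tr 'A-Z' 'a-z' | tr -sc 'a-z' '\012'
}

# Frequency table: uniq -c counts adjacent duplicates (hence the sort
# before it); sort -rn then puts the most frequent token first.
cat "$sample" | tokens | sort | uniq -c | sort -rn

# grep -E as a filter: keep only the lines that are exactly 'mining',
# then count them. tr -d ' ' strips the padding some versions of wc add.
mining_count=$(cat "$sample" | tokens | grep -E '^mining$' | wc -l | tr -d ' ')
echo "mining occurs $mining_count times"
```

Anchoring the pattern with ^ and $ makes the filter match whole tokens only; without the anchors, a token such as 'undermining' would be counted as well.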
For many more examples of the use of grep and related commands, check out the very useful online introduction written by B. Dowling:
http://www-uxsup.csx.cam.ac.uk/courses/Text/chap0.ps

Exercise 8: Using egrep, search for lines containing either 'text mining' or 'data mining' in the file HLT_data_mining.txt.

7 Perl

We already discussed Perl in the lectures and in the previous lab, so you should be pretty familiar with it. Perl is used under Linux much as it is used under Windows, but under Linux its online manual can be consulted from the command line, using:

    man perl

so we won't say much about it here, except that there are three basic ways of using it. If all you want is to execute a Perl command on all lines of a file, you can do what we did in the examples above. If you want to do something more complex, you can write a program in Perl, save it in a file (say, with suffix '.pl' or '.perl') and then invoke it as follows:

    perl FILE.pl

where FILE.pl is the program that you just wrote. For example, the following highly complex Perl program prints out "Hello, world!":

    # hello.pl
    # A complex Perl program
    print "Hello, world!\n";

Save it in a file - say, 'hello.pl' - and then try to execute it as described above. Alternatively, you can put the following in the first line:

    #!/usr/local/bin/perl

This is an instruction to the shell to treat this file as an executable program (it's important that this is the FIRST line!). Once you have made this change, AND allowed yourself and others to execute the file, as follows:

    chmod ugo+x hello.pl

you can then execute the program by simply typing its name on the command line, i.e.:

    hello.pl

(If the shell answers 'Command not found', type ./hello.pl instead, as we did with script1.sh.)

Exercise 9: Write a perl program that does what the simple 'tokenizer' did in a previous example, and can be executed from the command line. Put this command in a pipeline in place of the calls to perl in the examples above. Hint: in order to execute Perl commands on each line of input, include them in the following while loop:

    while (<>) {
        .... PUT YOUR COMMANDS HERE
    }

8 Shell scripts

We talked earlier on about shell scripts, the Linux equivalent of 'batch programs' in Windows. Shell scripts are particularly useful for combining commands like the ones we have seen in this lab into a single command. Instead of typing every time the sequence of commands we were trying in the examples above, we could write a shell script that does that:

    #!/bin/sh
    # extract_tokens.sh
    # Extract tokens from the file HLT_data_mining.txt
    cat HLT_data_mining.txt | \
    tr "A-Z" "a-z" | \
    perl -ne '$_ =~ s/\s+/\n/g; print $_;' | \
    perl -ne '$_ =~ s/([\.:\;\?])/\n$1/g; print $_;' | \
    sort | \
    uniq

Notice a few key points about this program. First of all, the first line uses the same method discussed above when talking about Perl to tell the shell that this is an executable program - except that the interpreter this time is the Bourne Shell, /bin/sh:

    #!/bin/sh

Secondly, notice that in order to put the commands in the pipeline on separate lines, I added a backslash ('\') at the end of each line.

This shell script is not very useful - it can only be used to extract tokens from the file HLT_data_mining.txt. A more useful script would take the name of the file from the command line. The following script illustrates the use of the for command of the Bourne shell to loop over all the arguments, as well as the use of variables:

    #!/bin/sh
    # et.sh
    # Extract tokens from files in the command line
    # Usage: et.sh FILE FILE
    for arg
    do
        cat "$arg" | \
        tr "A-Z" "a-z" | \
        perl -ne '$_ =~ s/\s+/\n/g; print $_;' | \
        perl -ne '$_ =~ s/([\.:\;\?])/\n$1/g; print $_;' | \
        sort | \
        uniq
    done
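One possible refinement of et.sh combines the for loop with the test command from Section 4, so that unreadable arguments are reported rather than passed to cat. The following is a sketch, not the script used in the course: the function name extract_tokens, the '.tokens' output suffix and the sample file are hypothetical, and the two perl steps are replaced by the simpler tr -sc tokenizer from Section 5 to keep the example short.

```shell
#!/bin/sh
# extract_tokens: hypothetical variant of et.sh. For each readable
# argument, write its sorted, de-duplicated tokens to ARG.tokens;
# report unreadable arguments on standard error and carry on.
extract_tokens() {
    for arg
    do
        if test -r "$arg"
        then
            tr 'A-Z' 'a-z' < "$arg" | tr -sc 'a-z' '\012' | sort | uniq > "$arg.tokens"
        else
            echo "Can't read '$arg', skipping." 1>&2
        fi
    done
}

# Try it on one made-up file and one file that does not exist.
dir=$(mktemp -d)
printf 'Data mining, text mining.\n' > "$dir/a.txt"
extract_tokens "$dir/a.txt" "$dir/missing.txt"
cat "$dir/a.txt.tokens"
```

Writing one output file per input keeps the tokens of different files apart; the original et.sh instead prints everything to standard output, which is more convenient when the script is itself one step of a pipeline.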