CC437 - Lab 2
NLE in Linux
Massimo Poesio
October 26th, 2004
1 Objectives of this Lab
Linux comes with a lot of software for manipulating text. The purpose of this
lab is to get you acquainted with Linux, and to introduce the software that can
be used for NL tasks.
2 A Quick Introduction to Linux
The notes in this section are meant as a quick introduction to the Linux en-
vironment - more precisely, to the KDE window interface to the Suse Linux
operating system. You should check that you know how to do all that follows;
if you do, continue to the next section. For a more extensive introduction, you
should look at the manual pages (see below).
2.1 Logging In
In order to log in, you need to type in your username and password just as you
would on a Windows machine - the only difference is that Linux doesn’t print
out anything when you type in your password (unlike Windows, which prints out
asterisks). After you log in, your screen should look a bit like Figure 1.
Exercise 1: Make sure you can log in!!
Much as on a Windows machine, in the KDE window interface you have a
panel which you use to start commands and where your windows go when you
shrink them to icons. You should also see a terminal window, from which you
can start the programs you’ll need for the assignment - e.g., fsa, the Brill
tagger, and the Porter stemmer (which are discussed in a separate class).
Figure 1: The Initial Screen
The symbols in the panel that you need to know about are:
• The ’Application Starter’, indicated by a large ’K’, which should be the
leftmost symbol in the panel. This is a bit like the ’Start’ button on a
Windows machine - if you click on it, a menu appears which contains a list
of applications, with lots of submenus. The most important applications
should already have icons on the panel (and you can put them there if
they don’t), but you can use the application starter for the rest.
• The ’File Manager’, also called ’Konqueror’. The icon for this is a little
house in front of a folder / directory (because it opens on your home
folder).1 You can use this to look at your directories, just as you would
do with the File Browser under Windows.
• The Terminal Window. The icon for this is a little shell in front of a
terminal.
• Help - the icon for this is a little red-and-white round floater. (NOT the
’Suse Help Center’, with a similar icon which is however green inside the
floater.)
Also as in Windows, you start applications by clicking on them. This will usually
open a window, which you can then shrink into an icon, enlarge, or close by
clicking on one of the three symbols in the top-right corner. (See Figure 2,
where a Terminal Window is displayed.)
After you shrink a window, you can enlarge it again by clicking on its icon
in the bottom-right corner of the KDE screen, just as in Windows.
1 A bit of terminology: in UNIX, the term ’directory’ is used for what in Windows is usually
referred to as a ’folder’. We’ll use the terms interchangeably.
Figure 2: Windows in KDE
In the rest of this section, we’ll look at some of the basic applications.
2.2 Permissions
One important difference between Windows and Linux is that in the latter, each
file has access rights: reading, writing, and executing permission. Usually, you
are allowed to access your folders and create files in them; read, modify, and
execute your files; and read and execute some globally accessible files. On the
other hand, you are not usually allowed to read or modify other people’s files
or the global files. When you try to perform an operation on a file which you
are not allowed to, you’ll get a message like ’Permission denied’.
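The access rights described above can be inspected with ls -l and changed with chmod. The following is a minimal sketch, assuming a scratch directory created with mktemp and a hypothetical file name notes.txt (neither comes from the course setup):

```shell
#!/bin/sh
# Sketch: inspect and change the access rights of a scratch file.
workdir=$(mktemp -d)            # scratch directory (hypothetical location)
touch "$workdir/notes.txt"      # hypothetical file name

chmod 640 "$workdir/notes.txt"  # owner: read+write, group: read, others: none
# The first 10 characters of an `ls -l` line are the file type and the
# read/write/execute bits for owner, group and others.
perms=$(ls -l "$workdir/notes.txt" | cut -c1-10)
echo "$perms"                   # prints -rw-r-----

chmod u-w "$workdir/notes.txt"  # take write permission away from the owner
perms_readonly=$(ls -l "$workdir/notes.txt" | cut -c1-10)
echo "$perms_readonly"          # prints -r--r-----

rm -rf "$workdir"
```

Each group of three letters (rwx) stands for reading, writing, and executing permission; a dash means the permission is not granted.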
2.3 The Folder Editor
You use the folder editor (the File Manager described above) to browse files.
You get it by clicking on its icon in the Panel, which looks as in Figure 3.
When you click on this application, you will get a window showing the files
and folders in your top directory. (See Figure 4.) You can then open them by
SINGLE clicking on them.
Exercise 2: Open a Folder editor. Create a new text file called Test.txt, and
a new directory (i.e., folder) called TestDir. Move Test.txt inside TestDir.
Figure 3: The Folder editor icon
Figure 4: The Folder editor window
2.4 The Terminal Window
The terminal window is used to execute commands, such as java, wordnet
(see subsequent labs) or the tagger. The terminal windows are all
named ’Konsole’ - Konsole 1, 2, 3, etc. If you don’t have a terminal window,
you get one by clicking on the shell icon in the Panel (Figure 5):
Figure 5: The Terminal Window icon
Once opened, the window should look like the one in Figure 6.
Figure 6: The Terminal Window
Things should have been set up so that you can use the software you’ll need in
the rest of the course - e.g., the commands javac, java, perl, wn and tagger
- from a terminal window without doing anything else. To run one of these
commands, just type it in the window. E.g., to use the java compiler javac to
compile a file called HelloWorldApp.java, type javac HelloWorldApp.java in
a terminal window as shown in Figure 7. Let us know quickly if some of these
commands don’t work - e.g., if you get the response ’Command not found’!
Figure 7: Typing a command in the terminal window
One thing that you have to remember about terminal windows is that at
every moment you are in a particular folder, and unless you type an absolute
pathname for a file (one which begins with a ’/’, like /usr/local/jdk/bin/java)
the program executing in the terminal window (the ’shell’) will look for files in
the current folder only. To find out which folder you are in, type pwd in the
terminal window:
>pwd
/usr/course/cc437
Exercise 3: Find out what your current folder / directory is.
For example, suppose that the result of Exercise 3 is that your current
directory is /usr/course/cc437/data, and you created a file called
HelloWorldApp.java in a subdirectory of your current directory - i.e. a directory
called /usr/course/cc437/datacode/java. Before you can run the java compiler
(see below) on your file, you have to make sure you are in the appropriate
directory, else you’ll get this result:
>javac HelloWorldApp.java
error: Can’t read: HelloWorldApp.java
1 error
If you want to change current directory to the subdirectory java, you have to
use the command cd followed by the name of the folder, just as under Windows:
>cd java
You can also type cd .. to move to the folder which contains the current folder,
or type an absolute pathname, as in
>cd /usr/course/cc437/code/java
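The difference between relative and absolute pathnames can be tried out safely in a scratch directory. This is only a sketch: the directory tree below is created with mktemp and merely stands in for a course directory, and the names code and java are chosen for illustration.

```shell
#!/bin/sh
# Sketch: moving between directories with relative and absolute pathnames.
top=$(mktemp -d)                 # scratch tree standing in for a course directory
mkdir -p "$top/code/java"

cd "$top/code"                   # absolute pathname: it starts with '/'
cd java                          # relative pathname: looked up in the current directory
here=$(basename "$(pwd)")
echo "$here"                     # prints java

cd ..                            # move up to the directory containing this one
parent=$(basename "$(pwd)")
echo "$parent"                   # prints code

cd /                             # leave the scratch tree before deleting it
rm -rf "$top"
```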
Important: You CANNOT compile a file in a directory unless you have
write permission in that directory (see above), because the compiler writes
its output there. Therefore, if you cd to
/usr/course/cc437/datacode/java before attempting to use javac, you will
get an error. Copy the file into one of your own folders first.
2.5 Editing a File
Your assignment will involve writing a Java program or Perl script. You can
do this using a dedicated programming editor if you have one; otherwise there
are several general-purpose editors, such as Emacs. Perhaps the simplest is
the KDE’s own Text Editor, which you get by clicking its icon on the panel
(Figure 8).
Figure 8: The Text Editor icon
Once opened, the window should look like the one in Figure 9.
Figure 9: The KDE Text Editor
Doing this will allow you to start a new file, which you can then save in one
of your folders as usual. After you have created a file, you can then open it by
clicking on its icon in the Konqueror application.
The default text editor is quite easy to use, and very similar to Word or
Notepad. However, if you know how to use them, there are much more powerful
editors on Linux - in particular, the Emacs text editor, which you get by typing
’emacs’ in the terminal window.
Figure 10: The Emacs Text Editor
Once started, you can use the ’Files’ item in the menu at the top of the
window to open files, create new ones, or save them. You type in text just as in
any other editor. However, you can also put the editor in a ’java-mode’ or
’perl-mode’ or ’c-mode’ or ’prolog-mode’ which make it easier for you to edit
programs written in those languages.
Exercise 4: Create a java program that prints out “Hello, world” using either
the Text Editor or Emacs. Save the program in a file called HelloWorldApp.java.
2.6 Additional Documentation, Tutorials, etc.
The window interface to Linux we are using, the KDE, comes with extensive
documentation. You get the documentation by clicking on the Help icon (Figure 11).
Once opened, the window should look like the one in Figure 12.
Figure 11: The Help icon
Figure 12: The Help window
Make sure you go at least through the ’Introduction to KDE’. You also get
access to the Unix manual pages.
3 Running Java
Running Java under Linux is not very different from running Java under
Windows. The default Java on the Linux machines is 1.4.2, which includes
the Regular Expressions library. Java is in the directory
/usr/local/jdk
Look at the subdirectories docs and man for information. (In Linux, you use
the command ls to check the contents of a directory - this is like dir under
Windows.) You should be able to compile and execute Java programs from a
terminal window without doing anything special. We already saw that the java
compiler is called javac, as on the Windows machines; the java interpreter is
called java, and the Java debugger jdb.
For example, suppose you compile the file HelloWorldApp.java containing
your Java program that you created in Exercise 4. You do that by moving to the
appropriate directory using cd, and then typing what follows in your terminal
window:
javac HelloWorldApp.java
This creates a file called HelloWorldApp.class, just as under Windows. Then
you can run the program using the Java interpreter java, by typing what follows
in your terminal window:
java HelloWorldApp
For a quick tutorial on using Java on Unix machines, see
http://java.sun.com/docs/books/tutorial/getStarted/cupojava/unix.html
4 Shell Scripts
Shell scripts are the Unix equivalent of ‘batch programs’ in Windows - a way
of executing more than one command at a time. A shell script is just a series
of shell commands, typically stored in a file. A very simple shell script would
then look like the one in Fig. 13.
#!/bin/sh
echo "$1"
Figure 13: First shell script
(Notice the magic incantation #!/bin/sh. Make sure you have that in the
first line.) To run this script, save it in a file (say, ’script1.sh’), then make that
file executable using chmod:
chmod ugo+x script1.sh
You can then pass the script to the shell sh to execute, as follows:
sh script1.sh 'Hello, world'
If you made the script executable and added the #!/bin/sh line at the
beginning, you can also execute the script by treating it as a command:
./script1.sh 'Hello, world'
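The whole sequence (create the file, make it executable, run it both ways) can be sketched as follows. The scratch directory is an assumption made so that the example does not touch your own folders; the script body is the one from Figure 13.

```shell
#!/bin/sh
# Sketch: create the script from Figure 13, make it executable, run it both ways.
dir=$(mktemp -d)                 # scratch directory (assumption, not part of the lab)

# The quoted 'EOF' stops the shell from expanding $1 while writing the file.
cat > "$dir/script1.sh" <<'EOF'
#!/bin/sh
echo "$1"
EOF

chmod ugo+x "$dir/script1.sh"    # allow user, group and others to execute it

out_sh=$(sh "$dir/script1.sh" 'Hello, world')    # pass the file to sh explicitly
out_direct=$("$dir/script1.sh" 'Hello, world')   # run it as a command
echo "$out_sh"
echo "$out_direct"

rm -rf "$dir"
```

The first form works even without chmod, because sh only needs to read the file; the second form needs both the execute permission and the #!/bin/sh line.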
Exercise 5: Create the shell script just discussed, make it executable, and run
it from a terminal window.
This script illustrates a couple of important points. The first is that you can
execute a shell script ’directly’, provided that you include the magic
incantation #!/bin/sh (or #!/bin/csh) at the beginning; this line tells the
system which shell should be used to interpret the script. The second is that
you can access the command line arguments using the variables $1 . . . $9.
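A small sketch of how the argument variables behave. It uses a shell function rather than a separate script file, on the assumption that this is easier to try out; inside a function, $1, $2 and $# (the number of arguments) refer to the function's arguments in the same way as they refer to a script's arguments. The function name describe_args is made up for this illustration.

```shell
#!/bin/sh
# Sketch: positional parameters. $# is the number of arguments,
# $1 the first argument, $2 the second, and so on.
describe_args() {
    echo "count=$# first=$1 second=$2"
}

# Quoting keeps 'Hello, world' together as a single argument.
result=$(describe_args 'Hello, world' again)
echo "$result"      # prints count=2 first=Hello, world second=again
```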
The reason why shell scripts are useful is that you can use them to combine a
number of commands together and write a program which uses executable com-
mands as primitives, using the shell’s control commands (for, if). (Hint: this
would be a very simple way of implementing your programming assignment.)
For example, you can write a program that appends the contents of all the
files named in its arguments to the end of a file called ’tmp’ by means of the
script in Fig. 14.
#!/bin/sh
for arg
do
cat "$arg" >> tmp
done
Figure 14: Second shell script
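To see what the loop in Fig. 14 does, here is a sketch that wraps the same loop in a shell function (the name append_all and the two input files are invented for the example) and runs it in a scratch directory. A for loop with no list iterates over the arguments.

```shell
#!/bin/sh
# Sketch: the loop from Fig. 14, tried out on two small files.
dir=$(mktemp -d)
cd "$dir"
printf 'first file\n'  > a.txt   # hypothetical input files
printf 'second file\n' > b.txt

append_all() {
    for arg                      # no "in ..." list: loop over the arguments
    do
        cat "$arg" >> tmp        # >> appends instead of overwriting
    done
}

append_all a.txt b.txt
combined=$(cat tmp)
echo "$combined"                 # prints the two lines, in argument order

cd /
rm -rf "$dir"
```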
The following script puts the line “Dear X,” in front of a form letter for each
person in a list, and sends each result to the printer with lpr:
#!/bin/sh
for person in Joe Leslie Edie Allan
do
echo "Dear $person," | cat - form_letter | lpr
done
Notice the | construct. This is a pipe, which means: use the output of the
first command (echo) as the input of the second command (cat). The - argument
tells cat to read its standard input at that point. Pipes are discussed at
length in the next section.
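The form-letter loop can be tried without a printer by leaving out lpr and collecting the output in a file instead. This is a sketch under that assumption; the letter text and the file name letters.txt are invented.

```shell
#!/bin/sh
# Sketch: the form-letter loop with the output saved to a file, not printed.
dir=$(mktemp -d)
printf 'Your library books are overdue.\n' > "$dir/form_letter"

for person in Joe Leslie
do
    # cat writes its standard input (the "-") first, then the letter.
    echo "Dear $person," | cat - "$dir/form_letter"
done > "$dir/letters.txt"

letters=$(cat "$dir/letters.txt")
echo "$letters"                  # a greeting line followed by the letter, twice

rm -rf "$dir"
```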
The if construct is particularly useful in combination with the test com-
mand that tests, for example, the existence of files:
#!/bin/sh
if test -r "$HOME/.signature"
then
    : # .... Do whatever ....
else
    echo "Can't read your '.signature'. Quitting." 1>&2
    exit 1
fi
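A runnable sketch of the same if/test pattern, applied to one file that exists and one that does not. The function name check_readable and the file names are invented for the illustration.

```shell
#!/bin/sh
# Sketch: test -r succeeds only if the file exists and can be read.
check_readable() {
    if test -r "$1"
    then
        echo "readable"
    else
        echo "not readable"
    fi
}

dir=$(mktemp -d)
touch "$dir/present.txt"

r1=$(check_readable "$dir/present.txt")
r2=$(check_readable "$dir/missing.txt")
echo "$r1"                       # prints readable
echo "$r2"                       # prints not readable

rm -rf "$dir"
```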
You can use the csh shell instead of the Bourne shell sh to execute shell
scripts by replacing the first line with #!/bin/csh. For more details, look at
the manual pages for csh and sh.
5 Tokenization and Pipes
We can now start looking at an actual NL application. We will begin with a
few examples of code for one of the tasks discussed in class, text tokenization.
One example of the useful software that comes with Unix is tr. tr is a
TRanslation program - it takes two arguments, a set of characters to look for
and a set of replacement characters, and replaces each character of its input
that occurs in the first set with the corresponding character of the second.
Get the file HLT_data_mining.txt from the course directory, at
/usr/course/cc437/data, and type:
cat HLT_data_mining.txt | tr "A-Z" "a-z"
As you can see, tr used in this way converts all uppercase characters in its input
into lowercase ones.
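If you do not have the course file at hand, the same behaviour can be seen on a line of text supplied with echo. The sample sentences below are invented; the second command is an extra illustration that tr works character by character, not word by word.

```shell
#!/bin/sh
# Sketch: tr maps each character of the first set to the character at the
# same position in the second set.
lowered=$(echo "Human Language Technology AND Data Mining" | tr "A-Z" "a-z")
echo "$lowered"     # prints human language technology and data mining

# Here every l becomes 0 and every o becomes 1; other characters pass through.
mapped=$(echo "hello world" | tr "lo" "01")
echo "$mapped"      # prints he001 w1r0d
```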
An important concept used in this first example is that of a pipeline - a series
of processes each of which takes some text as input, does some work with it,
and outputs something that can in turn be used by another process. What you
typed in is actually two programs: cat HLT_data_mining.txt, which reads a
file and prints it out; and tr, which reads its STANDARD INPUT and outputs
a modified version of what it reads. A lot of UNIX software can be used in
this way; UNIX makes this easy to do because all shells - the programs that
read your input from the terminal windows - understand commands like the one
above, using the special ‘pipe’ symbol |:
program1 | program2 | program3
What this means is: take the output of program1, and feed it to program2 as
input; the output of program2 will in turn serve as the input to program3. A
lot of text processing can be done in UNIX simply by creating a pipeline which
connects two or more commands.
More generally, tr can be used to perform all sorts of transformations on
its input. Another popular use of tr is to find the words contained in a text,
especially in combination with uniq (see below). This is done using the following
magic incantation:
cat HLT_data_mining.txt | tr -sc 'A-Za-z' '\012'
Exercise 6: Try this, and then try to understand the result you get by looking
at the tr online manual page, which, as you know, can be found by typing man
tr in a terminal window.
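As a starting point for the exercise, here is a sketch of the same incantation applied to a single invented sentence, so that the effect is easy to inspect by eye. The comments summarise the two options.

```shell
#!/bin/sh
# Sketch: tr -sc on one line of text.
#   -c  use the complement of the first set, i.e. every character that is
#       NOT a letter
#   -s  squeeze each run of repeated output characters into a single one
# '\012' is the octal code of the newline character.
words=$(echo "Data mining, text-mining & NLP!" | tr -sc 'A-Za-z' '\012')
echo "$words"
# prints one word per line: Data, mining, text, mining, NLP
```

Note that the hyphenated word is split in two, since the hyphen is not a letter.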
A third example of a pipeline is the following, in which the output of the
lowercasing pipeline above is fed to the perl interpreter to obtain a crude
tokenizer. (Make sure that everything is entered as part of a single line, else
the pipeline will be ‘broken’!)
cat HLT_data_mining.txt | tr "A-Z" "a-z" | perl -ne '$_ =~ s/\s+/\n/g; print $_;'
In this example, perl is used from the command line. The -n and -e options
together mean “perform the specified command on every line of input”. What the
command in quotes says is: print the current line ($_) after replacing (using
the s/../../ construct) every sequence of white space (recall that the
metacharacter \s means ‘any white space’, and + means ‘1 or more repetitions’)
with a newline character. The output of this pipeline should have a separate
word on each line. (Notice how many of these ‘tokens’ are not really words you’d
find in a lexicon...)
Exercise 7: The ‘tokenizer’ above is slightly less crude than what we did with
tr before, but still pretty basic. For example, in the output of this pipeline
punctuation is still attached to the preceding word (check out for example the
line with ‘machines’). Can you think of a way of fixing this? (Hint: just add
one more step to the pipeline ... )
Another useful UNIX program is sort, which can sort its input according
to different criteria, and output the result of this sort. Try to add sort at the
‘end’ of the pipelines above - first after just tr, then after the ‘tokenizer’. E.g.,
type:
cat HLT_data_mining.txt | tr "A-Z" "a-z" | perl -ne '$_ =~ s/\s+/\n/g; print $_;' | sort
(Make sure you also try to add sort after the ‘punctuation remover’ that you
developed as part of the exercise above.)
And finally, we can remove duplicate lines, using another Unix standard
command, uniq:
cat HLT_data_mining.txt | tr "A-Z" "a-z" | perl -ne '$_ =~ s/\s+/\n/g; print $_;' | perl -ne '$_ =~ s/([\.:\;\?])/\n$1/g; print $_;' | sort | uniq
uniq can also be used to count the number of instances of each token that we
found - try to use uniq -c instead of uniq at the end of the pipeline above.
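The effect of sort | uniq -c is easiest to see on a tiny input. The sketch below makes two substitutions that are not part of the pipeline above: it uses the tr -sc tokenizer from earlier in this section in place of the perl steps, and it adds a final sort -rn (reverse numeric sort) so that the most frequent token comes first. The sample text is invented.

```shell
#!/bin/sh
# Sketch: token frequencies with sort and uniq -c.
counts=$(printf 'the cat saw the dog\nThe dog ran\n' |
    tr "A-Z" "a-z" |             # lowercase everything
    tr -sc 'a-z' '\012' |        # one token per line
    sort |                       # bring identical tokens together
    uniq -c |                    # collapse them, prefixing each with its count
    sort -rn)                    # highest count first
echo "$counts"
```

uniq only merges lines that are next to each other, which is why sort has to come before it.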
Another command that can be used to count things is wc. Suppose you
want to count how many files you have in the current directory. You could
then run the ls -l command, which outputs a list of all the files in the current
directory one per line (ls is similar to, but more powerful than, the dir program
available under DOS), and then count how many lines there are in the output
of the program using the command wc -l. (ls -l also prints a ‘total’ line before
the files, so the count is one too high; piping plain ls into wc -l avoids this.)
ls -l | wc -l
A slightly more sophisticated version of this is a program that counts all the
JAVA files in the current directory - i.e., all the files with suffix .java. This can
be done by adding a filter program in between ls and wc in the pipeline we just
saw - i.e., a program that selects some of the lines output by ls. One program
that can be used to do this is egrep. egrep, further discussed below, takes as
an argument a regular expression like those discussed in class, and outputs the
lines of its input that contain a match for that expression. The expression
\.java$ matches lines that end with .java:
ls -l | egrep '\.java$' | wc -l
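The choice of pattern matters. The sketch below counts files in a scratch directory (the file names are invented) with a loose pattern and with a pattern anchored to the end of the line. It uses plain ls rather than ls -l so that only file names reach egrep.

```shell
#!/bin/sh
# Sketch: counting .java files with a loose and with an anchored pattern.
dir=$(mktemp -d)
touch "$dir/HelloWorldApp.java" "$dir/Tokenizer.java" \
      "$dir/notes.txt" "$dir/javadoc.txt"

# 'java' matches anywhere in the line, so javadoc.txt is counted too.
loose=$(ls "$dir" | egrep java | wc -l | tr -d ' ')
# '\.java$' requires a literal dot followed by java at the end of the line.
strict=$(ls "$dir" | egrep '\.java$' | wc -l | tr -d ' ')

echo "loose=$loose strict=$strict"   # prints loose=3 strict=2

rm -rf "$dir"
```

The tr -d ' ' at the end only strips the padding that some versions of wc put in front of the number.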
Here is a short list of some of the Unix commands that are most useful for
text processing:
cat This command simply outputs the contents of a file, and is useful to ‘start’
pipelines, as seen in the examples above - i.e., to feed input to commands
like tr that do not take their input from files, but from the standard
input.
egrep, grep For search. See below.
perl A very general text processor. See below.
sort Sorts lines in alphabetical order.
tr Performs various transformations on its input.
uniq Removes duplicate lines (if they occur one after the other).
wc Word Count. It outputs 3 figures: the number of lines, words, and bytes in
a file (or the standard input).
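For instance, the three figures reported by wc can be checked on a two-line input. This is a small sketch with invented text; awk is used only to strip the padding that wc puts around its numbers.

```shell
#!/bin/sh
# Sketch: wc reports lines, words and bytes, in that order.
figures=$(printf 'one two\nthree\n' | wc | awk '{print $1, $2, $3}')
echo "$figures"     # prints 2 3 14 (2 lines, 3 words, 14 bytes)
```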
To learn more about a command in UNIX, use the online manual, which can be
read with the man command - for example, to learn more about sort, type:
man sort
The perl interpreter has particularly extensive online documentation - better
than most manuals.
6 Search commands: grep and egrep
There are many tools to search for patterns in UNIX; the simplest and most
popular are grep and egrep. Both can search for patterns either in a list of
files, using the syntax:
egrep PATTERN FILE1 FILE2
where PATTERN is a regular expression like those discussed in class, or from
the standard input - useful if you want to use them in a pipeline. For example,
if you want to see how many files with suffix ‘.txt’ you have in the current
directory, you can use the following pipeline, which first lists the contents
of the present directory in ‘long’ format using the command ls (similar to dir
in Windows), then uses egrep to keep only the lines that contain ‘.txt’, and
finally counts how many lines there are in the output of egrep using the
command wc -l:
ls -l | egrep '\.txt' | wc -l
Notice that egrep is used here as a ‘filter’. For many more examples of use
of grep and related commands, check out the very useful online introduction
written by B. Dowling:
http://www-uxsup.csx.cam.ac.uk/courses/Text/chap0.ps
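The two ways of calling egrep described in this section (on a list of files, or on the standard input) can be compared side by side. The sketch below uses two invented files in a scratch directory.

```shell
#!/bin/sh
# Sketch: egrep on named files versus egrep as a filter in a pipeline.
dir=$(mktemp -d)
cd "$dir"
printf 'text mining is fun\nnothing here\n' > a.txt
printf 'data mining too\n' > b.txt

# Form 1: pattern followed by files. With more than one file, each
# matching line is prefixed with the name of the file it came from.
from_files=$(egrep 'mining' a.txt b.txt)
echo "$from_files"

# Form 2: no file argument, so egrep reads its standard input.
from_stdin=$(cat a.txt b.txt | egrep 'mining')
echo "$from_stdin"

cd /
rm -rf "$dir"
```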
Exercise 8: Using egrep, search for lines containing either ’text mining’ or
’data mining’ in the file HLT_data_mining.txt.
7 Perl
We already discussed Perl in the lectures and in the previous lab, so you should
be pretty familiar with it. Perl is used under Linux much as it is used under
Windows, but from Linux its online manual can be consulted from the command
line, using:
man perl
so we won’t say much about it here, except that there are three basic ways of
using it.
If all you want is to execute a Perl command on all lines of a file, you can do
what we did in the examples above. If you want to do something more complex,
you can write a program in Perl, save it in a file (say, with suffix ‘.pl’ or ‘.perl’)
and then invoke it as follows:
perl FILE.pl
where FILE.pl is the file containing the program that you just wrote. For example, the following
highly complex Perl program prints out “Hello, world!”:
# hello.pl
# A complex Perl program
print "Hello, world!\n";
Save it in a file - say, ’hello.pl’ - and then try to execute it as described
above. Alternatively, you can put the following in the first line:
#!/usr/local/bin/perl
This is an instruction to the shell to treat this file as an executable program
to be run by perl (it’s important this is the FIRST line!). Once you have made
this change, allow yourself and others to execute the file, as follows:
chmod ugo+x hello.pl
You can then execute the program by simply typing its name on the command
line, preceded by ./ if the current directory is not on your command search
path, i.e.:
./hello.pl
Exercise 9: Write a perl program that does what the simple ‘tokenizer’ did
in a previous example, and can be executed from the command line. Put this
command in a pipeline in place of the calls to perl in the examples above. Hint:
in order to execute Perl commands on each line of input, include them in the
following while loop:
while (<>) {
.... PUT YOUR COMMANDS HERE
}
8 Shell scripts
We talked earlier on about shell scripts, the Linux equivalent of ‘batch programs’
in Windows. Shell scripts are particularly useful to combine commands like the
ones we have seen in this lab into a single command. Instead of typing the
sequence of commands we tried in the examples above every time, we could
write a shell script that does that:
#!/bin/sh
# extract_tokens.sh
# Extract tokens from the file HLT_data_mining.txt
cat HLT_data_mining.txt | \
tr "A-Z" "a-z" | \
perl -ne '$_ =~ s/\s+/\n/g; print $_;' | \
perl -ne '$_ =~ s/([\.:\;\?])/\n$1/g; print $_;' | \
sort | \
uniq
Notice a few key points about this program. First of all, the first line uses the
same method discussed above when talking about Perl to tell the shell that this
is an executable program - except that the interpreter this time is the Bourne
Shell, /bin/sh:
#!/bin/sh
Secondly, notice that in order to put each command in the pipeline on a separate
line, I added a backslash (‘\’) at the end of every line but the last.
This shell script is not very useful - it can only be used to extract tokens
from the file HLT_data_mining.txt. A more useful script would take the name
of the file from the command line. The following script illustrates the use of the
for command of the Bourne shell to loop over all the arguments, as well as the
use of variables:
#!/bin/sh
# et.sh
# Extract tokens from files in the command line
# Usage: et.sh FILE FILE
for arg
do
cat "$arg" | \
tr "A-Z" "a-z" | \
perl -ne '$_ =~ s/\s+/\n/g; print $_;' | \
perl -ne '$_ =~ s/([\.:\;\?])/\n$1/g; print $_;' | \
sort | \
uniq
done