Extreme computing lab exercises
Session two
Michail Basios (m.basios@sms.ed.ac.uk)
Stratis Viglas (sviglas@inf.ed.ac.uk)
1 Running a Map-Reduce program in Java
In this section, we are going to see how to write, compile, and execute a Java program on Hadoop.
First, create a local file called WordCount.java. You can obtain it (using the -copyToLocal command)
from /user/s1250553/source/WordCount.java.
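For example, assuming you want the file in your home directory:
hadoop fs -copyToLocal /user/s1250553/source/WordCount.java ~/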
Create a folder where the compiled classes are going to be saved by using the following command:
mkdir generatedClasses
Next, compile the WordCount.java file by typing the following (the compiled classes are saved inside
the generatedClasses/ folder):
javac -classpath /opt/hadoop/hadoop-0.20.2/hadoop-0.20.2-core.jar -d \
    generatedClasses/ WordCount.java
Create an executable .jar file from the generatedClasses folder as follows:
jar -cvf wordcount.jar -C generatedClasses/ .
If you are not familiar with the jar command, visit this link:
http://docs.oracle.com/javase/tutorial/deployment/jar/build.html
Now we can finally run the executable wordcount.jar file on Hadoop by using:
hadoop jar wordcount.jar exc.lab2.WordCount /user/s1250553/data/example1.txt \
    /user/Sxxxxxx/javaExampleOutput
The full contents of WordCount.java are listed below:

package exc.lab2;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    // Here you can set the number of mappers and reducers
    conf.setNumMapTasks(10);    // 10 mappers
    conf.setNumReduceTasks(10); // 10 reducers

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}
Task: Run the same job again as before, but use 5 reducers instead of 10. Observe the output folder
and the number of files produced.
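One way to do this (a sketch; the output directory name below is an arbitrary choice, so use any
directory that does not yet exist) is to change conf.setNumReduceTasks(10) to
conf.setNumReduceTasks(5) in WordCount.java, recompile, rebuild the jar, and run the job again:
javac -classpath /opt/hadoop/hadoop-0.20.2/hadoop-0.20.2-core.jar -d \
    generatedClasses/ WordCount.java
jar -cvf wordcount.jar -C generatedClasses/ .
hadoop jar wordcount.jar exc.lab2.WordCount /user/s1250553/data/example1.txt \
    /user/sXXXXXXX/javaExampleOutput5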
Task: Run the word-count example from the first lab (Section 3.2.1) by using only two reducers.
You need to use "-D mapred.map.tasks=10" and "-D mapred.reduce.tasks=2" on the command line;
a possible invocation is sketched below.
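The command below is only a sketch: the mapper.py and reducer.py file names and the input/output
paths are assumptions based on the first lab, so substitute whatever you actually used there. Note
that generic options such as -D normally have to come before the streaming-specific options:
hadoop jar /opt/hadoop/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-streaming.jar \
    -D mapred.map.tasks=10 -D mapred.reduce.tasks=2 \
    -input /user/sXXXXXXX/data/example1.txt \
    -output /user/sXXXXXXX/data/wcTwoReducers \
    -mapper mapper.py -file mapper.py \
    -reducer reducer.py -file reducer.py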
2 Managing jobs via web interface and the command line
2.1 Using the command line
Listing jobs
Running the following command will list all (completed and running) jobs on the cluster:
hadoop job -list all
This will produce a list similar to that shown in table 1. Job IDs (first column) are very important
as they uniquely identify a job. This will be useful in the next section.
Job status
To get the status of a particular job, we can use
hadoop job -status <job_id>
Table 1: Job listing.
14 jobs submitted
States are:
Running : 1 Succeeded : 2 Failed : 3 Prep : 4
JobId State StartTime UserName Priority SchedulingInfo
job_201009151640_0001 2 1285081662441 s0894589 NORMAL NA
job_201009151640_0002 2 1285081976156 s0894589 NORMAL NA
job_201009151640_0003 2 1285082017461 s0894589 NORMAL NA
job_201009151640_0004 2 1285082159071 s0894589 NORMAL NA
job_201009151640_0005 2 1285082189917 s0894589 NORMAL NA
job_201009151640_0006 2 1285082275965 s0894589 NORMAL NA
job_201009151640_0009 2 1285083343068 s0894589 NORMAL NA
job_201009151640_0010 3 1285106676902 s0894589 NORMAL NA
job_201009151640_0012 3 1285106959588 s0894589 NORMAL NA
job_201009151640_0013 3 1285107094387 s0894589 NORMAL NA
job_201009151640_0014 2 1285107283359 s0894589 NORMAL NA
job_201009151640_0015 2 1285109169514 s0894589 NORMAL NA
job_201009151640_0016 2 1285109271188 s0894589 NORMAL NA
job_201009151640_0018 1 1285148710573 s0894589 NORMAL NA
where <job_id> is the ID of a job, found in the first column of Table 1. The status will show the
percentage of completion of the mappers and reducers, along with a tracking URL and the location of
a file with all the information about the job. We will soon see that the web interface provides much
more detail.
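For example, using one of the job IDs shown in Table 1:
hadoop job -status job_201009151640_0018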
Killing jobs
To kill a job, run the following command:
hadoop job -kill <job_id>
where <job_id> is the ID of the job you want to kill.
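For example, to kill the job listed as running in Table 1:
hadoop job -kill job_201009151640_0018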
Task: To try this, copy the file /user/s1250553/source/sleeper.py somewhere on your local filesystem
(NOT HDFS) and run it as a streaming job (specifying sleeper.py as the mapper, with no reducers;
remember to use the -file option). You can set the input to be anything you like, because the mapper
does not actually do anything (except wait for you to kill it), and remember to set the output to a
directory on HDFS that does not exist. It might be helpful to give your job a familiar name, as this
will reduce the time needed to find its ID later; see the sketch after this paragraph. Open another
terminal, log into the Hadoop cluster, and list all the jobs. Find the ID of the job you just started
(use the name you gave it for faster searching). Find out the status of your job, and finally kill it.
After you run the kill command, look at the terminal where you originally started the job and watch
the job die.
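A possible way to launch the sleeper job is sketched below; the job name and the input and output
paths are placeholders, so pick your own:
hadoop fs -copyToLocal /user/s1250553/source/sleeper.py ~/
hadoop jar /opt/hadoop/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-streaming.jar \
    -D mapred.job.name="sXXXXXXX-sleeper" \
    -D mapred.reduce.tasks=0 \
    -input /user/s1250553/data/example1.txt \
    -output /user/sXXXXXXX/sleeperOutput \
    -mapper sleeper.py -file sleeper.py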
2.2 Using the web interface
Task: All of the previous actions can also be performed using the web interface. Opening a browser
and going to
http://hcrc1425n01.inf.ed.ac.uk:50030/jobtracker.jsp
will show the web interface of the jobtracker. An example of what the interface looks like is shown
in figure 1. We can see a list of running, completed, and failed jobs. Clicking on a job ID is similar to
requesting its status from the command line, but it shows much more detail, including the number
of bytes read from and written to the filesystem, the number of failed/killed task attempts, and nice
graphs of job completion.

Figure 1: Web interface of the job tracker showing a list of running, completed, and failed jobs.
3 Using side information
It is often useful to package other, external files together with the job. For example, if your application
uses a dictionary or a file that stores some configuration settings, one would want these files to be
available to the program just as they would be in a non-MapReduce setting. This can be achieved using
the -file option that we already used to package the source files. The following program takes a
dictionary and counts only those words that appear in the dictionary, ignoring everything else. First
copy the /user/s1250553/source/mapper-dict.py and /user/s1250553/source/reducer.py to a
LOCAL directory:
hadoop fs -get /user/s1250553/source/mapper-dict.py ~/
hadoop fs -get /user/s1250553/source/reducer.py ~/
Now copy the dictionary to a LOCAL directory (-get and -copyToLocal provide the same functionality):
hadoop fs -copyToLocal /user/s1250553/data/dict.eng ~/
Task: Run the program by typing the following (assuming you still have example3.txt in your input
directory and that your output directory does not exist):
hadoop jar /opt/hadoop/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-streaming.jar \
    -input /user/sXXXXXXX/data/example3.txt \
    -output /user/sXXXXXXX/data/output \
    -mapper mapper-dict.py \
    -file mapper-dict.py \
    -reducer reducer.py \
    -file reducer.py \
    -file dict.eng
The program will use dict.eng as the dictionary and count only those words that appear in that
list. Look at the source of mapper-dict.py to see how to open the dictionary file.
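To get a feel for the general pattern (the real mapper-dict.py is the authoritative version; everything
in the sketch below, including the output format, is illustrative), a streaming mapper can open the
packaged file by its plain name, because -file places a copy in the task's working directory:

#!/usr/bin/env python
# Illustrative sketch only -- see mapper-dict.py for the real implementation.
import sys

# dict.eng was shipped with the job via -file, so a copy sits in the
# task's current working directory under its plain file name.
dictionary = set(line.strip() for line in open("dict.eng"))

for line in sys.stdin:
    for word in line.strip().split():
        # Emit a count only for words that appear in the dictionary.
        if word in dictionary:
            sys.stdout.write(word + "\t1\n")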