Alex Chengelis
2632220
CIS-612
LAB 4_1
1. Create a new virtual machine with Ubuntu. I am using VMware Player to do this.
Just keep filling in the info you want and let it install.
2. Download the appropriate Java and Hadoop files.
I am using Hadoop 2.7.3 since it is the latest stable release. You can either use the website to
download it or use curl (note that the closer.cgi link points to the mirror-selection page, so you may
need to follow it to the actual mirror URL):
curl -O http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
For Java, go to this page http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
and download the Linux tar.gz.
Place both the Hadoop and Java archives in the Downloads folder.
3. Configure the SSH server.
sudo apt-get update
sudo apt-get install openssh-server
4. Configure the password-less ssh login.
cd
ssh-keygen -t rsa -P ""
cat ./.ssh/id_rsa.pub >> ./.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
##THEN
sudo service ssh restart
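To confirm the password-less login works, ssh into the local machine; it should not ask for a password (accept the host key the first time):
ssh localhost
exit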
5. Standalone Mode Setup (you start with this and add more and more functionality). Start by
extracting the downloaded files.
cd Downloads
tar xzvf hadoop-2.7.3.tar.gz
The tar command prints every file it extracts, so the terminal fills up quickly.
Verify that Hadoop has been extracted.
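For example, listing the directory should now show the extracted hadoop-2.7.3 folder next to the tarball:
ls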
6. Create soft links
cd
ln -s ./Downloads/hadoop-2.7.3/ ./hadoop (use lowercase hadoop so the link matches the paths used below)
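Step 8 below points JAVA_HOME at /home/alex/jdk, so extract the Java tarball and link it the same way. The exact file and folder names depend on which JDK 8 update you downloaded, so the ones here are placeholders:
cd ~/Downloads
tar xzvf jdk-8u*-linux-x64.tar.gz
cd
ln -s ./Downloads/jdk1.8.0_* ./jdk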
7. Configure .bashrc
cd
vi ./.bashrc
export HADOOP_HOME=/home/alex/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
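To apply the new variables in the current shell (instead of logging out and back in), the file can be re-sourced and checked:
source ./.bashrc
echo $HADOOP_HOME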
8. Configure Hadoop's hadoop-env.sh file
cd
vi ./hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/home/alex/jdk
9. Run a Hadoop job on the standalone setup. First exit and restart the terminal so the new PATH
takes effect, then type the hadoop command with no arguments.
If the usage message prints, our installation is good so far.
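Printing the version is another quick check:
hadoop version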
Run a Hadoop job
Create a testhadoop directory
Create input directory inside testhadoop
Create some input files (the .xml files)
Run MapReduce example job
View the output directory using the cat command
cd
mkdir testhadoop
cd testhadoop
mkdir input
cp ~/hadoop/etc/hadoop/*.xml input
hadoop jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'
cat output/*
You'll see the job's progress output in the terminal.
Finally, check the output; if the matched dfs properties print, standalone mode is working.
10. Now to transform this into Pseudo-Distributed Mode without YARN (to start).
a. Configure core-site.xml and hdfs-site.xml
cd
vi ./hadoop/etc/hadoop/core-site.xml
## adding these lines inside the <configuration> element ##
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://10.1.37.12:9000</value>
</property>
vi ./hadoop/etc/hadoop/hdfs-site.xml
## adding these lines inside the <configuration> element ##
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
Replace the IP (10.1.37.12) with your own machine's address, which you can find with the ifconfig command.
11. Format the namenode
hdfs namenode -format
12. Start/Stop Hadoop cluster
$ start-dfs.sh
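To check that the daemons came up, and to stop HDFS later when you are done, the jps tool (part of the JDK) and the companion stop script can be used:
$ jps (should list NameNode, DataNode and SecondaryNameNode)
$ stop-dfs.sh (stops the HDFS daemons)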
13. Create a user on the HDFS system
$ hdfs dfs -mkdir /user
$hdfs dfs -mkdir /user/alex
Put some input data into HDFS (the Hadoop config directory will serve as our input)
$ hdfs dfs -put ~/hadoop/etc/hadoop input
14. Run a Hadoop job now
$ hadoop jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'
Check the output
$hdfs dfs -cat output/*
15. Since everything is working so far we are going to extend our Pseudo-Distributed Mode with
YARN Setup.
a. Configure mapred-site.xml and yarn-site.xml
$cd
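In Hadoop 2.7.x only a template for mapred-site.xml ships by default, so copy it into place first:
$ cp ./hadoop/etc/hadoop/mapred-site.xml.template ./hadoop/etc/hadoop/mapred-site.xml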
$ nano ./hadoop/etc/hadoop/mapred-site.xml
Add the following lines inside <configuration>
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
$ nano ./hadoop/etc/hadoop/yarn-site.xml
Add the following lines inside <configuration>
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
16. Start YARN cluster
$start-yarn.sh
Go to http://localhost:8088 to make sure it is working
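You can also confirm with jps; after start-yarn.sh a ResourceManager and a NodeManager process should be listed alongside the HDFS daemons:
$ jps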
17. Let’s test.
$cd
$cd testhadoop
$ hdfs dfs -rm -r output (the previous job's output now lives in HDFS and must be removed, or the new job will refuse to run)
$ hadoop jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'
$hdfs dfs -cat output/*
The output will look the same as the previous run.
18. Time to run the word count.
a. Let's get a file from the Gutenberg project: http://www.gutenberg.org/files/76/76-0.txt
It's a copy of Huckleberry Finn.
b. Use wget to get it.
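For example, with the URL above:
$ wget http://www.gutenberg.org/files/76/76-0.txt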
c. Create a directory for our wordcount, and the input directory
$mkdir wordcount && cd wordcount
$mkdir input
d. Move our test file into the input directory
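Assuming wget left the book in the home directory, something like this works (adjust the path if you downloaded it elsewhere):
$ mv ~/76-0.txt ~/wordcount/input/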
e. Navigate back to the wordcount directory
$ cd ~/wordcount
f. Remove the output directory currently in HDFS
$ hdfs dfs -rm -r /user/alex/output
g. Now remove and copy over our current input directory.
$ hdfs dfs -rm -r /user/alex/input
$ hdfs dfs -put input /user/alex/input
$ hdfs dfs -ls /user/alex/input (just to check to make sure it is there)
h. Finally it is time to run the wordcount program.
$ hadoop jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount input output
i. Check the output
$ hdfs dfs -cat output/*
j. Copy over the output to the “local” machine.
$ hdfs dfs -get /user/alex/output/ .
$ ls (to verify)
$ ls output (to verify)
k. Open it up in your favorite editor. Have fun looking through the results.
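The counts end up in the reducer output file inside the copied directory (typically part-r-00000 for a single-reducer job like this); for a quick peek:
$ head output/part-r-00000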
Guide was taken from:
https://medium.com/@luck/installing-hadoop-2-7-2-on-ubuntu-16-04-3a34837ad2db