COMP9313 2016s2 Assignment 2

Problem 1 (5 pts): HBase Bulk Load

Download the file “Comments” from: https://webcms3.cse.unsw.edu.au/COMP9313/16s2/resources/5019. The data format of “Comments” is as below:

- **comments**.xml
  - Id
  - PostId
  - Score
  - Text, e.g.: "@Stu Thompson: Seems possible to me - why not try it?"
  - CreationDate, e.g.: "2008-09-06T08:07:10.730"
  - UserId

Your task is to use the HBase Java API to create a table “comments” with three column families: “postInfo” (containing “PostId”), “commentInfo” (containing “Score”, “Text”, and “CreationDate”), and “userInfo” (containing “UserId”), and then to bulk load data into “comments” from the file “Comments”. Create a class “HBaseBulkLoadComments.java” in package “comp9313.ass2” to finish this task.

Compile and Test

Your Java code will be compiled and packaged as a jar file, and we will use the following command to check the correctness of your solution:

$ $HADOOP_HOME/bin/hadoop jar YOURJAR.jar YOURCLASS input output

Please ensure that the code you submit can be compiled and packaged. Any solution that has compilation errors will receive no more than 2 points for this problem.

Problem 2 (5 pts): HBase and MapReduce

Read the input data from the table “comments” in HBase, and calculate the number of comments per UserId. Your MapReduce job should write the result to another HBase table, “user_comment_stats”, which has a single column family “stats” containing the column “count”. Create the table using the HBase Java API. Write your code in “ReadHBaseComments.java” in package “comp9313.ass2”.

Compile and Test

Your Java code will be compiled and packaged as a jar file, and we will use the following command to check the correctness of your solution:

$ $HADOOP_HOME/bin/hadoop jar YOURJAR.jar YOURCLASS

Please ensure that the code you submit can be compiled and packaged. Any solution that has compilation errors will receive no more than 2 points for this problem.
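
The column layout from Problem 1 and the per-user count from Problem 2 can be illustrated on in-memory data before any HBase code is written. The following is only a minimal plain-Java sketch (Java 9 or later), not an HBase or MapReduce implementation. It assumes that each record in “Comments” is a `<row ... />` element with `name="value"` attributes and that “Id” serves as the row key; neither is stated in this handout. The class name `CommentLayoutSketch`, its methods, and all sample values other than the Text and CreationDate examples above are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java illustration only: no HBase or MapReduce classes are used here.
class CommentLayoutSketch {

    // Assumption: records look like <row Id="1" PostId="10" ... />.
    // XML entities inside values (e.g. &quot;) are left as-is.
    private static final Pattern ATTRIBUTE = Pattern.compile("(\\w+)=\"([^\"]*)\"");

    // Column family for each field, as listed in Problem 1.
    private static final Map<String, String> FAMILY_OF = Map.of(
            "PostId", "postInfo",
            "Score", "commentInfo",
            "Text", "commentInfo",
            "CreationDate", "commentInfo",
            "UserId", "userInfo");

    // Returns "family:qualifier" -> value for one record.
    static Map<String, String> toCells(String row) {
        Map<String, String> cells = new LinkedHashMap<>();
        Matcher m = ATTRIBUTE.matcher(row);
        while (m.find()) {
            String name = m.group(1);
            String value = m.group(2);
            if (name.equals("Id")) {
                cells.put("rowkey", value); // assumption: Id is used as the row key
            } else if (FAMILY_OF.containsKey(name)) {
                cells.put(FAMILY_OF.get(name) + ":" + name, value);
            }
        }
        return cells;
    }

    // Counts comments per UserId; records without a UserId are skipped.
    static Map<String, Long> countPerUser(List<String> rows) {
        Map<String, Long> counts = new TreeMap<>();
        for (String row : rows) {
            String userId = toCells(row).get("userInfo:UserId");
            if (userId != null) {
                counts.merge(userId, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample records; only Text and CreationDate of the first come from the handout.
        List<String> rows = List.of(
                "<row Id=\"1\" PostId=\"10\" Score=\"3\" Text=\"@Stu Thompson: Seems possible to me - why not try it?\""
                        + " CreationDate=\"2008-09-06T08:07:10.730\" UserId=\"7\" />",
                "<row Id=\"2\" PostId=\"11\" Score=\"0\" Text=\"second\" CreationDate=\"2008-09-06T09:00:00.000\" UserId=\"7\" />",
                "<row Id=\"3\" PostId=\"11\" Score=\"1\" Text=\"third\" CreationDate=\"2008-09-06T10:00:00.000\" UserId=\"9\" />");
        System.out.println(toCells(rows.get(0)));
        System.out.println(countPerUser(rows));
    }
}
```

In a real solution, the “family:qualifier” keys would correspond to the cells written for each row, and the counting loop would correspond to what the mapper and reducer do together; how those pieces are expressed with the HBase and Hadoop APIs is left to the assignment.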
Problem 3 (5 pts): Hive

Download the files “Votes.fmt” and “Comments.fmt” from: https://webcms3.cse.unsw.edu.au/COMP9313/16s2/resources/4732. The two files are converted from “Votes” and “Comments”; the fields are separated by ‘ctrl+A’ and the lines are separated by ‘\n’.

The data format of “Votes.fmt” is as below:

- Id
- PostId
- VoteTypeId
  - ` 1`: AcceptedByOriginator
  - ` 2`: UpMod
  - ` 3`: DownMod
  - ` 4`: Offensive
  - ` 5`: Favorite (if VoteTypeId = 5, UserId will be populated)
  - ` 6`: Close
  - ` 7`: Reopen
  - ` 8`: BountyStart
  - ` 9`: BountyClose
  - `10`: Deletion
  - `11`: Undeletion
  - `12`: Spam
  - `13`: InformModerator
- UserId (only for VoteTypeId 5)
- CreationDate

The data format of “Comments.fmt” is as below:

- Id
- PostId
- Score
- UserId
- CreationDate, e.g.: "2008-09-06T08:07:10.730"
- Text, e.g.: "@Stu Thompson: Seems possible to me - why not try it?"

Please put the two files in the home folder of HDFS, i.e., /user/comp9313

Your tasks include:

1. (1 pt) Create a table for Votes and another table for Comments, and load data into the two tables from the two files provided.
2. (2 pts) Write a “Select … from …” query to compute the number of comments generated by each user (the result contains two columns: UserId and number of comments).
3. (2 pts) Write a “Select … from …” query to find the Ids of all posts that have been favoured by more than five users (the result contains only one column: PostId).

You should put everything in a Hive script “ass2.sql”, and we will use the following command to run the script and to check the results.

$ $HIVE_HOME/bin/hive -f ass2.sql > hive.result

Problem 4 (5 pts): Pig

Your tasks include:

1. (1 pt) Load data into schemas from the files converted from Votes and Comments. Hint: use PigStorage('\u0001') to delimit the fields when loading data.
2. (2 pts) Compute the number of comments generated by each user (the result contains two fields: UserId and number of comments).
3. (2 pts) Find the Ids of all posts that have been favoured by more than five users (the result contains only one field: PostId). Hint: you will need to use “group by”, “count”, and the “filter” command; see http://pig.apache.org/docs/r0.16.0/basic.html#filter.

Use the “dump” command to print out the result of a query. You should put everything in a Pig script “ass2.pig”, and we will use the following command to run the script and to check the results.

$ $PIG_HOME/bin/pig ass2.pig > pig.result

Documentation and code readability

Your source code will be inspected and marked based on readability and ease of understanding. The documentation (comments in the code) in your source code is also important.

Submission:

Deadline: Friday 23rd September 09:59:59

Log in to any CSE server (williams or wagner), and use the give command below to submit your solutions:

$ give cs9313 assignment2 HBaseBulkLoadComments.java ReadHBaseComments.java ass2.sql ass2.pig

Or you can submit through: https://cgi.cse.unsw.edu.au/~give/Student/give.php

If you submit your assignment more than once, the last submission will replace the previous one. To prove successful submission, please take a screenshot as the assignment submission instructions show, and keep it yourself.

Late submission penalty

10% reduction of your marks for the 1st day, 30% reduction/day for the following days.

Plagiarism:

The work you submit must be your own work. Submission of work partially or completely derived from any other person or jointly written with any other person is not permitted. The penalties for such an offence may include negative marks, automatic failure of the course and possibly other academic discipline. Assignment submissions will be examined manually. Relevant scholarship authorities will be informed if students holding scholarships are involved in an incident of plagiarism or other misconduct. Do not provide or show your assignment work to any other person, apart from the teaching staff of this subject.
If you knowingly provide or show your assignment work to another person for any reason, and work derived from it is submitted, you may be penalized, even if the work was submitted without your knowledge or consent.