School of Computing and Information Technology
Session: Spring 2023
University of Wollongong
Lecturer: Janusz R. Getta

ISIT912 Big Data Management
Assignment 1
Published on 24 July 2023

Scope

This assignment includes the tasks related to the implementation of an HDFS application and the implementation of MapReduce applications.

This assignment is due on Saturday, 19 August 2023, 7:00pm (sharp).

This assignment is worth 10% of the total evaluation in the subject.

The assignment consists of 4 tasks and the specification of each task starts on a new page.

Only electronic submission through Moodle at:
https://moodle.uowplatform.edu.au/login/index.php
will be accepted. The submission procedure is explained at the end of the Assignment 1 specification.

The policy regarding late submissions is included in the subject outline.

Only one submission of Assignment 1 is allowed and only one submission per student is accepted.

A submission marked by Moodle as "late" is always treated as a late submission no matter how many seconds it is late.

A submission that contains an incorrect file attached is treated as a correct submission with all consequences coming from the evaluation of the file attached.

All files left on Moodle in a state "Draft (not submitted)" will not be evaluated.

A submission of compressed files (zipped, gzipped, rared, tared, 7-zipped, lhzed, … etc) is not allowed. The compressed files will not be evaluated.

An implementation that does not compile or run because of one or more syntax and/or run-time errors scores no marks.

The first assignment is an individual assignment and it is expected that all its tasks will be solved individually without any cooperation with the other students. However, it is allowed to declare in the submission comments that a particular component or task of this assignment has been implemented in cooperation with another student. In such a case the evaluation of a task or component may be shared with another student.
In all other cases plagiarism will result in a FAIL grade being recorded for the entire assignment.

If you have any doubts, questions, etc. please consult your lecturer or tutor during laboratory/tutorial classes or over e-mail.

Task 1 (1 mark)
Merging files in HDFS

Read and analyse the HDFS applications provided in the files FileSystemCat.java and FileSystemPut.java, available in the folder Resources attached to the specification of the laboratory class for Week 2 on Moodle.

Use the applications FileSystemCat.java and FileSystemPut.java to implement in Java an HDFS application that merges two files located in HDFS into one file also located in HDFS.

The application must have the following parameters.
(1) A path to, and a name of, the first input file in HDFS.
(2) A path to, and a name of, the second input file in HDFS.
(3) A path to, and a new name of, an output file to be created in HDFS. The file must contain the contents of the first input file followed by the contents of the second input file.

Implement the application and save its source code in a file solution1.java.

Upload two files to HDFS. The contents, the names, and the locations of the files in HDFS are up to you.

When ready, compile, create a jar file, and process your application. Display the results created by the application. Use Hadoop to provide evidence that the two files uploaded into HDFS have been successfully merged into one file in HDFS.

Deliverables
A file solution1.txt that contains a listing of the source code of your application, a report from compilation, creation of the jar file, uploading to HDFS of two small files for testing, a listing of both files in HDFS, processing of the application, and evidence that the two files uploaded into HDFS have been successfully merged into one file in HDFS. The file solution1.txt must be created through Copy/Paste of the contents of the Terminal window into a file solution1.txt. No screen dumps are allowed and no screen dumps will be evaluated.
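The expected contents of the output file in Task 1 can be illustrated locally, without Hadoop. The following is only a minimal sketch in plain Java: the class MergeSketch and its methods are hypothetical names that are not part of the laboratory resources, and in-memory streams stand in for the HDFS streams. In an HDFS application the streams would presumably be obtained through the Hadoop FileSystem class (open for the two input files and create for the output file).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class: illustrates the merge order only, no HDFS involved.
final class MergeSketch {

    // Copies all remaining bytes of 'in' to 'out' through a fixed-size buffer.
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }

    // Writes the contents of the first input followed by the contents of the second input.
    static void merge(InputStream first, InputStream second, OutputStream out) throws IOException {
        copy(first, out);
        copy(second, out);
    }

    // Convenience wrapper over in-memory strings, used only for the demonstration.
    static String mergeStrings(String first, String second) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            merge(new ByteArrayInputStream(first.getBytes(StandardCharsets.UTF_8)),
                  new ByteArrayInputStream(second.getBytes(StandardCharsets.UTF_8)),
                  out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Prints the first text followed by the second text.
        System.out.print(mergeStrings("first file\n", "second file\n"));
    }
}
```

The sketch copies the first input completely before it touches the second one, which is the order required by parameter (3).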
Task 2 (2 marks)
Implementation of a simple MapReduce application

Read and analyse the MapReduce application provided in a file Filter.java, available in the folder Resources attached to the specification of the laboratory class for Week 3 on Moodle. The application has the functionality equivalent to the functionality of the following SQL statement:

    SELECT key, value
    FROM sequence-of-key-value-pairs
    WHERE value > given-value;

An objective of this task is to use the Java code provided in a file Filter.java to implement a MapReduce application Solution2 that has the functionality equivalent to the functionality of the following SQL statement:

    SELECT item-name, price-per-unit * total-units
    FROM sales.txt
    WHERE price-per-unit * total-units > given-value;

A single line in an input data set sales.txt must have the following format.

    item-name price-per-unit total-units

For example:

    bolt 2 25
    washer 3 8
    screw 7 20
    nail 5 10
    screw 7 2
    bolt 2 20
    bolt 2 30
    drill 10 5
    washer 3 8

The contents of a file sales.txt are up to you as long as they are consistent with the format explained above.

A value of given-value must be passed through a parameter of your program.

Save your solution in a file Solution2.java.

When ready, list Solution2.java in the Terminal window, compile, create a jar file, and process the application. List the input data set sales.txt in the Terminal window and the results created by the application. When completed, Copy and Paste all messages from the Terminal screen into a file solution2.txt.

Deliverables
A file solution2.txt with a listing of the source code of your application, a report from compilation, creation of the jar file, processing of the application, a listing of the file sales.txt, and a listing of the results of processing of the MapReduce application Solution2.java. The file solution2.txt must be created through Copy/Paste of the contents of the Terminal window into a file solution2.txt. No screen dumps are allowed and no screen dumps will be evaluated.
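The semantics of the SQL statement in Task 2 can be checked on a small data set before any MapReduce code is written. The following is only a sketch in plain Java, not a MapReduce application: the class SalesFilterSketch and the method filter are hypothetical names, and the tab-separated output format is an assumption. The per-line logic corresponds to what a Mapper could do for each line of sales.txt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class: a local illustration of the filter semantics, no Hadoop involved.
final class SalesFilterSketch {

    // Returns "item-name<TAB>total" for every line where
    // price-per-unit * total-units is strictly greater than givenValue.
    static List<String> filter(List<String> lines, long givenValue) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length != 3) {
                continue; // skip empty or malformed lines
            }
            long total = Long.parseLong(fields[1]) * Long.parseLong(fields[2]);
            if (total > givenValue) {
                result.add(fields[0] + "\t" + total);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> sales = Arrays.asList(
            "bolt 2 25", "washer 3 8", "screw 7 20", "nail 5 10", "screw 7 2",
            "bolt 2 20", "bolt 2 30", "drill 10 5", "washer 3 8");
        // With given-value = 50 only "screw 140" and "bolt 60" are printed,
        // because the comparison is strict and a total of exactly 50 is rejected.
        for (String row : filter(sales, 50)) {
            System.out.println(row);
        }
    }
}
```

For the example data set and a given-value of 50, the rows with a total of exactly 50 (bolt 2 25, nail 5 10, drill 10 5) are excluded, since the WHERE clause uses a strict comparison.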
Task 3 (3 marks)
Implementation of a simple MapReduce application

Read and analyse the MapReduce application provided in a file MinMax.java, available in the folder Resources attached to the specification of the laboratory class for Week 3 on Moodle. The application has the functionality equivalent to the functionality of the following SQL statement.

    SELECT key, MIN(value), MAX(value)
    FROM sequence-of-key-value-pairs
    GROUP BY key;

An objective of this task is to use the Java code provided in a file MinMax.java to implement a MapReduce application Solution3 that has the functionality equivalent to the functionality of the following SQL statement.

    SELECT item-name, SUM(price-per-unit * total-units)
    FROM sales.txt
    GROUP BY item-name;

A single line in an input data set sales.txt must have the following format.

    item-name price-per-unit total-units

For example:

    bolt 2 25
    washer 3 8
    screw 7 20
    nail 5 10
    screw 7 2
    bolt 2 20
    bolt 2 30
    drill 10 5
    washer 3 8

The contents of a file sales.txt are up to you as long as they are consistent with the format explained above.

Save your solution in a file Solution3.java.

When ready, list Solution3.java in the Terminal window, compile, create a jar file, and process the application. List the input data set sales.txt in the Terminal window and the results created by the application. When completed, Copy and Paste all messages from the Terminal screen into a file solution3.txt.

Deliverables
A file solution3.txt with a listing of the source code of your application, a report from compilation, creation of the jar file, processing of the application, a listing of the file sales.txt, and a listing of the results of processing of the MapReduce application Solution3.java. The file solution3.txt must be created through Copy/Paste of the contents of the Terminal window into a file solution3.txt. No screen dumps are allowed and no screen dumps will be evaluated.
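The results expected from the GROUP BY statement in Task 3 can be computed locally for a small data set. The following is only a sketch in plain Java, not a MapReduce application: the class SalesSumSketch and the method totals are hypothetical names. The multiplication plays the role of a map step that emits (item-name, amount), and the summation per key plays the role of a reduce step. A sorted map is used on the assumption that the results are listed in key order.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class: a local illustration of SUM ... GROUP BY, no Hadoop involved.
final class SalesSumSketch {

    // Returns the sum of price-per-unit * total-units for each item-name, sorted by item-name.
    static Map<String, Long> totals(List<String> lines) {
        Map<String, Long> sums = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length != 3) {
                continue; // skip empty or malformed lines
            }
            // "Map" step: one (item-name, amount) pair per input line.
            long amount = Long.parseLong(fields[1]) * Long.parseLong(fields[2]);
            // "Reduce" step: add the amount to the running sum of its key.
            sums.merge(fields[0], amount, Long::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<String> sales = Arrays.asList(
            "bolt 2 25", "washer 3 8", "screw 7 20", "nail 5 10", "screw 7 2",
            "bolt 2 20", "bolt 2 30", "drill 10 5", "washer 3 8");
        // Prints: bolt 150, drill 50, nail 50, screw 154, washer 48 (one pair per line).
        for (Map.Entry<String, Long> entry : totals(sales).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

For the example data set, bolt occurs three times (50 + 40 + 60 = 150), screw twice (140 + 14 = 154), and washer twice (24 + 24 = 48), which gives a quick reference for checking the output of Solution3.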
Task 4 (4 marks)
Describing MapReduce application

The files orders.txt and details.txt contain information about the orders submitted by the customers and the details of each order.

A single line in a file orders.txt has the following structure:

    order-number date customer-id

For example:

    0000001 12-JUN-2022 CUST-A02
    0000002 12-JUN-2022 CUST-A01
    0000003 13-JUN-2022 CUST-A02
    0000004 15-JUL-2022 CUST-F01
    0000005 16-JUL-2022 CUST-A01

A single line in a file details.txt has the following structure:

    order-number item price

For example:

    0000001 bolt 15
    0000001 screw 5
    0000002 screw 5
    0000002 bolt 10
    0000002 bigbolt 50

An objective of this task is to describe a MapReduce application that computes the total amount of money spent by all customers in a given year on a given item. For example, the total amount of money spent on bolts in 2022, or the total amount of money spent on screws in 2020.

A description of the application must include the following details:
- preparation of data for processing,
- the parameters of the application,
- a detailed description of Driver,
- a detailed description of Mapper,
- a detailed description of Reducer,
- accessing the results.

The detailed descriptions of Driver, Mapper and Reducer must contain all information needed for the implementations of Driver, Mapper and Reducer. You can use pseudocode whenever it is necessary.

Save your description of the MapReduce application that computes the total amount of money spent by all customers in a given year on a given item in a file solution4.pdf.

Deliverables
A file solution4.pdf with a detailed description of the MapReduce application that computes the total amount of money spent by all customers in a given year on a given item.

Submission of Assignment 1

Note that you have only one submission. So make absolutely sure that you submit the correct files with the correct contents. No other submission is possible!
Submit the files solution1.txt, solution2.txt, solution3.txt, and solution4.pdf through Moodle in the following way:
(1) Access Moodle at http://moodle.uowplatform.edu.au/
(2) To log in, use the Login link located in the upper right corner of the Web page or in the middle of the bottom of the Web page.
(3) When logged in, select the site ISIT312/912 (S223) Big Data Management.
(4) Scroll down to the section Assessment items (Assignments).
(5) Click on the link "In this place you can submit the outcomes of your work on the tasks included in Assignment 1".
(6) Click on the button Add Submission.
(7) Move the file solution1.txt into the area "You can drag and drop files here to add them." You can also use the link Add…
(8) Repeat step (7) for the remaining files solution2.txt, solution3.txt, and solution4.pdf.
(9) Click on the button Save changes.
(10) Click on the checkbox with the attached text "By checking this box, I confirm that this submission is my own work, …" in order to confirm the authorship of your submission.
(11) Click on the button Continue.
(12) Check that Submission status is Submitted for grading.

End of specification