CS540S22 CS540 Summer 2022 Epic Section Prev: P1 Next: P3 Back to week 2 page: Link P2 Programming Problem Instruction • Enter your ID (the wisc email ID without @wisc.edu) here: and click Confirm (or hit the "Enter" key) 1,2,3,4,5,6,7,8,9,10,11 p2 • The official deadline is June 27, late submissions within a week will be accepted without penalty, but please submit a regrade request form: Link. • The same ID should generate the same set of questions. Your answers are not saved when you close the browser. You could either copy and paste or load your program outputs into the text boxes for individual questions or print all your outputs to a single text file and load it using the button at the bottom of the page. • Please do not refresh the page: your answers will not be saved. • You should implement the algorithms using the mathematical formulas from the slides. You can use packages and libraries to preprocess and read the data and format the outputs. It is not recommended that you use machine learning packages or libraries, but you will not lose points for doing so. • Please report any bugs on Piazza. Warning: please enter your ID before you start! 5:tree_full;7:label_full;8:tree_pruned;9:label_pruned • (Introduction) In this programming homework, you will build decision stumps and a decision tree to diagnose whether a patient has some disease based on their symptoms and medical test results. Unfortunately, we do not have a nice dataset on COVID-19, so we will use the Wisconsin Breast Cancer dataset. Your models will read in integer-valued patient data and output a diagnosis of whether the patient has breast cancer. • (Part 1) Go to the website: Dataset, click on "Data Folder" and download "breast-cancer-wisconsin.data". Read the dataset description to figure out which variables are features and which variable is the label. There are lines containing "?", you could either remove them or replace them by a constant when parsing them into the feature matrix. The list of variables is copied below: 1. Sample code number: id number 2. Clump Thickness: 1 - 10 3. Uniformity of Cell Size: 1 - 10 4. Uniformity of Cell Shape: 1 - 10 5. Marginal Adhesion: 1 - 10 6. Single Epithelial Cell Size: 1 - 10 7. Bare Nuclei: 1 - 10 8. Bland Chromatin: 1 - 10 9. Normal Nucleoli: 1 - 10 10. Mitoses: 1 - 10 11. Class: (2 for benign, 4 for malignant) • (Part 1) Train a binary decision stump (decision tree with depth 1) using the following feature: ? (indexed according to the above list). Report the counts and the information gain. Since the features are integer-valued, you could either try all binary splits and find the one with the maximum information gain, or you could use the real-valued decision tree learning algorithm discussed in the lecture. • (Part 2) Train a binary decision tree using the following features: ? (indexed according to the same list). Report the tree using the following format: • Note: make sure you only use "x? <= integer" as the condition and you only return "2" or "4". Spaces and indentations do not matter. You can use any tie breaking rule you like for comparing information gain and finding the majority label. • (Part 2) Classify the following patients using your tree. This is the test set. You can either use the Download button to download a text file, or copy and paste from the text box into Excel or a CSV file. Please do not change the content of the text box. • (Part 2) Prune the tree so that the maximum depth is ?. The root is at depth 0. You could do this with or without a validation set, that is (1) with a validation set: decide whether to make a split by comparing the validation accuracy with and without the split; (2) without a validation set: keep splitting until you reach the maximum depth. Question 1 • [1 points] Enter the total number of positive and negative instances in the training set (two integers, comma-separated, in the order, benign, malignant). Hint • Count the number of 2's and 4' in the training set. • Answer: . Question 2 • [2 points] Enter the initial entropy at the root before the split (one number, rounded to 4 decimal places). Hint • See Lecture 6 slides for the formula for entropy. Say the integer you entered in the previous question is \(n_{2}, n_{4}\) and let \(n = n_{2} + n_{4}\), then the entropy is, \(H\left(Y\right) = - \dfrac{n_{2}}{n} log_{2}\left(\dfrac{n_{2}}{n}\right) - \dfrac{n_{4}}{n} log_{2}\left(\dfrac{n_{4}}{n}\right)\). • Answer: . Question 3 • [1 points] For the decision stump (Part 1), enter the number of positive and negative instances in the training set above and below the threshold (four integers, comma-separated, in the order: below-benign, above-benign, below-malignant, above-malignant). Hint • Suppose you split your variable \(x_{i}\) at \(t \in \left\{1, 2, ..., 9\right\}\), then the four numbers you enter should be: (1) \(n_{2-}\) the number of 2's with \(x_{i} \leq t\). (2) \(n_{2+}\) the number of 2's with \(x_{i} > t\). (3) \(n_{4-}\) the number of 4's with \(x_{i} \leq t\). (4) \(n_{4+}\) the number of 4's with \(x_{i} > t\). • In this case, the conditional entropy will be, \(H\left(Y | X_{i}\right) = - \dfrac{n_{2-}}{n} log_{2}\left(\dfrac{n_{2-}}{n_{-}}\right) - \dfrac{n_{2+}}{n} log_{2}\left(\dfrac{n_{2+}}{n_{+}}\right) - \dfrac{n_{4-}}{n} log_{2}\left(\dfrac{n_{4-}}{n_{-}}\right) - \dfrac{n_{4+}}{n} log_{2}\left(\dfrac{n_{4+}}{n_{+}}\right)\). • The information gain will be, \(I\left(Y | X_{i}\right) = H\left(Y\right) - H\left(Y | X_{i}\right)\). • For this question, you do not have to find the optimal split, but you will need to write the code for Part 2 anyways. One simple brute force way to do it is to compute the information gain for each of \(t = 1, 2, ..., 9\) and find the \(t\) corresponding to the largest information gain. A more efficient way is to list the unique values of \(x_{i}\) and only compute the information gain for those values of \(t\). • Answer: . Question 4 • [2 points] For the decision stump (Part 1), enter the information gain after the split (one number, rounded to 4 decimal places). Hint • See the hint for the previous question. • Answer: . Question 5 • [5 points] Input the binary decision tree in the format described previously. Hint • You should not split according to the order in the list of features, you still have to find the feature (in the list) corresponding to the max information gain at each split. • You can split the same feature multiple times. • Basically, you should solve Part 1 for each \(x_{i}\) in your list, and find the \(i\) corresponding to the largest information gain. • Given the optimal feature-split pair \(i, t\), split the dataset into two subsets, one containing all instances with \(x_{i} \leq t\) and the other containing all items with \(x_{i} > t\), and repeat the same process for the two subsets. • Stop splitting when all instances in the subset has the same label. • Stop splitting and use the majority label in the current subset if maximum information gain is 0. Now you can use your tree to classify the following patient: Feature vector (10 numbers, comma separated): OR (make sure you leave the above text field blank): 1. Sample code number: 2. Clump Thickness: 3. Uniformity of Cell Size: 4. Uniformity of Cell Shape: 5. Marginal Adhesion: 6. Single Epithelial Cell Size: 7. Bare Nuclei: 8. Bland Chromatin: 9. Normal Nucleoli: 10. Mitoses: Classify Label: ?. Corresponding feature vector: Copy. Question 6 • [2 points] Enter the maximum depth of this tree. The root is at depth 0. For example, if you only have "if ..., else ...", you should enter 1. Hint • Just making sure that we are counting the depth of the tree the same way. • Answer: Question 7 • [15 points] Input the class labels on the test set (200 integers, either 2 or 4, comma separated, in one line). Hint • Start from the root, if \(x_{i} \leq t\), go to the left subtree, otherwise, go the right subtree. Stop if the subtree is a leaf and return the label. Question 8 • [5 points] Input the pruned binary decision tree in the format described previously. Hint • To prune the tree without a validation set, you need to keep track of the current depth and once the maximum depth is reached, make a leaf and use the majority label. • In practice, you should not use the test set in training and validation, but for the purpose of this homework, you can prune the tree using the test set: *spoiler* the first 100 instances has label 2 and the last 100 instances has label 4. • To prune the tree using the test set, you evaluate the accuracy of each subtree: if the accuracy is lower than classifying all instances as 2 (or classifying all instances as 4), you should replace this subtree by a leaf and use the majority label; if the accuracy is higher, continue pruning the subtrees. Now you can use your pruned tree to classify the following patient: Feature vector (10 numbers, comma separated): OR (make sure you leave the above text field blank): 1. Sample code number: 2. Clump Thickness: 3. Uniformity of Cell Size: 4. Uniformity of Cell Shape: 5. Marginal Adhesion: 6. Single Epithelial Cell Size: 7. Bare Nuclei: 8. Bland Chromatin: 9. Normal Nucleoli: 10. Mitoses: Classify Label: ?. Corresponding feature vector: Copy. Question 9 • [15 points] Input the class labels on the test set (200 integers, either 2 or 4, comma separated, in one line). Hint • Start from the root, if \(x_{i} \leq t\), go to the left subtree, otherwise, go the right subtree. Stop if the subtree is a leaf and return the label. Question 10 • [1 points] Please confirm that you are going to submit the code on Canvas under Assignment P2, and make sure you give attribution for all blocks of code you did not write yourself (see bottom of the page for details and examples). I will submit the code on Canvas. Question 11 • [1 points] Please enter any comments and suggestions including possible mistakes and bugs with the questions and the auto-grading, and materials relevant to solving the question that you think are not covered well during the lectures. If you have no comments, please enter "None": do not leave it blank. • Answer: . Grade Grade ***** ***** ***** ***** ***** ***** ***** ***** ***** ***** Submission • Please do not modify the content in the above text field: use the "Grade" button to update. • Warning: grading may take around 10 to 20 seconds. Please be patient and do not click "Grade" multiple times. Check the box to confirm submission. Submit • You could submit multiple times (but please do not submit too often): only the latest submission will be counted. • Please also save the text in the above text box to a file using the button Download or copy and paste it into a file yourself Copy. You can also include the resulting file with your code on Canvas Assignment P2. • You could load your answers from the text (or txt file) in the text box below using the button Load. The first two lines should be "##p: 2" and "##id: your id", and the format of the remaining lines should be "##1: your answer to question 1" newline "##2: your answer to question 2", etc. Please make sure that your answers are loaded correctly before submitting them. Load • Saving and loading may take around 10 to 20 seconds. Please be patient and do not click "Load" multiple times. Solutions • The sample solution in Java and Python will be posted around the deadline. You are allowed to copy and use parts of the solution with attribution. You are allowed to use code from other people (with their permission) and from the Internet, but you must and give attribution at the beginning of the your code. MOSS will be used for code plagiarism check: blocks of copied code without attribution will result in a zero for the whole assignment. For example, you can put the following comments at the beginning of your code: % Code attribution: (TA's name)'s P2 example solution. % Code attribution: (student name)'s P2 solution. % Code attribution: (student name)'s answer on Piazza: (link to Piazza post). % Code attribution: (person or account name)'s answer on Stack Overflow: (link to page). • Sample solution from last year: 2020 P2. The homework is slightly different, please use with caution. • Sample solution: Java: TBA Python: TBA • You can get help on understanding the algorithm from any of the office hours; to get help with Python, please go to the TA's office hours; to get help with Java, please go to the instructor's office hours. For times and locations see: Home Page. You are encouraged to work with other students, but if you use their code, you must give attribution at the beginning of your code. Last Updated: June 16, 2022 at 9:07 PM Home Week 1 Week 2 Week 3 Week 4 Week 5 Week 6 Week 7 Week 9 Week 10 Week 11 Week 12 Week 13 Midterm Final