CS 10: Winter 2016
Lab Assignment 4: Huffman Coding
Due: Wednesday, February 17

In this assignment, you will use Huffman encoding to compress and decompress files. You will get practice with binary trees, priority queues, maps, and file input and output.

Organizational notes

Note that because of the midterm next Thursday, you have an extra week for this assignment.

As in Lab Assignment 3, you are permitted to work with one other student currently in the class. You do not have to work with someone else, but you have the option of doing so. Your partner for this assignment does not have to be the same partner you worked with for Lab Assignment 3. If you choose to work with a partner, you will both get the same grade on the assignment. If you would like to work with a partner and cannot find one, let me know and I'll try to match you up with someone else who is seeking a partner. There will be no penalty, in terms of points, for working together on this assignment. Please make sure that both of you submit the code via Canvas. If you have a partner, state your partner's name in the comments section of Canvas.

You should weigh whether you will get more out of this assignment working alone or with someone else. The choice is up to you. If you choose to work with someone else, pick your partner carefully. Make sure that there are times that you are both available to work together. If you frequent the lab and you notice someone else who often is there when you are, that person might be a good choice as your partner.

Huffman coding

Early in the development of computing technology, computer memory was very expensive. Therefore, storing large files was discouraged, and being able to compress files to half or 60% of their original size was important. Memory is much cheaper now, but smaller hand-held computers for use in cellphones and other devices are commonplace. On these small devices, we once again encounter storage bottlenecks. We also communicate over low-bandwidth channels, such as dialup, DSL, and wireless, where smaller files mean shorter download times. The proliferation of .zip files shows that file compression is still important today.

One of the earliest schemes for file compression was invented by David Huffman, who founded the Computer Science department at the University of California at Santa Cruz. Instead of using seven bits to encode each character, as ASCII does, Huffman coding uses a variable-length encoding of characters. Frequently occurring characters get shorter codewords than infrequently occurring ones. (A codeword is the sequence of 0s and 1s used to encode the character.) Huffman encoding gives the smallest possible fixed encoding of a file. A fixed encoding means that a given letter is represented by the same codeword wherever it appears in the file. Some more recent schemes that do better than Huffman's are adaptive, which means that how a letter is encoded changes as the file is processed.

Section 13.4 of the textbook contains a description of how Huffman encoding works and even gives you pseudocode for how to build the Huffman code tree. Read this section.

A problem with variable-length encodings is figuring out where a codeword ends. For example, if we used 1 for E, 10 for F, 00 for X, and 01 for Y, we would be in trouble: when decoding 1001, we would be unsure whether it represents EXE or FY. Huffman encoding removes this ambiguity by producing prefix-free codes.
(The textbook calls them "prefix codes," but "prefix-free" is more apt.) In a prefix-free code, for any given codeword, adding bits to the end cannot produce another codeword. Hence, no codeword is a prefix of another. When we see a series of bits that corresponds to a codeword, we know that it cannot be part of a larger codeword, and so we can safely interpret these bits as the corresponding character.

So how do we implement Huffman codes? We need to generate a set of prefix-free codes whose lengths are inversely correlated with the frequency of the encoded character. There are two parts to the algorithm. First, we need to construct a binary tree using a priority queue. Then, we use the binary tree to generate the codewords. Each character in the binary tree is a leaf, and the path from the root to that character gives that character's codeword. Specifically, we label each edge with a 0 or a 1: an edge going to a left child appends a 0 to the codeword, and an edge going to a right child appends a 1. For example, in this tree, the codeword for e is 1101:

[Figure: a Huffman code tree]

The number to the right of the colon in a leaf of this figure gives the frequency of the character. The number in an internal node gives the sum of the frequencies of the leaves in its subtree. We'll soon see why we need this information.

Generate a frequency table

The first step in creating the code for a given file is to determine the character frequencies. This procedure is fairly simple. First, we create a map where characters are keys and character frequencies are values. Then, we read the file one character at a time. If the character is not in the map, we add it with a frequency of 1. If the character is already in the map, we increment its frequency. When we're done reading the file, we will have a map that maps each character to the number of times that it appears in the file. (Remember that in Java, you have to use the appropriate wrapper classes to store the character and the frequency, because you cannot store primitive types in a Map.)

Create singleton trees

Next, create a tree for each character consisting of just one node; that is, create a singleton tree for each character. The root of each singleton tree stores two values: a character and its frequency. Then, insert all these singleton trees into a min-priority queue, with the frequencies as keys.

Use the binary tree code we saw in class (BinaryTree.java) to implement the binary trees. Because you need to store two values and a tree node has only one data value, you will have to create a class that stores a character and its frequency. This new class will be the data type for the tree. You should provide accessor methods as needed.

You can use whatever priority-queue implementation you like. I found it easiest to use Java's PriorityQueue class, which implements a heap-based priority queue. Note that the things that you are comparing are the frequency counts in the root nodes of two trees. It is simplest to write a separate TreeComparator class that implements the Comparator interface. The only method that you need to implement is compare, which takes two tree nodes and returns -1, 0, or 1 depending on whether the first has a smaller frequency count, the counts are equal, or the second has the smaller frequency count. You pass a TreeComparator object as the second parameter of the PriorityQueue constructor; the first parameter is the initial size for the queue. Look up PriorityQueue in the Java documentation.
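As a concrete illustration, here is a minimal sketch of such a data class and comparator, along with code that loads the singleton trees into a PriorityQueue. The names here (CharFreq, the frequencies map from the frequency-table step) are placeholders, and the BinaryTree<E> API used (a constructor taking a data value and a getData accessor) is an assumption based on the BinaryTree.java from class; adapt all of this to your own design.

    import java.util.Comparator;
    import java.util.Map;
    import java.util.PriorityQueue;

    // Data stored in each tree node: a character and its frequency.
    class CharFreq {
        private char ch;        // the character; unused in internal nodes
        private long frequency; // this character's count, or a subtree's total

        public CharFreq(char ch, long frequency) {
            this.ch = ch;
            this.frequency = frequency;
        }

        public char getChar() { return ch; }
        public long getFrequency() { return frequency; }

        public String toString() { return ch + ":" + frequency; }
    }

    // Orders two code trees by the frequencies in their root nodes, so that
    // a PriorityQueue of trees behaves as a min-priority queue on frequency.
    class TreeComparator implements Comparator<BinaryTree<CharFreq>> {
        public int compare(BinaryTree<CharFreq> t1, BinaryTree<CharFreq> t2) {
            long f1 = t1.getData().getFrequency();
            long f2 = t2.getData().getFrequency();
            if (f1 < f2) return -1;
            if (f1 > f2) return 1;
            return 0;
        }
    }

    // Given the frequency map from the previous step (assumed here to be a
    // Map<Character, Long> named frequencies), load one singleton tree per
    // character into a min-priority queue. Math.max guards against an empty
    // map, because PriorityQueue requires an initial capacity of at least 1.
    PriorityQueue<BinaryTree<CharFreq>> queue = new PriorityQueue<BinaryTree<CharFreq>>(
            Math.max(1, frequencies.size()), new TreeComparator());
    for (Map.Entry<Character, Long> entry : frequencies.entrySet()) {
        queue.add(new BinaryTree<CharFreq>(new CharFreq(entry.getKey(), entry.getValue())));
    }

The -1/0/1 convention in compare matches the description above; any negative, zero, or positive return value would also satisfy the Comparator interface.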
Creating a single tree

Our goal is to create a single tree in which the lowest-frequency characters are deepest in the tree, and hence have the longest codewords, and the highest-frequency characters are near the top of the tree and have short codewords. Use the priority queue to achieve this goal, as follows:

1. Extract the two lowest-frequency trees T1 and T2 from the priority queue.
2. Create a new tree T by creating a new root node r, attaching T1 as r's left subtree, and attaching T2 as r's right subtree. (Which of T1 and T2 is the left or right subtree does not matter.)
3. Assign to the new tree T a frequency that equals the sum of the frequencies of T1 and T2.
4. Insert the new tree T into the priority queue, with its frequency as its key.

Each time steps 1–4 execute, the number of trees in the priority queue decreases by one. We keep repeating the above four steps until only one tree remains in the priority queue. This tree is our Huffman code tree.

Here is an example that shows the steps. In each part, the tree roots appear in order of increasing frequency. Part (a) shows the singleton trees for the first six characters of the alphabet. Parts (b)–(f) show the result of executing steps 1–4 above, with part (f) showing the final Huffman code tree.

[Figure: building the Huffman code tree, parts (a)–(f)]

We won't prove why this tree gives the most efficient codes for the characters, but the proof can be found online or in most algorithms textbooks. However, it should be intuitive that since we built up from the lowest-frequency nodes, the lower a character's frequency, the deeper its leaf node will be in the tree.

Retrieving the codes

The Huffman code tree encodes all of the information about the code for each character. But given a character, it is a bother to search through the tree to find it in a leaf and trace its path from the root. To make encoding fast, we want a Map that pairs characters with their codewords. That is, we want to pair each character with a string of 0s and 1s that describes the path from the root to that character.

You can construct the entire map during a single traversal of the tree. You just have to keep track of a "path so far" parameter as you do the traversal. I will let you work out the details. There are less efficient ways to solve the problem, but for full credit you need to produce the code map during a single traversal of the code tree.

Compressing the file

To compress the input text file, repeatedly read the next character in the file, look up its codeword in the code map, and then write the sequence of 0s and 1s in that codeword as bits to an output file. Reading and writing files is not hard, but it is a bit detailed. We describe below how to do so.

Decompressing the compressed file

Decompressing the compressed file is similar to compression, in that you read from one file and write to another. To decode, process the compressed file bit by bit. Start at the root, and read the first bit from the compressed file. If it is a 0, go left, and if it is a 1, go right. Repeatedly read a bit and go left or right until you reach a leaf. You have now decoded your first character. Get the character out of the data value of that leaf and write it to the output file. Then go back to the root and repeat the process, starting at the next bit of the compressed file. Decompression is complete once there are no more bits to read. You can compare your input file and decompressed file to verify that they are identical.
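Before turning to file input and output, here is a minimal sketch tying together the tree-building and code-retrieval steps described above. It reuses the placeholder CharFreq class from the earlier sketch and assumes the BinaryTree<E> from class provides a (data, left, right) constructor along with isLeaf, hasLeft, hasRight, getLeft, and getRight; adjust to the actual class and to your own method names.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.PriorityQueue;

    public class CodeTreeSketch {

        // Steps 1-4: repeatedly combine the two lowest-frequency trees until one remains.
        public static BinaryTree<CharFreq> buildCodeTree(PriorityQueue<BinaryTree<CharFreq>> queue) {
            while (queue.size() > 1) {
                BinaryTree<CharFreq> t1 = queue.remove();   // lowest frequency
                BinaryTree<CharFreq> t2 = queue.remove();   // next lowest
                long sum = t1.getData().getFrequency() + t2.getData().getFrequency();
                // The character in an internal node is never used; '\0' is just a placeholder.
                queue.add(new BinaryTree<CharFreq>(new CharFreq('\0', sum), t1, t2));
            }
            return queue.isEmpty() ? null : queue.remove();   // null if the input file was empty
        }

        // Build the character -> codeword map in a single traversal of the tree,
        // carrying the "path so far" down as a parameter.
        public static Map<Character, String> buildCodeMap(BinaryTree<CharFreq> codeTree) {
            Map<Character, String> codeMap = new HashMap<Character, String>();
            if (codeTree != null) addCodes(codeTree, "", codeMap);
            return codeMap;
        }

        private static void addCodes(BinaryTree<CharFreq> tree, String pathSoFar,
                                     Map<Character, String> codeMap) {
            if (tree.isLeaf()) {
                codeMap.put(tree.getData().getChar(), pathSoFar);
            }
            else {
                if (tree.hasLeft()) addCodes(tree.getLeft(), pathSoFar + "0", codeMap);
                if (tree.hasRight()) addCodes(tree.getRight(), pathSoFar + "1", codeMap);
            }
        }
    }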
You may have noticed that we have cheated a bit: we kept the Huffman code tree that we used for compression around and later used it to decompress. A practical compression application, such as zip, has one compress method and an independent decompress method. The code tree has to be somehow included in the file during compression so that it can be read and used for decompression. You may do so for extra credit (see below).

Reading and writing files

For compression, you need to read characters from a file and then write bits to a different file. For decompression, you read bits from a file and write characters to a different file. Fortunately, the Java library has classes that make it easy to read characters from a file and write characters to a file. Unfortunately, the Java library does not have classes to read and write bits. To remedy this situation, I have written classes to read and write bits and provided them to you. You don't have to read the code, although you might find it interesting to do so. The tricky part is that we can read and write only bytes, not individual bits, to files, so if we have a number of bits that is not a multiple of 8, the last byte will be only partially full. How can we tell how many of those bits are useful and how many are garbage?

Reading and writing characters

To read from a file, I recommend that you use a BufferedReader. "Buffered" says that it reads a bunch of characters into a buffer and then doles them out one at a time as you ask for them, rather than going to the disk each time that you ask for a character. Disks always transfer a whole block of bytes every time they read or write, and buffered I/O speeds things up considerably. (Accessing a disk is 100,000 to 1,000,000 times slower than accessing main memory.) To read a file, start with this declaration and assignment:

    BufferedReader inputFile = new BufferedReader(new FileReader(pathName));

where pathName is the full name of the file on your computer. (More on path names later.) This line will open a file and save a buffer and information about the current position in the file within a BufferedReader object referenced by inputFile. To read a character, call inputFile.read(). The read method returns an int that holds the Unicode encoding of the character; you just have to cast it to a char. When there are no more characters to read, read returns the value -1, which is not a valid Unicode value for any character.

To write to a file, start with this declaration and assignment:

    BufferedWriter outputFile = new BufferedWriter(new FileWriter(decompressedPathName));

where decompressedPathName is the name of the output file that you want to create. To write a character c, call outputFile.write(c).

When you have finished reading or writing a file, you must close it by calling close(). That is, call inputFile.close() or outputFile.close(). Closing a file frees up resources and cleans things up. If you don't close the output file, then the last buffer might not even get written out to the disk and the file can be left in an inconsistent state. So always close files, both input and output, when you are done with them.

Reading and writing bits

I have supplied the classes BufferedBitReader.java and BufferedBitWriter.java. You use them in a manner very similar to the classes above.
The way to open a reader and a writer is like so:

    BufferedBitReader bitInputFile = new BufferedBitReader(compressedPathName);
    BufferedBitWriter bitOutputFile = new BufferedBitWriter(compressedPathName);

To read, call bitInputFile.readBit(), which returns an int that equals 0 or 1 (or -1 when there are no more bits to read). To write, call bitOutputFile.writeBit(bit), where bit is an int that had better be equal to 0 or 1. There is also a close method for each of these classes. Call it when you are done reading or writing the file. If you do not close the output file, I can guarantee that you will not be able to correctly read it later.

Handling exceptions

All of the file-access methods above may potentially throw an IOException. Because IOException is a checked exception, you need to include try and catch blocks. Think carefully about where you want to catch errors. It may be easier to handle them in the main method than in a method that it calls. You should also include finally blocks to close files. For example, here is how you might read characters from the input file to compute frequencies:

    // Holds the input-file reader.
    BufferedReader inputFile = new BufferedReader(new FileReader(pathName));
    try {
        int cInt = inputFile.read();   // read the next character's integer representation
        while (cInt != -1) {
            // Code to process the character whose Unicode value is cInt goes here.
            cInt = inputFile.read();   // read the next character's integer representation
        }
        // return-statement for the frequency map goes here.
    }
    finally {
        inputFile.close();
    }

Getting a file's path name

To get the complete path name of a file, it is easiest to use a file chooser. The following method will return the path name of the file you choose from a standard file-chooser window.

    import java.io.*;
    import javax.swing.JFileChooser;
    import javax.swing.SwingUtilities;
    import java.lang.reflect.InvocationTargetException;
    import java.util.concurrent.atomic.AtomicReference;

    /**
     * Puts up a fileChooser and gets the path name for the file to be opened.
     * Returns an empty string if the user clicks Cancel.
     * @return path name of the file chosen
     */
    public static String getFilePath() {
        // Create a file chooser and run it on the Swing event thread.
        final AtomicReference<String> result = new AtomicReference<>();
        try {
            SwingUtilities.invokeAndWait(new Runnable() {
                public void run() {
                    JFileChooser fc = new JFileChooser();
                    int returnVal = fc.showOpenDialog(null);
                    if (returnVal == JFileChooser.APPROVE_OPTION) {
                        File file = fc.getSelectedFile();
                        String pathName = file.getAbsolutePath();
                        result.set(pathName);
                    }
                    else
                        result.set("");
                }
            });
        }
        catch (InvocationTargetException | InterruptedException e) {
            e.printStackTrace();
        }
        return result.get();
    }

You will be dealing with three files: the original text file, the compressed text file, and the decompressed text file. I suggest that you get the original file path name from the chooser above and then generate the names of the other two by putting _compressed and _decompressed after the file name but before the .txt extension. So if your original file is WarAndPeace.txt, then the compressed file would be WarAndPeace_compressed.txt and the decompressed file would be WarAndPeace_decompressed.txt. This naming convention makes it easier to keep track of the relationships between your various test files. I suggest that you use the String class's substring method to construct the new file names.
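With the code map, the character and bit I/O classes, and the path names in hand, the compressor and decompressor themselves are short loops. Here is a minimal sketch; the class and method names are placeholders, it reuses the assumed BinaryTree<CharFreq> code tree from the earlier sketches, and it uses the supplied BufferedBitReader and BufferedBitWriter exactly as described above.

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.Map;

    public class HuffmanIOSketch {

        // Insert a suffix before the extension, e.g.
        // addSuffix("WarAndPeace.txt", "_compressed") -> "WarAndPeace_compressed.txt".
        // (Assumes the path name ends with an extension such as .txt.)
        public static String addSuffix(String pathName, String suffix) {
            int dot = pathName.lastIndexOf('.');
            return pathName.substring(0, dot) + suffix + pathName.substring(dot);
        }

        // Read each character, look up its codeword, and write the codeword's bits.
        public static void compress(String pathName, String compressedPathName,
                                    Map<Character, String> codeMap) throws IOException {
            BufferedReader input = new BufferedReader(new FileReader(pathName));
            BufferedBitWriter bitOutput = new BufferedBitWriter(compressedPathName);
            try {
                int cInt = input.read();
                while (cInt != -1) {
                    String codeword = codeMap.get((char) cInt);
                    for (int i = 0; i < codeword.length(); i++) {
                        bitOutput.writeBit(codeword.charAt(i) == '1' ? 1 : 0);
                    }
                    cInt = input.read();
                }
            }
            finally {
                input.close();
                bitOutput.close();   // essential: lets the writer finish the last, partial byte
            }
        }

        // Walk the code tree bit by bit, writing a character at each leaf.
        public static void decompress(String compressedPathName, String decompressedPathName,
                                      BinaryTree<CharFreq> codeTree) throws IOException {
            BufferedBitReader bitInput = new BufferedBitReader(compressedPathName);
            BufferedWriter output = new BufferedWriter(new FileWriter(decompressedPathName));
            try {
                BinaryTree<CharFreq> current = codeTree;   // start at the root
                int bit = bitInput.readBit();
                while (bit != -1) {
                    current = (bit == 0) ? current.getLeft() : current.getRight();
                    if (current.isLeaf()) {
                        output.write(current.getData().getChar());
                        current = codeTree;                // back to the root for the next codeword
                    }
                    bit = bitInput.readBit();
                }
            }
            finally {
                bitInput.close();
                output.close();
            }
        }
    }

The sketch simply declares that these methods throw IOException; as noted above, you might instead catch the exceptions in main.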
Testing

Create a small test file to help you debug your program. When you have your program working, test it on three text files that I provide, MobyDick.txt, USConstitution.txt, and WarAndPeace.txt. To give you an idea of the file sizes, here is a table of the original and compressed file sizes, in bytes, for the three files using my code:

    File                  Original size    Compressed size
    MobyDick.txt          1193826          673580
    USConstitution.txt    45119            25337
    WarAndPeace.txt       3223372          1811745

Because Huffman coding is deterministic, your compressed file sizes should be the same as mine.

Also create test files that will test boundary cases for your program. Boundary cases are cases that somehow are "at the edge" of valid input that your program might have to handle differently than a general file. You should figure out what sorts of files might qualify. One case is an empty file. You should be able to compress and decompress such files successfully. Neither I nor the course staff is going to make a list of these cases for you or confirm that you have found them all. Figuring out what cases need to be tested is one of the things that we hope you will learn during this course. When you write programs outside of course work, you won't always have someone telling you what special cases you need to test.

For your convenience, I have included the three text files, BufferedBitReader.java, BufferedBitWriter.java, and BinaryTree.java in this zip file.

Extra credit

You are now able to compress and decompress files using Huffman encoding. You used the same tree to both encode and decode, however. Normally, the tree that you used to encode the file will not be around when decoding the compressed file. For a practical system (similar to zip), the code tree has to be saved and reconstructed. Note that when you are decoding, the frequencies do not matter; you need only the tree shape and the order of the characters at the leaves.

For extra credit, implement a way to write out the code tree and then read it back in and regenerate it. You should write out the code tree when you compress a file and read it in when you decompress a file. For some extra credit, write a separate file to store the information needed to reconstruct the tree. For substantial additional extra credit, write this information at the front of the file that contains the encoded characters, so that the file contains all of the information needed to decompress it.

When I say "write out the tree," I mean write out enough information to be able to regenerate the tree. One option is to use some sort of parenthesized notation that you then parse to reconstruct the tree. The trick of writing the tree in preorder and inorder could work, but you would need to generate unique names at the internal nodes that do not conflict with each other or with characters at the leaves. You could reconstruct the tree from frequency data, but you may have to be careful about the way that you deal with equal frequencies. I am sure that there are many other possibilities. You would like your representation to be compact, because the goal is to compress the file.

Huffman coding is not the only way to compress text. Another method, called LZW compression, does not require the compressor to record within the output any information about how it compressed. You can read about it in this excerpt from Algorithms Unlocked. For even more extra credit, implement LZW compression and decompression, submitting the LZW compressor and decompressor in separate files. Just compress ASCII characters; don't worry about Unicode, so you'll need to start with just 256 characters. If you are going to write a binary file of ints, I suggest using the DataOutputStream class and its writeInt method. To read a binary file of ints, I suggest using the DataInputStream class and its readInt method.
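For instance, here is a small self-contained sketch (the file name and values are just examples) showing how writeInt and readInt pair up; readInt signals the end of the file by throwing EOFException.

    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.EOFException;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    // Writing and reading back a binary file of ints.
    public class IntFileDemo {
        public static void main(String[] args) throws IOException {
            String pathName = "codes.bin";   // example name

            // Write a few ints in binary form.
            DataOutputStream out = new DataOutputStream(new FileOutputStream(pathName));
            try {
                out.writeInt(256);
                out.writeInt(42);
            }
            finally {
                out.close();
            }

            // Read them back; readInt throws EOFException when the file runs out.
            List<Integer> values = new ArrayList<Integer>();
            DataInputStream in = new DataInputStream(new FileInputStream(pathName));
            try {
                while (true) {
                    values.add(in.readInt());
                }
            }
            catch (EOFException e) {
                // reached the end of the file
            }
            finally {
                in.close();
            }
            System.out.println(values);   // prints [256, 42]
        }
    }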
You can read about these classes in the Java documentation, and you can always use your favorite search engine (what, there's more than one?) to ask about details of using these classes. It's extra credit, after all! If you come up with other interesting extra credit ideas, feel free to ask me how much extra credit they are worth.

Suggestions

This is not the kind of program where you can write out all the code, click the run button, select WarAndPeace.txt, and expect to be able to tell from the files printed where the bugs are in your program. Start with a short file (a couple of dozen characters), preferably one with a range of character frequencies. Call System.out.println to print out the frequency map. (The Java Map implementations have a toString that gives reasonable output.) Then print the code tree. (You did override toString in your class that holds the character and its frequency, didn't you? The BinaryTree class supplies the rest of the toString to print an indented tree.) Print out the code map. My sample solution currently has all of these print statements in it, within the bodies of if-statements that check whether a boolean debugging flag is set.

It could be interesting to see the frequency numbers for WarAndPeace.txt and to see how much the codeword length varies. You can also put in print statements to print out every character or bit read from files and written to files. Again, you should either comment them out before you compress WarAndPeace.txt, or put the print statements under the control of a debugging flag.

In my implementation, many of the methods are static.

Acknowledgement

Some of the text for this assignment writeup is modified from a writeup by Delaney Granizo-Mackenzie.

How to submit your work

For the basic assignment, submit everything in Canvas in a single zip file. Include all the files needed to run your code, even if I provided the file and you have not changed it. That will make it easy for your section leader to run your code. Uploading a single zip file will be easier than uploading lots of files to Canvas. If you do extra credit, submit a second zip file containing all the files needed to run your extra credit submission; if there's a file that is in both your basic assignment and your extra credit, include it in both zip files. Submit the basic part of the assignment on Canvas in the Lab Assignment 4 column, and the extra credit in the Lab Assignment 4 Extra Credit column. Make sure that the zip files for the basic part of the assignment and the extra credit part have different names from each other. We ask for your extra credit to be submitted separately so that if it does not work correctly, we can still grade the basic assignment.

Along with your code, include the test files that you wrote, along with their compressed and decompressed output. Include the compressed and decompressed versions of USConstitution. Do not include input or output for the WarAndPeace and MobyDick .txt files that I have provided. These files are large and unnecessary; your section leader can run your code on these files.
Grading

Your section leader will run your program and will grade this assignment looking for the following things. Total of 135 points.

Correctness, Efficiency, and Elegance (80 points)
  10  Generates the frequency map
  10  Creates a priority queue of singleton trees
  15  Builds the code tree
  15  Traverses the code tree to generate a Map from characters to codewords
  10  Reads the input file, compresses it, and writes the compressed file
  15  Reads the compressed file, decompresses it, and writes the decompressed file
   5  Handles boundary cases correctly

Structure (25 points)
  10  Good decomposition into objects and methods
   5  Proper use of instance and local variables
   5  Instance variables are private
   5  Proper use of parameters

Style (20 points)
   5  Comments for classes
   5  Comments for methods (purpose, parameters, what is returned) in Javadoc form
   4  Good names for methods, variables, parameters
   3  Layout (blank lines, indentation, no line wraps, etc.)
   3  Other unspecified style issues

Testing (10 points)
   5  Output from USConstitution.txt and your small test case(s). Note that your file sizes for USConstitution should match those above.
   5  Output from your boundary case test(s)

Extra credit
  20  Write code tree information to a separate file and reconstruct the tree
  40  Include tree information in the compressed file (in addition to the part above)
  60  LZW compression and decompression
   ?  Other interesting additions