Comp 212 - Lab 12: IO Stream and Parsing Comp 212 - Lab 12: Java Stream I/O - Recursive Descent Parsing Introduction In Java, input and output is defined in terms of an abstract concept called "stream". A stream is a sequence of data. If it is an input stream, it has source. If it is an output stream, it has a destination. There are two kinds of streams: byte streams and character streams. The java.io package provides a large number of classes to perform stream I/O. Mastering Java stream I/O seems like a daunting task. Do not fret. In the beginning, you only need to learn to how to manipulate a few I/O classes. As you progress, you can figure out on your own how to use the other I/O classes and even design new customized I/O classes. This tutorial describes a few commonly used I/O classes and a class used to parse the input streams called StreamTokenizer. To illustrate the use StreamTokenizer, the tutorial provides you with a few code examples, and leads you to the process of developing a recursive descent parser to parse scheme-like arithmetic expressions and build the corresponding abstract syntax trees. The InputStream and OutputStream classes Input and output in Java of byte (i.e. binary) streams are handled through subclasses of the java.io.InputStream and java.io.OutputStream classes. These classes are abstractions that Java provides for dealing with reading and writing information sequentially to/from anything you want, be it a disk, a string buffer, an enumeration, or an array of bytes. We will not spend time on these classes in this lab. Click on this link to see more details on InputStream and OutputStream. The Reader and Writer classes Input and output for characters (i.e. ASCII) streams are handled through subclasses of the java.io.Reader and java.io.Writer classes. These classes are abstractions that Java provides for dealing with reading and writing character data. Below is the UML diagram for a few commonly used Reader classes. And here is a UML diagram for common Writer classes. The section on parsing below will illustrate the use of Reader and Writer streams to read a text file containing a Scheme like arithmetic expression, parse it, and build the corresponding abstract syntax tree. StreamTokenizer and Parsing StreamTokenizer Many times when reading an input stream of characters, we need to parse it to see if what we are reading is a word or a number. This process is called "tokenizing". java.io.StreamTokenizer is a class that can do simple tokenization. TestStreamTokenizer.java is a sample program showing how to use StreamTokenizer. Copy this file and the file input.txt to your local directory and test it out. Abstract Syntax Tree To further illustrate the use of text file IO and StreamTokenizer to parse an input file, we write a program to read in scheme-like arithmetic expressions in pre-fix format, such as (* 3 (+ 4 5)), and build the corresponding abstract syntax tree (AST): * / \ 3 + / \ 4 5 Once the AST is built, one can then traverse and evaluate it, as done in Homework #3 (download the solution)! We shall restrict ourselves to arithmetic expressions containing integers, strings (representing variables), and binary operations such as addition and multiplication. The syntax of the Scheme-like arithmetic expression can be expressed as a "grammar" with the following rules. Scheme Arithmetic Expression Grammar Rules Description E :: Leaf | BinOp An expression E is either a Leaf or a BinOp. Leaf :: VarLeaf | IntLeaf A Leaf is either a VarLeaf or an IntLeaf VarLeaf :: id A VarLeaf is an id, where id is a token representing a variable. IntLeaf :: int An IntLeaf is an int, where int is a token representing an integer. BinOp :: AddOp | MulOp A BinOp is either an AddOp or a MulOp AddOp :: ( + E E ) An AddOp is a left parenthesis token, followed by a plus token, followed by an expression E, followed by another expression E, followed by a right parenthesis token. MulOp :: ( * E E ) An MulOp is a left parenthesis token, followed by a star token, followed by an expression E, followed by another expression E, followed by a right parenthesis token. An example for such an expression is: (* 34 (+ xyz 7)), where ( is the left parenthesis token * is the star token 34 is an int token + is the plus token xyz is an id token ) is the right parenthesis token. In the above grammar rules, E, Leaf, VarLeaf, IntLeaf, BinOp, AddOp, MulOp are called non-terminal symbols. They are non-terminal because they are defined in terms of other symbols. In contrast, int, id, (, ), + , * are not defined in terms of any other symbols, and are called the terminal symbols, (or sometimes 'token'). E plays a special in "starting" the grammar rules, and is called the "start" symbol. Whenever we specify a grammar, we must specify what the start symbol is. Recursive Descent Parsing The manufacturing of an abstract syntax tree (AST) for the above grammar can be thought of a factory method, makeAST(), of some abstract factory, IASTFactory. How the AST is created is a variant as there are many ways to parse the input stream. We shall implement a special parsing technique called "recursive descent parsing" (RDP). In RDP, we use a "tokenizer" to scan and "tokenize" the input from left to right, and build the AST from the top down, based on the value of the tokens. Building an AST from the top down means building it by first looking at the grammar rule for the start symbol E and try to create an AST that represent E. As we look at the grammar rule for the start symbol E in the above, we can see that in order for RDP to make an AST, it needs to know how to make a Leaf AST and a BinOp AST. Looking at the rule for Leaf, we see that in order to make a Leaf AST, RDP needs to know how to make a VarLeaf and an IntLeaf. Making VarLeaf and IntLeaf is easy and can be done directly. Looking at the rule for BinOp, we see that in order to make a BinOp AST, RDP needs to know how to make an AddOp AST and a MulOp AST. Looking at the rule for AddOp and MulOp, we see that in order to make an AddOp AST and a MulOp AST, RDP needs to know how to make an AST, which comes from the rule for E. This completes the recursive construction. Here is some pseudo-code. makeAST(): asks the tokenizer for the next token t, and then asks t to call the appropriate factory method the int token and the id token call makeLeaf() the left parenthesis token calls makeBinOp() all other tokens should flag an error! does the above "smell" like the visitor pattern to you or not? Who are the hosts and who are the visitors? makeLeaf(): the int token calls makeIntLeaf(). the id token calls makeVarLeaf(). all other tokens should flag an error. makeIntLeaf(): create the IntLeaf directly. makeVarLeaf(): create the VarLeaf directly makeBinOp(): asks the tokenizer for the next token t, then asks t to call the appropriate factory method the + calls makeAddOp(), the * token calls makeMulOp() all other tokens should flag an error! makeAddOp(): calls makeAST() to create the first AST calls makeAST() again to create the second AST asks the tokenizer for the next token t, then asks t to create the Add AST: the right parenthesis token will instantiate an appropriate Add AST all other tokens should flag an error. makeMulOp(): analogous to makeAddOp(). In short, a RDP has a tokenizer and maintains a current token; has a method for each of the non-terminal symbols in the grammar; the method that corresponds to the start symbol initiates the parsing. Usually, parsing only involves checking for the syntax of an input expression. However, in our case, we decide to build the abstract syntax tree directly as we parse. public class RdpASTFactory implements IASTFactory { private AToken _currentTok; private ITokenizer _tokenizer: public RdpASTFactory(ITokenizer tokenizer) { _tokenizer = tokenizer;} public IAST makeAST() { // etc...} protected IAST makeLeaf() { // etc...} // other (protected)factory methods, etc. } Tokens and Tokenizer The above pseudo-code suggests designing a union of token objects coupled with visitors, and an abstract tokenizer class that knows how to get the next token. One way to implement this tokenizer class is to wrap a StreamTokenizer object. Union of Tokens public abstract class AToken { protected String _lexeme; protected AToken(String lex) { _lexeme = lex; } public String getLexeme() { return_lexeme;} public abstract Object execute(ATokVisitor v, Object inp); } public class IdToken extends AToken{ // constructor, etc. public Object execute(ATokVisitor v, Object inp) { return v.idCase(this, inp);) } } Token visitors: the variant behaviors of the tokens are encapsulated in an abstract visitor class, as follows. public abstract class ATokVisitor { /** * This is the case when the host is a concrete IdToken. * Throws an IllegalArgumentException as default behavior. * @param host a concrete IdToken. * @param inp the input needed for this ATokVisitor to perform its task. */ public Object idCase(IdToken host, Object inp) { throw new IllegalArgumentException("Wrong token!"); } // etc..., a method for each concrete token case. } interface ITokenizer ( public AToken getNextToken(); } public class StreamTokWrapper implements ITokenizer { private StreamTokenizer _st; // constructor and methods, etc... } Lab Activities Divide the lab into 4 groups. One group builds the tokenizer StreamTokWrapper: play around with TestStreamTokenizer using a file with a scheme arithmetic expression as input. One groups builds the union of tokens: AToken and its concrete variants One group writes all the needed visitors One group write the recursive descent parser. To test the code, download the solution to homework #3, unzip it, and move the various java files into their appropriate subdirectories to match the packaging. Command-line arguments Sometimes when running large programs you want to be able to set options about how your program should behave at runtime. For instance, you might have a hierarchy of debugging statements that you want to suppress at run-time, or you might have a variety of features that you want your users to have access to (check out the manual page for the Unix command "ls"--there's a vast amount of features regarding how to display your data, what to display, ad nauseam). You can handle this in several ways, but most people make use of command-line arguments to do it. Remember how all your main functions for your programs have to have the same signature? public static void main(String[] argv) { ... } The array of Strings is where you keep your command-line arguments. Everything after the class name is included in argv. Let's say we wanted to write a program that just echoed back the command-line arguments to stdout. Here's how we'd do it: public class echo { public static void main(String[] argv) { for(int j = 0; j < argv.length; j++) System.out.print(argv[j] + " "); // don't insert an end-line character yet System.out.println(); // now print an end-line ('\n') } } The System class The java.lang.System class is a useful class containing useful static members doing useful things. This should look familiar: System.out.println("Hello world!"); I'm sure you've done calls like this a million times already, but what's really going on? out is a static member of the System class; it's a PrintStream object. A PrintStream is a grandchild of OutputStream in the Java class hierarchy--it has methods implemented that print lines of text at a time as opposed to each character at a time. System.out is initialized when a program starts to what is known as standard output (stdout). Stdout is usually the monitor screen, but you can also send stdout to a file at runtime by redirecting it from the Unix command line. For example, to send the stdout to file "outfile.txt", we do the following: % java MyClass > outfile.txt There is also a System.in class popularly known as standard input (stdin). Stdin is an InputStream, initially set to taking input from the keyboard, but it can also read from a file at runtime like this: % java MyClass < infile.txt There's a third System i/o file called standard error (stderr). System.err is another PrintStream designed to direct error messages in case you don't want your output and error messages going to the same place. Stderr is also initialized to the monitor, but you can redirect it like this: % java MyClass >& errfile.txt And you can combine the redirections: % java MyClass < infile.txt > outfile.txt >& errfile.txt There's lots of other stuff the System class provides for you--check it out. NOTE: System.out and System.err are the ONLY PrintStream object you should use. Class PrintStream is deprecated. To output character streams, you should use PrintWriter (shown in a UML diagram above and in TestStreamTokenizer.java) instead. So what would you do if you wanted to, oh, I don't know, read in an arbitrarily long text file of floating point numbers? Dung X. Nguyen & Jim O'Donnell revised 04/07/02