Assignment 1 - COMP4317/9317: XML and Databases
Assignment 1 DOM Parsing for XML Documents Due Date: March 15th, 2010 UPDATE: Despite our many warnings in assignment 1, many students who had correct results used a different formatting and therefore failed the automatic testings. As a special consideration, people who did such mistakes are allowed to resubmit until Monday 29th March 11:59 PM. The resubmission should not differ by more than 10 lines from the original one. If you choose to resubmit, you will get a global penalty of -0.5pt, no late penalty and the tests will be re-run. This will only be done for assignment 1. The update submission will be marked at the same time as assignment 2. There are two main models for parsing XML documents: DOM and SAX. In this assignment you are asked to write a program which uses the DOM to parse access, and navigate through the nodes of the input XML file. Your program should provide the following functions. Provide satistical information about the input XML document such as the total number of nodes of the input XML document (count only nodes of type Element, Attribute, (non-empty) Text, CDATA, Comment, and ProcessingInstruction) the number of nodes of each node type (Element / Attribute / non-empty Text) in the input XML document the number of empty text nodes. A text node is empty if it contains no character, or only blank, '\n', and '\t' characters. the maximal height of the XML tree (= longest path from root to a leaf; only considering element and non-empty text nodes) the maximal length of any sibling list (= maximal number of children of any node, only considerly element and non-empty text nodes) the number of distinct element names the number of distinct attribute names for each distinct element name (in alphabetical order) print the number of nodes with that name the set of element names that appear as children of those nodes and the set of element names that appear as descendants of those nodes. (those sets should be alphabetically order). Serialize the DOM tree back to an XML document using the following three user options no linebreaks/whitespace at all between the tree nodes (option "p1") seperate line for each item (option "p2") pretty print with identation for (nested) tree nodes (option "p3"); this means that after N start-element tags (and before the corresponding end-element tag) each line should start with N indentation characters ("\t"), and after each start/end-element tag or (non-empty, non-attribute) node there is a return ("\n"). Note: For s, p1, p2, p3: Remove/ignore all empty text nodes (except for their explicit count in s). For p1, p2, p3: Remove all leading and trailing whitespace in all text nodes. For s, as total number of nodes, do not count the extra document node provided by DOM (only count element, proc. instruction, comment, attribute, text, and CDATA nodes) For p2, an "item" means an opening/closing tag, or some text. For p3, indent using "\t" (TAB). For p1,p2,p3: It is OK to always print as first line "", even if that line was not present in the original XML input file. Your program should take a command line option which is one of [s|p1|p2|p3] (with "s" being the default) and as argument the file name of an XML document. Note: Do NOT worry about a DTD that could be present in the XML file. Your program will be tested with DTD-less files only! Write your program in C/C++ or Java, and use the correspondig Xerces DOM parser libraries: Xerces C++ Parser Xerces2 Java Parser Submit your code using one of the following commands on the CSE system: % give cs4317 ass1 filename.cpp % give cs4317 ass1 filename.java where filename is the (arbitrary) name of your source file. Java programs MUST compile (on CSE machines) with javac filename.java (some machines need javac -cp /usr/share/java/xerces.jar filename.java (or xercesImpl.jar)) C++ programs MUST compile (on CSE machines) with g++ -lxerces-c filename.cpp Example Runs Assume you have a file "test.xml" which consists of the following XML snipplet:
TCP/IP Illustrated
StevensJohn
Addison-Wesley
65.95
Your program should behave as follows (assuming the executable is named "DOMcat"):
> DOMcat test.xml
Total number of nodes: 15
Number of element nodes: 7
Number of attribute nodes: 3
Number of text nodes: 5
Number of empty text nodes: 5
Maximal height: 4
Maximal length of sibling list: 4
Number of distinct element names: 7
Number of distinct attribute names: 3
author, 1, {first, last}, {first, last}
book, 1, {author, price, publisher, title}, {author, first, last, price, publisher, title}
first, 1, {}, {}
last, 1, {}, {}
price, 1, {}, {}
publisher, 1, {}, {}
title, 1, {}, {}
> DOMcat -p1 test.xml
TCP/IP IllustratedStevensJohnAddison-Wesley65.95
> DOMcat -p2 test.xml
TCP/IP Illustrated
Stevens
John
Addison-Wesley
65.95
> DOMcat -p3 test.xml
TCP/IP Illustrated
Stevens
John
Addison-Wesley
65.95
As mentioned above, it is perfectly OK if your program always prints out as first line " cat t1.xml
>>>>blub]]>b
> java DOMTravel -s t1.xml
Total number of nodes: 3
Number of element nodes: 1
Number of attribute nodes: 0
Number of text nodes: 1
Number of empty text nodes: 0
Maximal height: 2
Maximal length of sibling list: 1
Number of distinct element names: 1
Number of distinct attribute names: 0
book, 1, {}, {}
> cat t2.xml
b>>>>blub]]>b
> java DOMTravel -s t2.xml
Total number of nodes: 4
Number of element nodes: 1
Number of attribute nodes: 0
Number of text nodes: 2
Number of empty text nodes: 0
Maximal height: 2
Maximal length of sibling list: 2
Number of distinct element names: 1
Number of distinct attribute names: 0
book, 1, {}, {}
> cat t3.xml
>>>>blub]]>
> java DOMTravel -s t3.xml
Total number of nodes: 2
Number of element nodes: 1
Number of attribute nodes: 0
Number of text nodes: 0
Number of empty text nodes: 0
Maximal height: 1
Maximal length of sibling list: 0
Number of distinct element names: 1
Number of distinct attribute names: 0
book, 1, {}, {}
Runs on Larger Files The following 2 XML files are taken from the "Testbed" zip file from this page. For this xml file you should get the following for the "s" option:
Total number of nodes: 291
Number of element nodes: 146
Number of attribute nodes: 29
Number of text nodes: 116
Number of empty text nodes: 146
Maximal height: 4
Maximal length of sibling list: 29
Number of distinct element names: 6
Number of distinct attribute names: 1
Course, 29, {Day, Instructor, Place, Time}, {Day, Instructor, Place, Time}
Day, 29, {}, {}
Instructor, 29, {}, {}
Place, 29, {}, {}
Time, 29, {}, {}
arizona, 1, {Course}, {Course, Day, Instructor, Place, Time}
For the p1, p2, and p3 options you should get p1, p2, and p3 (download and view these using a text editor). For this xml file you should get the following for the "s" option:
Total number of nodes: 433
Number of element nodes: 244
Number of attribute nodes: 27
Number of text nodes: 162
Number of empty text nodes: 244
Maximal height: 6
Maximal length of sibling list: 27
Number of distinct element names: 10
Number of distinct attribute names: 1
bu, 1, {course}, {college, course, courseInfo, days, instructor, room, schedule, time, title}
college, 27, {}, {}
course, 27, {college, courseInfo}, {college, courseInfo, days, instructor, room, schedule, time, title}
courseInfo, 27, {instructor, schedule, title}, {days, instructor, room, schedule, time, title}
days, 27, {}, {}
instructor, 27, {}, {}
room, 27, {}, {}
schedule, 27, {days, room, time}, {days, room, time}
time, 27, {}, {}
title, 27, {}, {}
For the p1, p2, and p3 options you should get p1, p2, and p3. Warning: download the files linked below and inspect them manually. Otherwise your browser might crash.. For this xml file you should get the following for the "s" option:
Total number of nodes: 514800
Number of element nodes: 277072
Number of attribute nodes: 733
Number of text nodes: 236994
Number of empty text nodes: 263595
Maximal height: 9
Maximal length of sibling list: 733
Number of distinct element names: 24
Number of distinct attribute names: 1
a, 23280, {}, {}
b, 4431, {}, {}
bib, 3487, {}, {}
cr, 9168, {}, {}
def, 4074, {cr}, {cr}
dictionary, 1, {e}, {a, b, bib, cr, def, e, et, hw, hwg, i, loc, pos, pr, q, qd, qp, qt, s, ss, vd, vf, vfl, w}
e, 733, {et, hwg, ss, vfl}, {a, b, bib, cr, def, et, hw, hwg, i, loc, pos, pr, q, qd, qp, qt, s, ss, vd, vf, vfl, w}
et, 512, {cr}, {cr}
hw, 983, {}, {}
hwg, 733, {hw, pos, pr}, {hw, pos, pr}
i, 4326, {}, {}
loc, 42632, {}, {}
pos, 729, {}, {}
pr, 790, {}, {}
q, 42632, {a, bib, loc, qd, qt, w}, {a, b, bib, cr, i, loc, qd, qt, w}
qd, 42632, {}, {}
qp, 4074, {q}, {a, b, bib, cr, i, loc, q, qd, qt, w}
qt, 42632, {b, cr, i}, {b, cr, i}
s, 4074, {def, qp}, {a, b, bib, cr, def, i, loc, q, qd, qp, qt, w}
ss, 733, {s}, {a, b, bib, cr, def, i, loc, q, qd, qp, qt, s, w}
vd, 284, {}, {}
vf, 1308, {}, {}
vfl, 192, {vd, vf}, {vd, vf}
w, 42632, {}, {}
For the p2 and p3 options you should get p2 and p3. CRICOS Provider Number: 00098G