docrep: A lightweight and efficient document representation framework

Tim Dawborn
e-lab, School of Information Technologies
University of Sydney
tim.dawborn@sydney.edu.au
2014-08-21

Corpora · docrep · Case study · Utilities · Discussion

Corpus processing
• NLP is a data-driven research discipline
• Researchers are utilising a diverse collection of large-scale corpora
• For example, the OntoNotes 5 corpus has:
    • Tokens
    • POS tags
    • Parse trees
    • Predicate constituents and their arguments
    • Word senses
    • In-document coreference
    • Named entities
    • Links to the Omega ontology

Some issues with large-scale corpus processing
• Representing overlapping, multi-layered annotation
• Reproducibility
• Scalability

Overlapping, multi-layered annotation
[Figure: the sentences "I have seen one or two men die , *PRO* bless them ." and "Another women wrote from Sheffield *PRO*-1 to say that in her 60 years of ringing , `` I have never known a lady to faint in the belfry ." shown with overlapping annotation layers: POS tags, parse tree, predicate-argument structures (see.01, die.01, bless.01, write.01, say.01, faint.01), and named entities (GPE, DATE).]

Representing overlapping, multi-layered annotation
• Flat-file representations of corpora are common
    • Easy to inspect and (often) easy to process
    • Cannot easily store multiple or nested annotation layers
    • Custom I/O required for each format

    % less nw/wsj/00/wsj_0089.parse
    ...
    (TOP (S (S (NP-SBJ (PRP I)) (VP (VBP have) (VP (VBN seen) (S (NP-SBJ (QP (CD one) (CC or) (CD two)) (NNS men)) (VP (VB die)))))) (, ,) (S (NP-SBJ (-NONE- *PRO*)) (VP (VB bless) (NP (PRP them)))) (. .)))

    % less nw/wsj/00/wsj_0089.name
    ...
    Another women wrote from

    % less nw/wsj/00/wsj_0089.sense
    ...
    nw/wsj/00/wsj_0089@0089@wsj@nw@en@on 63 1 woman-n ?,? 1
    nw/wsj/00/wsj_0089@0089@wsj@nw@en@on 63 2 write-v 1
    nw/wsj/00/wsj_0089@0089@wsj@nw@en@on 63 6 say-v ?,? 1
    nw/wsj/00/wsj_0089@0089@wsj@nw@en@on 63 13 ring-v 1
    nw/wsj/00/wsj_0089@0089@wsj@nw@en@on 63 19 know-v 4
    nw/wsj/00/wsj_0089@0089@wsj@nw@en@on 64 2 see-v ?,? 3
    nw/wsj/00/wsj_0089@0089@wsj@nw@en@on 64 7 die-v ?,? 1
    nw/wsj/00/wsj_0089@0089@wsj@nw@en@on 64 9 bless-v 1

    % less nw/wsj/00/wsj_0089.coref
    Sheffield to say that in her 60 years of ringing , `` I have never known a lady to faint in the belfry . I have seen one or two men die , bless them .
    ...

    % less nw/wsj/00/wsj_0089.prop
    ...
    63 2 gold write-v write.01 ----- 2:0-rel 0:1-ARG0 3:1-ARGM-DIR 5:2-ARGM-PNC
    63 7 gold say-v say.01 ----- 7:0-rel 5:0-ARG0 8:1-ARG1
    63 20 gold know-v know.01 ----- 20:0-rel 9:1-ARGM-TMP 17:1-ARG0 19:1-ARGM-NEG 21:2-ARG1
    63 24 gold faint-v faint.01 ----- 24:0-rel 21:1-ARG0 25:1-ARGM-LOC
    64 2 gold see-v see.01 ----- 2:0-rel 0:1-ARG0 3:3-ARG1
    64 7 gold die-v die.01 ----- 7:0-rel 3:2-ARG1
    64 10 gold bless-v bless.01 ----- 10:0-rel 9:1-ARG0 11:1-ARG1

Representing overlapping, multi-layered annotation
• Traditional structured storage representations, such as databases, are less common
    • Harder to inspect
    • May require specialised search tools to extract information
• Document Representation Frameworks (DRFs) aim to provide a better solution for the storage of corpora

Document Representation Frameworks
• GATE and UIMA are the main DRFs used within the CL community
    • Very Java oriented
    • Neither works very well in a non-IDE context
    • Both are quite large frameworks and have a high entry cost
• This leads researchers to roll their own makeshift DRF instead
    • This is obviously not a good thing
• The field lacks a language-agnostic, non-IDE friendly, low entry cost DRF

What is docrep?
• docrep is a programming language agnostic DRF specification
    • We have provided implementations in C++, Python, and Java
    • Available from https://github.com/schwa-lab (MIT licence)
• Programming language agnostic framework
• Streaming document representation
• Self-describing on the wire; no external files needed
• Well-established binary serialisation format (MessagePack)
• UNIX tool philosophy

docrep has been heavily used
• We have used docrep for a while, but the two COLING papers are the first papers presenting it
• Primary data format used in our research lab since mid-2012
    • Both for research and commercial projects
• All of our tools talk docrep
    • Common data format and run-time representation across people, tools, and programming languages
    • No custom I/O code required once in docrep format

Why should you use docrep?
• It looks like the data modelling code you would write anyway

    from schwa import dr

    class Token(dr.Ann):
        span = dr.Slice()
        raw = dr.Text()

    class NamedEntity(dr.Ann):
        span = dr.Slice(Token)
        tag = dr.Text()
        start_offset = dr.Field()
        end_offset = dr.Field()

    class Doc(dr.Doc):
        doc_id = dr.Text()
        tokens = dr.Store(Token)
        named_entities = dr.Store(NamedEntity)

    # Stream documents one at a time from a docrep file.
    reader = dr.Reader(open(filename, 'rb'), Doc)
    for doc in reader:
        process_doc(doc)

Why should you use docrep?
• APIs in C++, Python, and Java

    @dr.Ann
    public class Token extends AbstractAnn {
      @dr.Field public ByteSlice span;
      @dr.Field public String raw;
    }

    @dr.Ann
    public class NamedEntity extends AbstractAnn {
      @dr.Pointer public Slice<Token> span;
      @dr.Field public String tag;
      @dr.Field public int startOffset;
      @dr.Field public int endOffset;
    }

    @dr.Doc
    public class Doc extends AbstractDoc {
      @dr.Field public String docId;
      @dr.Store public Store<Token> tokens;
      @dr.Store public Store<NamedEntity> namedEntities;
    }

Why should you use docrep?
• No XML; use native type descriptors

Why should you use docrep?
• Command-line friendly; no IDE required

    % dr grep 'doc.lang == "en" && len(doc.tokens) < 100' corpus.dr | \
        dr count -s named_entities | sort -rn | head -n 1
    32

• docrep includes a rich set of UNIX tool style commands for very common operations
• ⇒ no custom code required for performing common tasks

Why should you use docrep?
• As compact and efficient as what Google uses (Protocol Buffers)
• Converting the CoNLL 2003 NER corpus to docrep using different commonly used binary serialisation protocols:

                        Self-          Uncompressed            Compressed
                        describing   Time (s)  Size (MB)   Time (s)  Size (MB)
    Original data       -            -           31.30      1.0         5.95
    BSON                yes          2.5        188.42      5.3        30.32
    MessagePack         yes          1.6         52.15      3.2        16.61
    Protocol Buffers    no           1.4         51.51      3.5        18.52
    Thrift              no           1.0        126.12      3.5        20.64

Why should you use docrep?
• Lazy loading API
    • Only the stores and fields specified at runtime will be deserialised
    • Read-only and non-deserialised data streamed as a memcpy
• Very fast modifications to a subset of the document schema
    • E.g. adding a new field or annotation layer

How is docrep different to other DRFs?
• Streaming document representation
    • Self-describing on the wire; no external files needed
    • UNIX tool philosophy
• Formally can represent annotations over sequential spans of other annotation instances
• Cannot formally represent cross-document information due to streaming model

Slices over bytes and annotation instances
• The Slice field type represents an ⟨index, length⟩ integer pair into some ordered list of data
    • Slices over the original document byte array
    • Slices over a Store of annotation instances

    class Token(dr.Ann):
        span = dr.Slice()  # Slice over bytes.
        raw = dr.Text()

    class NamedEntity(dr.Ann):
        span = dr.Slice(Token)  # Slice over objects in the `tokens` store.
        tag = dr.Text()
        start_offset = dr.Field()
        end_offset = dr.Field()

    class Doc(dr.Doc):
        doc_id = dr.Text()
        tokens = dr.Store(Token)
        named_entities = dr.Store(NamedEntity)

What are docrep annotations able to model?
• Most kinds of data can be stored as attributes of an annotation
    • Primitive data types
    • Byte and Unicode strings
    • Pointers to other Ann instances on the document
    • Lists of pointers to other Ann instances
    • Byte or annotation slices

    class ParseNode(dr.Ann):
        tag = dr.Text()
        token = dr.Pointer(Token)
        parent = dr.SelfPointer()
        children = dr.SelfPointers()
        score = dr.Field()

Converting a multi-layered corpus to docrep
• The OntoNotes 5 corpus is a good candidate for comparing the effectiveness of DRFs
    • 15 710 documents; 13 109 English, 2002 Chinese, and 599 Arabic
    • 8 overlapping annotation layers

The conversion process
• We converted OntoNotes 5 to docrep and UIMA
• Performed the same conversion process using all available APIs
    • UIMA: Java and C++
    • docrep: Java, C++, and Python
• Empirical measurements:
    • Conversion time
    • Serialisation time
    • Size of serialised corpus
• Non-empirical measurements:
    • Ease of installation
    • Ease of use as a developer

How does docrep compare empirically?
• docrep is up to 34 times faster than UIMA
• docrep is up to 9 times more compact than UIMA

    UIMA                      Java   Java   Java   Java    C++    C++    C++
                               XMI   XCAS    bin   cbin    XMI   XCAS    bin
    Conversion time (s)         25     25     25     25     77     77     77
    Serialisation time (s)     131    122   2103     76    630    611    695
    Size on disk (MB)         1894   3252   1257     99   2141   3252   2135

    docrep                    Java    C++  Python
    Conversion time (s)         12     12     27
    Serialisation time (s)      61     23     32
    Size on disk (MB)          371    371    371

How does docrep compare as a Java developer?
• UIMA:
    • Download and install Eclipse
    • Install the special UIMA Eclipse plugins
    • Use Eclipse plugins to generate UIMA XML type files
    • Use Eclipse plugins to convert UIMA XML files into Java source files
    • Start using your types in Java code
• docrep:
    • % mvn install libschwa-java
    • Start writing and using your types in Java code
    • Supports Java ≥ 1.6

How does docrep compare as a Python developer?
• UIMA:
    • Doesn't exist
• docrep:
    • % pip install libschwa-python
    • Start writing and using your types in Python code
    • Supports Python 2.7 and ≥ 3.3

How does docrep compare as a C++ developer?
• UIMA:
    • This was a very unpleasant process
    • Download, compile, install specific versions of multiple very large third-party libraries
        • Apache Portable Runtime, ICU, Apache Xerces, JNI
    • Discover there is little or out-of-date documentation
    • Read the source code to try and work out how to use it
• docrep:
    • OS X: % brew install libschwa
    • Other: % ./configure && make && make install
    • Start writing and using your types in C++ code
    • Written in C++11

Need for common manipulation tools
• Corpus linguistics often involves repeated tasks
    • Filtering
    • Counting
    • Sorting
    • Manipulating
    • etc.
• These operations are harder with a binary format
• Need a set of useful manipulation tools
    • Ideally, that the user already knows how to use

Command-line utilities for various tasks
• For people who use UNIX tools, existing OIAFs aren't idiomatic
• docrep is a streaming format ⇒ UNIX tool friendly
    • Document streams sent between processes using UNIX pipes
• We provide command-line tools that mimic their UNIX equivalent
    • Python API makes common operations very easy to implement
• These tools make these common operations first-class citizens
• ⇒ No custom code required

UNIX tools for manipulating corpora

    Command      Description               Similar in UNIX
    dr grep      Select documents          grep
    dr head      Select prefix             head
    dr tail      Select suffix             tail
    dr less      View raw annotations      less
    dr count     Aggregate                 wc
    dr sort      Reorder documents         sort
    dr sample    Random documents          shuf -n
    dr split     Partition                 split
    dr format    Excerpt                   printf / awk
    dr shell     Interactive exploration   python
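The tools listed above behave like filters over a stream of documents, which is why they compose with pipes. As a rough illustration only, the same idea can be sketched in plain Python with generators. This does not use the docrep API: documents are modelled as ordinary dicts, and the function names `grep`, `head`, and `count` are hypothetical stand-ins chosen to echo the command names.

```python
from itertools import islice
from typing import Any, Callable, Dict, Iterable, Iterator, List

# Assumption: a document is modelled as a plain dict, not a real docrep Doc.
Doc = Dict[str, Any]


def grep(docs: Iterable[Doc], predicate: Callable[[Doc], bool]) -> Iterator[Doc]:
    """Lazily yield only the documents for which the predicate holds."""
    for doc in docs:
        if predicate(doc):
            yield doc


def head(docs: Iterable[Doc], n: int) -> Iterator[Doc]:
    """Lazily yield at most the first n documents of the stream."""
    return islice(docs, n)


def count(docs: Iterable[Doc], store: str) -> List[int]:
    """Return the number of items in the named store, one entry per document."""
    return [len(doc[store]) for doc in docs]


# A tiny made-up corpus, purely for illustration.
corpus = [
    {"lang": "en", "tokens": ["a", "b", "c"], "named_entities": ["x"]},
    {"lang": "de", "tokens": ["d"], "named_entities": []},
    {"lang": "en", "tokens": ["e", "f"], "named_entities": ["y", "z"]},
]

# Select short English documents, then find the largest named entity count.
english = grep(corpus, lambda d: d["lang"] == "en" and len(d["tokens"]) < 100)
largest = max(count(english, "named_entities"))
first_two = list(head(corpus, 2))
```

Because each stage only consumes an iterator and produces another, no stage needs the whole corpus in memory, which mirrors the streaming model the slides describe.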
Using the command-line tools
• Working example of the WSJ OntoNotes 5 files

    % ls wsj_0001.*
    wsj_0001.name wsj_0001.onf wsj_0001.parse wsj_0001.prop wsj_0001.sense

• Each annotation layer is stored in a separate file
• Each file has a completely different format
• wsj_0001.dr is the equivalent docrep file

Viewing annotation layers
• How do we inspect the different annotation layers?
• Plain text: % less wsj_0001.name
• docrep: % dr less wsj_0001.dr
• docrep file contains all of the different annotation layers

Counting annotation instances
• How many of each annotation instance are there on wsj_0001?
• Plain text: different process for each format

    % egrep -o '([^() ]+ [^()]+)' wsj_0001.parse | wc -l
    31
    % egrep -o '<[A-Z]+ TYPE="[^"]*">[^<]+</[A-Z]+>' wsj_0001.name | wc -l
    6

• Error prone, especially when dealing with SGML or XML
• docrep:

    % dr count -a wsj_0001.dr
    ndocs  ...  named_entities  parse_nodes  ...  tokens
    1      ...  6               53           ...  31

Combining annotations from different documents
• Combine the named entity annotations for all documents together to send to a third-party tool you are evaluating?
• Plain text: not easily, since they're in SGML/XML.
    • Requires a specific script to combine the SGML/XML together
• docrep: % cat *.dr | the-external-tool
    • docrep files can simply be concatenated together

Partitioning corpora
• Split your corpus into 10 folds for cross-validation.
• Plain text: custom script. Very format dependent.
• docrep:

    # Round-robin split the corpus into 10 folds.
    % cat *.dr | dr split k 10
    # Or add some random shuffling into the mix.
    % cat *.dr | dr sort random | dr split k 10
    # Or split by an attribute on the document.
    % cat *.dr | dr split -t '{key}.dr' py doc.genre

An interactive Python shell for exploration
• Interpreted language can harness the self-describing representation
• dr shell for exploring docrep streams interactively
    • Reads each document in sequence
    • Inspect objects using dir or tab-completion

Interactive example
[Figure not recoverable]

A paradigm for rapid development
• These tools grew organically as we needed them
• Facilitates rapid prototyping
• Facilitates debugging and quality assurance

Different approaches for different people

    % dr grep 'doc.id ~ /x-\d+/' corpus.dr | \
        dr count -s tokens | sort -rn | head -n 1
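For readers who prefer code to pipelines, the three partitioning strategies shown under "Partitioning corpora" (round-robin, shuffled round-robin, and split by a document attribute) can be sketched in plain Python. This is only an illustration of the partitioning logic under simple assumptions: it is not how `dr split` is implemented, the function names are hypothetical, and documents are represented by placeholder strings and dicts rather than docrep objects.

```python
import random
from typing import Callable, Dict, Hashable, List, Sequence, TypeVar

T = TypeVar("T")


def round_robin_split(items: Sequence[T], k: int) -> List[List[T]]:
    """Deal items into k folds in turn: item i goes to fold i mod k."""
    if k <= 0:
        raise ValueError("k must be positive")
    folds: List[List[T]] = [[] for _ in range(k)]
    for i, item in enumerate(items):
        folds[i % k].append(item)
    return folds


def shuffled_split(items: Sequence[T], k: int, seed: int = 0) -> List[List[T]]:
    """Shuffle a copy of the items with a fixed seed, then deal round-robin."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return round_robin_split(shuffled, k)


def split_by_key(items: Sequence[T], key: Callable[[T], Hashable]) -> Dict[Hashable, List[T]]:
    """Group items by the value of a key function, preserving input order."""
    groups: Dict[Hashable, List[T]] = {}
    for item in items:
        groups.setdefault(key(item), []).append(item)
    return groups


# Placeholder document identifiers, purely for illustration.
doc_ids = ["doc%02d" % i for i in range(23)]
folds = round_robin_split(doc_ids, 10)
sizes = [len(fold) for fold in folds]

# Placeholder documents with a made-up genre attribute.
docs = [{"id": "a", "genre": "nw"}, {"id": "b", "genre": "bc"}, {"id": "c", "genre": "nw"}]
by_genre = split_by_key(docs, lambda d: d["genre"])
```

Round-robin dealing keeps fold sizes within one document of each other and needs no knowledge of the corpus size up front, so it suits a streaming setting; the fixed seed in `shuffled_split` keeps the folds reproducible across runs.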
Conclusion
• We want people to use DRFs
• So let's make it easy for them to do so
    • Programming-language agnostic
    • Low cost of entry
    • No infrastructure or IDE requirements
• Rich set of UNIX tool inspired command-line tools
    • Makes docrep easier to use than plain-text files

COLING and Acknowledgements
• OIAF4HLT workshop talk: Saturday, 14:00
• Main COLING talk: Monday, 16:35
• Special thanks to Will Radford and Joel Nothman for their contributions to docrep over the past years
• This work was supported by ARC Discovery grant DP1097291 and the Capital Markets CRC Computable News project

[Residue of a UIMA XML type descriptor example: type ontonotes5.to_uima.types.NamedEntity, supertype uima.tcas.Annotation, with features tag (uima.cas.String), startOffset (uima.cas.Integer), and endOffset (uima.cas.Integer).]