Command-line utilities for managing and exploring annotated corpora

Joel Nothman, Tim Dawborn, James R. Curran
e-lab, School of Information Technologies, University of Sydney
{joel.nothman,tim.dawborn,james.r.curran}@sydney.edu.au

OIAF4HLT, at the 25th International Conference on Computational Linguistics (COLING 2014)

Introduction, docrep, Utilities, Interactive, Discussion

Adopting a framework means learning new tools
• Frameworks normally involve custom data formats
• Plain text files are easy to inspect and transform
• Users need tools for rapid corpus exploration and prototyping
• Users may already be familiar with unix tools, sql or Python
• So give them familiar tools

Tim Dawborn, Command-line utilities for managing and exploring annotated corpora, OIAF4HLT 2014

A streaming document representation
• Dawborn and Curran (COLING 2014) introduce docrep
• Lightweight, easy-to-use document representation framework
• https://github.com/schwa-lab
• Programming-language-agnostic framework
• Streaming document representation
• Self-describing on the wire; no external files needed
• Well-established binary serialisation format (MessagePack)
• unix-tool philosophy

docrep has been heavily used
• Primary data format used in our research lab since mid-2012
• Used for both research projects and commercial projects
• All of our tools talk docrep
• Common data format and run-time representation between people, tools, and programming languages
• No custom I/O code required

A simple example: Python docrep

    # Assumed import: docrep's Python bindings exposed as schwa.dr
    from schwa import dr

    class Token(dr.Ann):
        span = dr.Slice()
        raw = dr.Text()

    class NamedEntity(dr.Ann):
        span = dr.Slice(Token)
        tag = dr.Text()
        start_offset = dr.Field()
        end_offset = dr.Field()

    class Doc(dr.Doc):
        doc_id = dr.Text()
        tokens = dr.Store(Token)
        named_entities = dr.Store(NamedEntity)

• Defines Token and NamedEntity annotation types
• Defines a Document to store them on
• This is a complete functional example in Python

A simple example: Java docrep

    @dr.Ann
    public class Token extends AbstractAnn {
      @dr.Field public ByteSlice span;
      @dr.Field public String raw;
    }

    @dr.Ann
    public class NamedEntity extends AbstractAnn {
      @dr.Pointer public Slice<Token> span;
      @dr.Field public String tag;
      @dr.Field public int startOffset;
      @dr.Field public int endOffset;
    }

    @dr.Doc
    public class Doc extends AbstractDoc {
      @dr.Field public String docId;
      @dr.Store public Store<Token> tokens;
      @dr.Store public Store<NamedEntity> namedEntities;
    }

A simple example: uima
• Create the xml type descriptor.
• Invoke the jcasgen external program to convert the xml into the appropriate uima Java source file.
• Repeat for each annotation type.
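
To make the data model in the Python docrep example concrete, here is a plain-Python sketch of what its two slice fields express. It uses only the standard library and is not the docrep API: a Token's span is a byte range over the document text, while a NamedEntity's span is a range over the tokens store. The `text` field, the `surface` and `tokenize` helpers, and the sample sentence are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    span: slice  # byte offsets into the document text
    raw: str


@dataclass
class NamedEntity:
    span: slice  # token indices into Doc.tokens
    tag: str


@dataclass
class Doc:
    doc_id: str
    text: bytes  # hypothetical: the raw text that token spans index into
    tokens: List[Token] = field(default_factory=list)
    named_entities: List[NamedEntity] = field(default_factory=list)

    def surface(self, entity: NamedEntity) -> str:
        """Join the raw strings of the tokens an entity covers."""
        return " ".join(t.raw for t in self.tokens[entity.span])


def tokenize(text: bytes) -> List[Token]:
    """Whitespace tokenizer that records byte offsets for each token."""
    tokens, pos = [], 0
    for piece in text.split():
        start = text.index(piece, pos)
        pos = start + len(piece)
        tokens.append(Token(slice(start, pos), piece.decode("utf-8")))
    return tokens


text = b"Pierre Vinken will join the board"
doc = Doc(doc_id="example", text=text, tokens=tokenize(text))
doc.named_entities.append(NamedEntity(slice(0, 2), "PERSON"))

first = doc.named_entities[0]
print(first.tag, doc.surface(first))  # PERSON Pierre Vinken
```

Because the entity stores token indices instead of copying the strings, each annotation layer stays aligned with the layer beneath it.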

Command-line utilities for various tasks
• For researchers who use unix tools, existing document representation frameworks (drfs) aren't idiomatic
• docrep is a streaming format ⇒ unix tool friendly
• Document streams are sent between processes using unix pipes
• We provide command-line tools that mimic their unix equivalents
• They aim to feel as idiomatic and intuitive as possible

unix tools for manipulating corpora

    Command      Description               Similar in unix
    dr count     Aggregate                 wc
    dr dump      View raw annotations      hexdump
    dr format    Excerpt                   printf / awk
    dr grep      Select documents          grep
    dr head      Select prefix             head
    dr less      View raw annotations      less
    dr sample    Random documents          shuf -n
    dr shell     Interactive exploration   python
    dr sort      Reorder documents         sort
    dr split     Partition                 split
    dr tail      Select suffix             tail

Using the command-line tools
• Working example: the wsj OntoNotes 5 files

    % ls wsj_0001.*
    wsj_0001.dr wsj_0001.name wsj_0001.onf wsj_0001.parse wsj_0001.prop wsj_0001.sense

• Each annotation layer is stored in a separate file
• Each file has a completely different format
• wsj_0001.dr is the equivalent docrep file

Viewing annotation layers
• How do we inspect the different annotation layers?
• Plain text:

    % less wsj_0001.name

• docrep:

    % dr less wsj_0001.dr

• The docrep file contains all of the different annotation layers

Counting annotation instances
• How many of each annotation instance are there in wsj_0001?
• Plain text: a different process for each format

    % egrep -o '([^() ]+ [^()]+)' wsj_0001.parse | wc -l
    31
    % egrep -o '<[A-Z]+ TYPE="[^"]*">[^<]+</[A-Z]+>' wsj_0001.name | wc -l
    6

• Very error prone, especially when dealing with sgml or xml
• docrep:

    % dr count -a wsj_0001.dr
    ndocs  ...  named_entities  parse_nodes  ...  tokens
    1      ...  6               53           ...  31

Combining annotations from different documents
• How do you combine the named entity annotations for all documents to send to a third-party tool you are evaluating?
• Plain text: not easily, since they're in sgml/xml

    % ./my-combining-script *.name | the-external-tool

• Requires a specific script to combine the sgml/xml
• docrep:

    % cat *.dr | the-external-tool

• docrep files can simply be concatenated

Partitioning corpora
• Split your corpus into N folds for cross-validation
• Plain text: custom script, and very format dependent

    % ./my-splitting-script -n N *

• docrep:

    % cat *.dr | dr split k N
    # Or add some random shuffling into the mix.
    % cat *.dr | dr sort random | dr split k N

An interactive Python shell for exploration
• An interpreted language can harness the self-describing representation
• dr shell explores docrep streams interactively
• Reads each document in sequence
• Inspect objects using dir or tab-completion

Paradigms of rapid development
• These tools grew organically as we needed them
• They facilitate rapid prototyping
• They facilitate debugging and quality assurance

Conclusion
• We want people to use drfs
• So let's make it easy for them to do so
• These command-line tools harness the power of docrep
    • Idiomatic
    • Expressive
    • Intuitive for users of unix tools
• They make docrep easier to use than plain-text files

uima type descriptor (listing from the slide "A simple example: uima"; the xml markup did not survive extraction)

    type name:  ontonotes5.to_uima.types.NamedEntity
    supertype:  uima.tcas.Annotation
    features:   tag (uima.cas.String)
                startOffset (uima.cas.Integer)
                endOffset (uima.cas.Integer)
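
The "cat, then shuffle, then split" pipeline from the partitioning slide can be pictured with a small standard-library sketch. Everything here is an assumption made for illustration: the length-prefixed JSON records are a toy stand-in for docrep's MessagePack-based wire format, the function names are hypothetical, and round-robin fold assignment is one plausible policy, not a description of how dr split actually assigns documents.

```python
import io
import json
import random
import struct
from typing import BinaryIO, Dict, Iterable, Iterator, List


def write_docs(docs: Iterable[dict]) -> bytes:
    """Serialise documents as length-prefixed JSON records (toy format)."""
    out = io.BytesIO()
    for doc in docs:
        payload = json.dumps(doc).encode("utf-8")
        out.write(struct.pack(">I", len(payload)))  # 4-byte big-endian size
        out.write(payload)
    return out.getvalue()


def read_docs(stream: BinaryIO) -> Iterator[dict]:
    """Yield one document at a time until the stream is exhausted."""
    while True:
        header = stream.read(4)
        if not header:
            return
        (size,) = struct.unpack(">I", header)
        yield json.loads(stream.read(size).decode("utf-8"))


def split_folds(docs: Iterable[dict], k: int) -> Dict[int, List[dict]]:
    """Assign documents to k folds in round-robin order."""
    folds: Dict[int, List[dict]] = {i: [] for i in range(k)}
    for index, doc in enumerate(docs):
        folds[index % k].append(doc)
    return folds


# Two separately written "files"; joining the bytes plays the role of `cat`.
file_a = write_docs({"doc_id": f"a{i}"} for i in range(3))
file_b = write_docs({"doc_id": f"b{i}"} for i in range(2))
combined = list(read_docs(io.BytesIO(file_a + file_b)))

# Optional shuffle before splitting, seeded so the run is repeatable.
shuffled = combined[:]
random.Random(0).shuffle(shuffled)

folds = split_folds(shuffled, k=2)
for fold, members in folds.items():
    print(fold, [d["doc_id"] for d in members])
```

Because every record carries its own length, the reader needs no file-level header or footer, which is why plain byte concatenation yields a valid stream in this sketch.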