Command-line utilities for managing and exploring
annotated corpora
Joel Nothman Tim Dawborn James R. Curran
e-lab, School of Information Technologies
University of Sydney
{joel.nothman,tim.dawborn,james.r.curran}@sydney.edu.au
OIAF4HLT
At the 25th International Conference on Computational Linguistics (COLING 2014)
Introduction docrep Utilities Interactive Discussion 2
Adopting a framework means learning new tools
• Frameworks normally involve custom data formats
• Plain text files are easy to inspect and transform
• Need tools for rapid corpus exploration and prototyping
• Users may already be familiar with Unix tools, SQL, or Python
• Give them familiar tools
Tim Dawborn Command-line utilities for managing and exploring annotated corpora OIAF4HLT 2014
A streaming document representation
• Dawborn and Curran (COLING 2014) introduce docrep
• Lightweight, easy to use document representation framework
• https://github.com/schwa-lab
• Programming language agnostic framework
• Streaming document representation
• Self-describing on the wire; no external files needed
• Well-established binary serialisation format (MessagePack)
• Follows the Unix tool philosophy
docrep has been heavily used
• Primary data format used in our research lab since mid-2012
• Both for research projects and commercial projects
• All of our tools talk docrep
• Common data format and run-time representation between
people, tools, and programming languages
• No custom I/O code required
A simple example: Python docrep
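    from schwa import dr  # assumes the libschwa Python bindings are installed

    class Token(dr.Ann):
        span = dr.Slice()        # byte span of the token in the source text
        raw = dr.Text()          # raw token string

    class NamedEntity(dr.Ann):
        span = dr.Slice(Token)   # span over Token annotations
        tag = dr.Text()          # entity type label
        start_offset = dr.Field()
        end_offset = dr.Field()

    class Doc(dr.Doc):
        doc_id = dr.Text()
        tokens = dr.Store(Token)
        named_entities = dr.Store(NamedEntity)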
• Defines Token and NamedEntity annotation types
• Defines a Document to store them on
• This is a complete functional example in Python.
A simple example: Java docrep
    @dr.Ann
    public class Token extends AbstractAnn {
      @dr.Field public ByteSlice span;
      @dr.Field public String raw;
    }

    @dr.Ann
    public class NamedEntity extends AbstractAnn {
      @dr.Pointer public Slice<Token> span;  // span over Token annotations
      @dr.Field public String tag;
      @dr.Field public int startOffset;
      @dr.Field public int endOffset;
    }

    @dr.Doc
    public class Doc extends AbstractDoc {
      @dr.Field public String docId;
      @dr.Store public Store<Token> tokens;
      @dr.Store public Store<NamedEntity> namedEntities;
    }
A simple example: UIMA
• Create the XML type descriptor.
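    <typeDescription>
      <name>ontonotes5.to_uima.types.NamedEntity</name>
      <supertypeName>uima.tcas.Annotation</supertypeName>
      <features>
        <featureDescription>
          <name>tag</name>
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
        <featureDescription>
          <name>startOffset</name>
          <rangeTypeName>uima.cas.Integer</rangeTypeName>
        </featureDescription>
        <featureDescription>
          <name>endOffset</name>
          <rangeTypeName>uima.cas.Integer</rangeTypeName>
        </featureDescription>
      </features>
    </typeDescription>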
• Invoke the external JCasGen program to convert the XML into the
appropriate UIMA Java source file.
• Repeat for each annotation type.
Command-line utilities for various tasks
• For researchers who use Unix tools, existing document representation
frameworks (DRFs) aren’t idiomatic
• docrep is a streaming format ⇒ Unix-tool friendly
• Document streams are sent between processes using Unix pipes
• We provide command-line tools that mimic their Unix equivalents
• They aim to feel as idiomatic and intuitive as possible
Unix tools for manipulating corpora
Command      Description               Similar in Unix
dr count     Aggregate                 wc
dr dump      View raw annotations      hexdump
dr format    Excerpt                   printf / awk
dr grep      Select documents          grep
dr head      Select prefix             head
dr less      View raw annotations      less
dr sample    Random documents          shuf -n
dr shell     Interactive exploration   python
dr sort      Reorder documents         sort
dr split     Partition                 split
dr tail      Select suffix             tail
Using the command-line tools
• Working example using the WSJ OntoNotes 5 files
% ls wsj_0001.*
wsj_0001.dr wsj_0001.name wsj_0001.onf
wsj_0001.parse wsj_0001.prop wsj_0001.sense
• Each annotation layer is stored in a separate file
• Each file has a completely different format
• wsj_0001.dr is the equivalent docrep file
Viewing annotation layers
• How do we inspect the different annotation layers?
• Plain text:
% less wsj_0001.name
• docrep:
% dr less wsj_0001.dr
• docrep file contains all of the different annotation layers
Counting annotation instances
• How many instances of each annotation type are there in wsj_0001?
• Plain text: different process for each format
% egrep -o '([^() ]+ [^()]+)' wsj_0001.parse | wc -l
31
% egrep -o '<[A-Z]+ TYPE="[^"]*">[^<]+' wsj_0001.name | wc -l
6
• Very error-prone, especially when dealing with SGML or XML
• docrep:
% dr count -a wsj_0001.dr
ndocs ... named_entities parse_nodes ... tokens
1 ... 6 53 ... 31
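
The plain-text count for the name layer can also be sketched in Python, reusing the pattern from the egrep command above. This is only an illustration: `count_entities` is a hypothetical helper, and the sample string is invented in the style of a .name file rather than taken from the corpus.

```python
import re
from collections import Counter

# Same pattern as the egrep command, with a capture group around the TYPE value.
ENTITY_RE = re.compile(r'<[A-Z]+ TYPE="([^"]*)">[^<]+')


def count_entities(text):
    """Count entity mentions per TYPE attribute in SGML-like markup."""
    return Counter(ENTITY_RE.findall(text))


# Invented sample (not real corpus content).
sample = (
    '<ENAMEX TYPE="PERSON">Jane Doe</ENAMEX> joined '
    '<ENAMEX TYPE="ORG">Example Corp</ENAMEX> on '
    '<ENAMEX TYPE="DATE">Nov. 29</ENAMEX>.'
)

counts = count_entities(sample)
print(sum(counts.values()), dict(counts))
```

A pattern like this has to be rewritten for every annotation format, which is the fragility the single dr count invocation avoids.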
Combining annotations from different documents
• How do we combine the named entity annotations from all documents
to send to a third-party tool we are evaluating?
• Plain text: not easily, since they’re in SGML/XML.
% ./my-combining-script *.name | the-external-tool
• Requires a specific script to combine the SGML/XML together
• docrep:
% cat *.dr | the-external-tool
• docrep files can simply be concatenated together
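
Why concatenation works can be sketched with any self-delimiting record format. The example below is not docrep's wire format (docrep uses MessagePack); it uses length-prefixed JSON from the Python standard library as a stand-in, and `write_doc` and `read_docs` are hypothetical names.

```python
import io
import json
import struct


def write_doc(stream, doc):
    """Append one document as a 4-byte big-endian length plus a JSON payload."""
    payload = json.dumps(doc).encode("utf-8")
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)


def read_docs(stream):
    """Yield documents one at a time until the stream is exhausted."""
    while True:
        header = stream.read(4)
        if not header:
            return
        if len(header) < 4:
            raise ValueError("truncated length header")
        (size,) = struct.unpack(">I", header)
        payload = stream.read(size)
        if len(payload) < size:
            raise ValueError("truncated document payload")
        yield json.loads(payload.decode("utf-8"))


# Two independent "files", each holding one document.
file_a, file_b = io.BytesIO(), io.BytesIO()
write_doc(file_a, {"doc_id": "a", "tokens": ["Hello"]})
write_doc(file_b, {"doc_id": "b", "tokens": ["World"]})

# Byte-level concatenation, as cat would do.
combined = io.BytesIO(file_a.getvalue() + file_b.getvalue())
doc_ids = [doc["doc_id"] for doc in read_docs(combined)]
print(doc_ids)
```

Because every record carries its own boundary, the reader needs no knowledge of where one source file ended and the next began.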
Partitioning corpora
• Split your corpus into N folds for cross-validation.
• Plain text: custom script. Very format-dependent.
% ./my-splitting-script -n N *
• docrep:
% cat *.dr | dr split k N
# Or add some random shuffling into the mix.
% cat *.dr | dr sort random | dr split k N
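
The partitioning step can be sketched in Python as follows. This is a minimal illustration, not the implementation of dr split: `split_folds` is a hypothetical function, and dealing documents out round-robin is an assumption about how folds could be assigned.

```python
import random


def split_folds(docs, k, shuffle=False, seed=None):
    """Deal documents into k folds round-robin, optionally shuffling first."""
    if k < 1:
        raise ValueError("k must be at least 1")
    docs = list(docs)
    if shuffle:
        # A seeded generator keeps the shuffle reproducible.
        random.Random(seed).shuffle(docs)
    folds = [[] for _ in range(k)]
    for index, doc in enumerate(docs):
        folds[index % k].append(doc)
    return folds


doc_ids = ["doc%02d" % i for i in range(10)]
folds = split_folds(doc_ids, k=3)
print([len(fold) for fold in folds])

# Shuffling first plays the role of the sort-then-split pipeline above.
shuffled = split_folds(doc_ids, k=3, shuffle=True, seed=0)
```

Shuffling before splitting avoids folds that mirror the original file order, which matters when documents are grouped by date or source.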
An interactive Python shell for exploration
• An interpreted language can harness the self-describing
representation
• dr shell for exploring docrep streams interactively
• Reads each document in sequence
• Inspect objects using dir or tab-completion
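
How an interpreted language can exploit a self-describing stream can be sketched as follows: classes are built at run time from a schema and then inspected with dir. This is not how dr shell is implemented; the `schema` dictionary and `build_class` are hypothetical stand-ins for whatever the stream header describes.

```python
def build_class(name, field_names):
    """Create a simple record class at run time from a list of field names."""

    def __init__(self, **values):
        # Fields missing from the keyword arguments default to None.
        for field in field_names:
            setattr(self, field, values.get(field))

    attrs = {"__init__": __init__, "fields": tuple(field_names)}
    return type(name, (object,), attrs)


# Hypothetical schema, as it might be read from a stream header.
schema = {"Token": ["span", "raw"], "NamedEntity": ["span", "tag"]}
classes = {name: build_class(name, fields) for name, fields in schema.items()}

token = classes["Token"](span=(0, 6), raw="Pierre")
public_attrs = [attr for attr in dir(token) if not attr.startswith("_")]
print(public_attrs)
```

No class definitions had to be written in advance, so unfamiliar corpora can be explored without first knowing their schema.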
Paradigms of rapid development
• These tools grew organically as we needed them
• Facilitates rapid prototyping
• Facilitates debugging and quality assurance
Conclusion
• We want people to use DRFs
• So let’s make it easy for them to do so
• These command-line tools harness the power of docrep
• Idiomatic
• Expressive
• Intuitive for users of Unix tools
• Makes docrep easier to use than plain-text files