docrep: A lightweight and efficient document
representation framework
Tim Dawborn
e-lab, School of Information Technologies
University of Sydney
tim.dawborn@sydney.edu.au
2014-08-21
Corpora docrep Case study Utilities Discussion 2
Corpus processing
• NLP is a data-driven research discipline
• Researchers are utilising a diverse collection of large-scale corpora
• For example, the OntoNotes 5 corpus has:
• Tokens
• POS tags
• Parse trees
• Predicate constituents and their arguments
• Word senses
• In-document coreference
• Named entities
• Links to the Omega ontology
Tim Dawborn docrep: A lightweight and efficient document representation framework 2014-08-21
Some issues with large-scale corpus processing
• Representing overlapping, multi-layered annotation
• Reproducibility
• Scalability
Overlapping, multi-layered annotation
[Figure: two example sentences shown with their overlapping annotation layers. "I have seen one or two men die , *PRO* bless them ." carries POS tags, a parse tree, and the predicate-argument structures for see.01, die.01, and bless.01. "Another women wrote from Sheffield *PRO*-1 to say that in her 60 years of ringing , `` I have never known a lady to faint in the belfry ." carries named entity tags (GPE, DATE) and the predicate-argument structures for write.01, say.01, and faint.01.]
Representing overlapping, multi-layered annotation
• Flat-file representations of corpora are common
• Easy to inspect and (often) easy to process
• Cannot easily store multiple or nested annotation layers
• Custom I/O required for each format
% less nw/wsj/00/wsj_0089.parse
...
(TOP (S (S (NP-SBJ (PRP I))
           (VP (VBP have)
               (VP (VBN seen)
                   (S (NP-SBJ (QP (CD one)
                                  (CC or)
                                  (CD two))
                              (NNS men))
                      (VP (VB die))))))
        (, ,)
        (S (NP-SBJ (-NONE- *PRO*))
           (VP (VB bless)
               (NP (PRP them))))
        (. .)))
% less nw/wsj/00/wsj_0089.name

...
Another women wrote from Sheffield
to say that in her 60 years of ringing
, ‘‘ I have never known a lady to faint in the belfry .
I have seen one or two men die , bless them .

% less nw/wsj/00/wsj_0089.sense
...
nw/wsj/00/wsj_0089@0089@wsj@nw@en@on 63 1 woman-n ?,? 1
nw/wsj/00/wsj_0089@0089@wsj@nw@en@on 63 2 write-v 1
nw/wsj/00/wsj_0089@0089@wsj@nw@en@on 63 6 say-v ?,? 1
nw/wsj/00/wsj_0089@0089@wsj@nw@en@on 63 13 ring-v 1
nw/wsj/00/wsj_0089@0089@wsj@nw@en@on 63 19 know-v 4
nw/wsj/00/wsj_0089@0089@wsj@nw@en@on 64 2 see-v ?,? 3
nw/wsj/00/wsj_0089@0089@wsj@nw@en@on 64 7 die-v ?,? 1
nw/wsj/00/wsj_0089@0089@wsj@nw@en@on 64 9 bless-v 1
% less nw/wsj/00/wsj_0089.coref

...
Another women wrote from Sheffield
*PRO*-1 to say that in her 60 years
of ringing , ‘‘ I have never known a
lady to faint in the belfry .
I have seen 
one or two men die , *PRO* bless 
them .


% less nw/wsj/00/wsj_0089.prop
...
63 2 gold write-v write.01 ----- 2:0-rel 0:1-ARG0 3:1-ARGM-DIR 5:2-ARGM-PNC
63 7 gold say-v say.01 ----- 7:0-rel 5:0-ARG0 8:1-ARG1
63 20 gold know-v know.01 ----- 20:0-rel 9:1-ARGM-TMP 17:1-ARG0 19:1-ARGM-NEG 21:2-ARG1
63 24 gold faint-v faint.01 ----- 24:0-rel 21:1-ARG0 25:1-ARGM-LOC
64 2 gold see-v see.01 ----- 2:0-rel 0:1-ARG0 3:3-ARG1
64 7 gold die-v die.01 ----- 7:0-rel 3:2-ARG1
64 10 gold bless-v bless.01 ----- 10:0-rel 9:1-ARG0 11:1-ARG1
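The "custom I/O for each format" cost can be made concrete with a minimal Python sketch that reads one line of the .prop excerpt above. It assumes the whitespace-separated layout shown on the slide; the field names (sentence, token, lemma, roleset) are our own labels rather than names taken from the OntoNotes documentation, and every other layer would need a separate parser of this kind.

```python
from typing import NamedTuple


class Proposition(NamedTuple):
    sentence: int   # sentence index within the document (assumed meaning)
    token: int      # index of the predicate token (assumed meaning)
    lemma: str      # e.g. "write-v"
    roleset: str    # e.g. "write.01"
    args: list      # (pointer, label) pairs, e.g. ("0:1", "ARG0")


def parse_prop_line(line: str) -> Proposition:
    """Parse one line in the layout of the .prop excerpt (assumed layout)."""
    fields = line.split()
    sentence, token, _annotator, lemma, roleset, _inflection = fields[:6]
    args = []
    for field in fields[6:]:
        # "3:1-ARGM-DIR": split on the first hyphen only, because the
        # label itself may contain hyphens.
        pointer, _, label = field.partition("-")
        args.append((pointer, label))
    return Proposition(int(sentence), int(token), lemma, roleset, args)


prop = parse_prop_line(
    "63 2 gold write-v write.01 ----- 2:0-rel 0:1-ARG0 3:1-ARGM-DIR 5:2-ARGM-PNC"
)
print(prop.roleset, prop.args)
```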
Representing overlapping, multi-layered annotation
• Traditional structured storage representations, such as databases,
are less common
• Harder to inspect
• May require specialised search tools to extract information
• Document Representation Frameworks (DRFs) aim to provide a
better solution for the storage of corpora
Document Representation Frameworks
• GATE and UIMA are the main DRFs used within the CL community
• Very Java oriented
• Neither works very well in a non-IDE context
• Both are quite large frameworks and have a high entry cost
• Leads researchers to roll their own makeshift DRF instead
• This is obviously not a good thing
• The field lacks a language-agnostic, low entry cost DRF that
does not need an IDE
What is docrep?
• docrep is a programming language agnostic DRF specification
• We have provided implementations in C++, Python, and Java
• Available from https://github.com/schwa-lab (MIT licence)
• Programming language agnostic framework
• Streaming document representation
• Self-describing on the wire; no external files needed
• Well-established binary serialisation format (MessagePack)
• UNIX tool philosophy
docrep has been heavily used
• We have used docrep for a while, but the two COLING papers are
the first papers presenting it
• Primary data format used in our research lab since mid-2012
• Both for research and commercial projects
• All of our tools talk docrep
• Common data format and run-time representation across people,
tools, and programming languages
• No custom I/O code required once in docrep format
Why should you use docrep?
• It looks like the data modelling code you would write anyway
# Assumes the libschwa-python package is installed.
from schwa import dr

class Token(dr.Ann):
    span = dr.Slice()
    raw = dr.Text()

class NamedEntity(dr.Ann):
    span = dr.Slice(Token)
    tag = dr.Text()
    start_offset = dr.Field()
    end_offset = dr.Field()

class Doc(dr.Doc):
    doc_id = dr.Text()
    tokens = dr.Store(Token)
    named_entities = dr.Store(NamedEntity)

# `filename` and `process_doc` are supplied by the caller.
reader = dr.Reader(open(filename, 'rb'), Doc)
for doc in reader:
    process_doc(doc)
Why should you use docrep?
• APIs in C++, Python, and Java
@dr.Ann
public class Token extends AbstractAnn {
  @dr.Field public ByteSlice span;
  @dr.Field public String raw;
}

@dr.Ann
public class NamedEntity extends AbstractAnn {
  @dr.Pointer public Slice<Token> span;
  @dr.Field public String tag;
  @dr.Field public int startOffset;
  @dr.Field public int endOffset;
}

@dr.Doc
public class Doc extends AbstractDoc {
  @dr.Field public String docId;
  @dr.Store public Store<Token> tokens;
  @dr.Store public Store<NamedEntity> namedEntities;
}
Why should you use docrep?
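• No XML – use native type descriptors
<typeDescription>
  <name>ontonotes5.to_uima.types.NamedEntity</name>
  <description/>
  <supertypeName>uima.tcas.Annotation</supertypeName>
  <features>
    <featureDescription>
      <name>tag</name>
      <rangeTypeName>uima.cas.String</rangeTypeName>
    </featureDescription>
    <featureDescription>
      <name>startOffset</name>
      <rangeTypeName>uima.cas.Integer</rangeTypeName>
    </featureDescription>
    <featureDescription>
      <name>endOffset</name>
      <rangeTypeName>uima.cas.Integer</rangeTypeName>
    </featureDescription>
  </features>
</typeDescription>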
Why should you use docrep?
• Command-line friendly – no IDE required
% dr grep 'doc.lang == "en" && len(doc.tokens) < 100' corpus.dr | \
    dr count -s named_entities | sort -rn | head -n 1
32
• docrep includes a rich set of UNIX tool style commands for very
common operations
• ⇒ no custom code required for performing common tasks
Why should you use docrep?
• As compact and efficient as what Google uses (Protocol Buffers)
• Converting the CoNLL 2003 NER corpus to docrep using different
commonly used binary serialisation protocols:
                   Self-        Uncompressed          Compressed
                   describing   Time (s)  Size (MB)   Time (s)  Size (MB)
Original data      –            –         31.30       1.0       5.95
BSON               yes          2.5       188.42      5.3       30.32
MessagePack        yes          1.6       52.15       3.2       16.61
Protocol Buffers   no           1.4       51.51       3.5       18.52
Thrift             no           1.0       126.12      3.5       20.64
Why should you use docrep?
• Lazy loading API
• Only the stores and fields specified at runtime will be deserialised
• Read-only and non-deserialised data streamed as a memcpy
• Very fast modifications to a subset of the document schema
• E.g. adding a new field or annotation layer
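The lazy loading idea can be illustrated with a toy Python sketch. This is not docrep's implementation: JSON stands in for MessagePack, and the `LazyDoc` class and its method names are hypothetical. The point it shows is that stores which are never requested stay as raw bytes and are copied through unchanged when the document is written back out.

```python
import json


class LazyDoc:
    """Toy document whose stores stay as raw bytes until first accessed."""

    def __init__(self, raw_stores):
        self._raw = dict(raw_stores)   # store name -> serialised bytes
        self._decoded = {}             # store name -> decoded Python list

    def store(self, name):
        # Decode on first access only; later calls reuse the decoded list.
        if name not in self._decoded:
            self._decoded[name] = json.loads(self._raw[name].decode("utf-8"))
        return self._decoded[name]

    def serialise(self):
        # Re-encode only the stores that were touched; pass the rest through.
        out = {}
        for name, raw in self._raw.items():
            if name in self._decoded:
                out[name] = json.dumps(self._decoded[name]).encode("utf-8")
            else:
                out[name] = raw
        return out


raw = {
    "tokens": json.dumps(["I", "have", "seen"]).encode("utf-8"),
    "named_entities": json.dumps([]).encode("utf-8"),
}
doc = LazyDoc(raw)
doc.store("named_entities").append({"tag": "GPE"})  # only this store is decoded
written = doc.serialise()
```

Here the `tokens` bytes in `written` are the very same object that was read in, which is the toy analogue of streaming untouched data through as a copy.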
How is docrep different to other DRFs?
• Streaming document representation
• Self-describing on the wire; no external files needed
• UNIX tool philosophy
• Can formally represent annotations over sequential spans of other
annotation instances
• Cannot formally represent cross-document information due to
the streaming model
Slices over bytes and annotation instances
• The Slice field type represents an 〈index, length〉 integer pair
into some ordered list of data
• Slices over the original document byte array
• Slices over a Store of annotation instances
class Token(dr.Ann):
    span = dr.Slice()  # Slice over bytes.
    raw = dr.Text()

class NamedEntity(dr.Ann):
    span = dr.Slice(Token)  # Slice over objects in the `tokens` store.
    tag = dr.Text()
    start_offset = dr.Field()
    end_offset = dr.Field()

class Doc(dr.Doc):
    doc_id = dr.Text()
    tokens = dr.Store(Token)
    named_entities = dr.Store(NamedEntity)
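The two kinds of slice can be shown without the library. The following plain-Python sketch uses tuples for the 〈index, length〉 pairs described above; the `resolve` helper, the dictionary layout, and the choice of example entity are illustrative assumptions, not docrep's API.

```python
text = b"Another women wrote from Sheffield"

# Byte slices: (index, length) pairs into the document bytes.
token_spans = [(0, 7), (8, 5), (14, 5), (20, 4), (25, 9)]

# Annotation slice: (index, length) into the list of tokens, here a
# single-token entity covering "Sheffield".
entity = {"span": (4, 1), "tag": "GPE"}


def resolve(span, sequence):
    """Return the sub-sequence an (index, length) pair refers to."""
    index, length = span
    return sequence[index:index + length]


tokens = [resolve(span, text).decode("utf-8") for span in token_spans]
entity_tokens = resolve(entity["span"], tokens)
print(tokens)
print(entity_tokens)
```

The same `resolve` function serves both cases, which is why a single Slice type can cover spans over bytes and spans over annotation instances.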
What are docrep annotations able to model?
• Most kinds of data can be stored as attributes of an annotation
• Primitive data types
• Byte and Unicode strings
• Pointers to other Ann instances on the document
• Lists of pointers to other Ann instances
• Byte or annotation slices
class ParseNode(dr.Ann):
    tag = dr.Text()
    token = dr.Pointer(Token)
    parent = dr.SelfPointer()
    children = dr.SelfPointers()
    score = dr.Field()
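One way to picture pointers and self-pointers is as integer indexes into flat lists of annotations. The sketch below is plain Python under that assumption; the `Node` dataclass and `leaves` helper are hypothetical and say nothing definitive about how docrep stores pointers internally.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    tag: str
    token: Optional[int] = None     # index into the token list, or None
    parent: Optional[int] = None    # index into the node list, or None
    children: List[int] = field(default_factory=list)


tokens = ["bless", "them"]

# (VP (VB bless) (NP (PRP them))) stored as a flat list of nodes.
nodes = [
    Node("VP", children=[1, 2]),
    Node("VB", token=0, parent=0),
    Node("NP", parent=0, children=[3]),
    Node("PRP", token=1, parent=2),
]


def leaves(index):
    """Yield the tokens under the node at `index`, left to right."""
    node = nodes[index]
    if node.token is not None:
        yield tokens[node.token]
    for child in node.children:
        yield from leaves(child)


print(list(leaves(0)))
```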
Converting a multi-layered corpus to docrep
• The OntoNotes 5 corpus is a good candidate for comparing the
effectiveness of DRFs
• 15 710 documents; 13 109 English, 2002 Chinese, and 599 Arabic
• 8 overlapping annotation layers
The conversion process
• We converted OntoNotes 5 to docrep and UIMA
• Performed the same conversion process using all available APIs
• UIMA: Java and C++
• docrep: Java, C++, and Python
• Empirical measurements:
• Conversion time
• Serialisation time
• Size of serialised corpus
• Non-empirical measurements:
• Ease of installation
• Ease of use as a developer
How does docrep compare empirically?
• docrep is up to 34 times faster than UIMA
• docrep is up to 9 times more compact than UIMA
UIMA                     Java   Java   Java   Java   C++    C++    C++
                         xmi    xcas   bin    cbin   xmi    xcas   bin
Conversion time (s)      25     25     25     25     77     77     77
Serialisation time (s)   131    122    2103   76     630    611    695
Size on disk (MB)        1894   3252   1257   99     2141   3252   2135

docrep                   Java   C++    Python
Conversion time (s)      12     12     27
Serialisation time (s)   61     23     32
Size on disk (MB)        371    371    371
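The headline ratios can be checked against the table. The slides do not say which cells were compared; the sketch below shows one reading that reproduces both figures, taking the slowest UIMA Java serialisation against docrep's Java serialisation, and the largest UIMA output against docrep's size.

```python
# Values copied from the table above (seconds and MB).
uima_java_serialisation = {"xmi": 131, "xcas": 122, "bin": 2103, "cbin": 76}
uima_java_sizes = {"xmi": 1894, "xcas": 3252, "bin": 1257, "cbin": 99}
docrep_java_serialisation = 61
docrep_size = 371

# Assumed comparison: worst UIMA case divided by the docrep figure.
speedup = max(uima_java_serialisation.values()) / docrep_java_serialisation
compaction = max(uima_java_sizes.values()) / docrep_size
print(f"{speedup:.1f}x faster, {compaction:.1f}x smaller")
```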
How does docrep compare as a Java developer?
• UIMA:
• Download and install Eclipse
• Install the special UIMA Eclipse plugins
• Use Eclipse plugins to generate UIMA XML type files
• Use Eclipse plugins to convert UIMA XML files into Java source files
• Start using your types in Java code
• docrep:
• % mvn install libschwa-java
• Start writing and using your types in Java code
• Supports Java ≥ 1.6
How does docrep compare as a Python developer?
• UIMA:
• Doesn’t exist
• docrep:
• % pip install libschwa-python
• Start writing and using your types in Python code
• Supports Python 2.7 and ≥ 3.3
How does docrep compare as a C++ developer?
• UIMA:
• This was a very unpleasant process
• Download, compile, install specific versions of multiple very large
third-party libraries
• Apache Portable Runtime, ICU, Apache Xerces, JNI
• Discover there is little or out-of-date documentation
• Read the source code to try and work out how to use it
• docrep:
• OS X: % brew install libschwa
• Other: % ./configure && make && make install
• Start writing and using your types in C++ code
• Written in C++11
Need for common manipulation tools
• Corpus linguistics often involves repeated tasks
• Filtering
• Counting
• Sorting
• Manipulating
• etc.
• These operations are harder with a binary format
• Need a set of useful manipulation tools
• Ideally, that the user already knows how to use
Command-line utilities for various tasks
• For people who use UNIX tools, existing OIAFs aren't idiomatic
• docrep is a streaming format ⇒ UNIX tool friendly
• Document streams sent between processes using UNIX pipes
• We provide command-line tools that mimic their UNIX equivalents
• Python API makes common operations very easy to implement
• These tools make these common operations first-class citizens
• ⇒ No custom code required
UNIX tools for manipulating corpora
Command     Description              Similar in UNIX
dr grep     Select documents         grep
dr head     Select prefix            head
dr tail     Select suffix            tail
dr less     View raw annotations     less
dr count    Aggregate                wc
dr sort     Reorder documents        sort
dr sample   Random documents         shuf -n
dr split    Partition                split
dr format   Excerpt                  printf / awk
dr shell    Interactive exploration  python
Using the command-line tools
• Working example of the WSJ OntoNotes 5 files
% ls wsj_0001.*
wsj_0001.name wsj_0001.onf wsj_0001.parse
wsj_0001.prop wsj_0001.sense
• Each annotation layer is stored in a separate file
• Each file has a completely different format
• wsj_0001.dr is the equivalent docrep file
Viewing annotation layers
• How do we inspect the different annotation layers?
• Plain text:
% less wsj_0001.name
• docrep:
% dr less wsj_0001.dr
• docrep file contains all of the different annotation layers
Counting annotation instances
• How many of each annotation instance are there in wsj_0001?
• Plain text: different process for each format
% egrep -o '([^() ]+ [^()]+)' wsj_0001.parse | wc -l
31
% egrep -o '<[A-Z]+ TYPE="[^"]*">[^<]+' wsj_0001.name | wc -l
6
• Error prone, especially when dealing with SGML or XML
• docrep:
% dr count -a wsj_0001.dr
ndocs  ...  named_entities  parse_nodes  ...  tokens
1      ...  6               53           ...  31
Combining annotations from different documents
• How do you combine the named entity annotations for all documents
to send to a third-party tool you are evaluating?
• Plain text: not easily, since they're in SGML/XML
• Requires a specific script to combine the SGML/XML together
• docrep:
% cat *.dr | the-external-tool
• docrep files can simply be concatenated together
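Why concatenation works for a streaming format can be shown with a toy example. The framing below (a 4-byte length prefix followed by a JSON payload) is an assumption made for illustration and is not docrep's MessagePack wire format; `write_doc` and `read_docs` are hypothetical helpers. Because each document is self-contained on the wire, joining two streams end to end yields another valid stream.

```python
import io
import json
import struct


def write_doc(stream, doc):
    """Write one document as a 4-byte big-endian length plus a JSON payload."""
    payload = json.dumps(doc).encode("utf-8")
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)


def read_docs(stream):
    """Yield documents one at a time until the stream is exhausted."""
    while True:
        header = stream.read(4)
        if not header:
            return
        (length,) = struct.unpack(">I", header)
        yield json.loads(stream.read(length).decode("utf-8"))


first, second = io.BytesIO(), io.BytesIO()
write_doc(first, {"doc_id": "wsj_0001"})
write_doc(second, {"doc_id": "wsj_0002"})

# The in-memory equivalent of `cat first.dr second.dr`.
combined = io.BytesIO(first.getvalue() + second.getvalue())
doc_ids = [doc["doc_id"] for doc in read_docs(combined)]
print(doc_ids)
```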
Partitioning corpora
• Split your corpus into 10 folds for cross-validation
• Plain text: custom script; very format dependent
• docrep:
# Round-robin split the corpus into 10 folds.
% cat *.dr | dr split k 10
# Or add some random shuffling into the mix.
% cat *.dr | dr sort random | dr split k 10
# Or split by an attribute on the document.
% cat *.dr | dr split -t '{key}.dr' py doc.genre
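For comparison, this is roughly the custom code a round-robin split would otherwise require. It is a plain-Python sketch over a list of document names; the exact order in which `dr split k` assigns documents to folds is not stated in the slides, so the `i mod k` rule here is an assumption.

```python
import random


def round_robin_split(docs, k):
    """Assign document i to fold i mod k."""
    folds = [[] for _ in range(k)]
    for i, doc in enumerate(docs):
        folds[i % k].append(doc)
    return folds


docs = [f"doc_{i:02d}" for i in range(25)]
folds = round_robin_split(docs, 10)

# Shuffling first corresponds to the `dr sort random | dr split k 10` variant.
shuffled = docs[:]
random.Random(0).shuffle(shuffled)
shuffled_folds = round_robin_split(shuffled, 10)
print([len(fold) for fold in folds])
```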
An interactive Python shell for exploration
• Interpreted language can harness the self-describing
representation
• dr shell for exploring docrep streams interactively
• Reads each document in sequence
• Inspect objects using dir or tab-completion
Interactive example
A paradigm for rapid development
• These tools grew organically as we needed them
• Facilitates rapid prototyping
• Facilitates debugging and quality assurance
Different approaches for different people
% dr grep 'doc.id ~ /x-\d+/' corpus.dr | \
    dr count -s tokens | sort -rn | head -n 1
Conclusion
• We want people to use DRFs
• So let’s make it easy for them to do so
• Programming-language agnostic
• Low cost of entry
• No infrastructure or IDE requirements
• Rich set of UNIX tool inspired command-line tools
• Makes docrep easier to use than plain-text files
COLING and Acknowledgements
• OIAF4HLT workshop talk: Saturday, 14:00
• Main COLING talk: Monday, 16:35
• Special thanks to Will Radford and Joel Nothman for their
contributions to docrep over the past years
• This work was supported by ARC Discovery grant DP1097291
and the Capital Markets CRC Computable News project