Lab on BART:
a Beautiful Anaphora Resolution Toolkit
February 2016
A Setup (15-20 mins)
1. Go to website: http://www.bart-anaphora.org/
2. Scroll down to ‘Downloads’ and click on BART-2.0 to start down-
loading
3. Scroll up and read through the main page
4. On completion of the download, double-click on the downloaded file
(/Downloads/bart-2.0.tar.gz)
5. Extract all contents into a working directory (e.g., C:/Users//)
B Web Demo (15-20 mins)
One of the most straightforward ways to run BART is in a web browser.
For that, one first needs to start the web server within a command window,
which is done as follows (on MS Windows):
set CLASSPATH=.;dist/BART.jar;libs/*
java -Xmx1024m elkfed.webdemo.BARTServer
Then open a web browser of your choice and type in
http://localhost:8125/index.jsp. Open a new tab on your browser, go to www.bbc.co.uk,
pick an article of your choice and select and copy the text only. Go back
to BART Demo’s tab, click on ‘Create new document...’, paste the news
article text within the text box and click the button ‘Preprocess’. Finally,
you can explore coreference chains by clicking on the ‘coreference’ tab within
the text box and then clicking on a specific markable to see the coreference
chain highlighted in a reddish colour (also listed in full at the bottom of the text).
Note that a series of processing requests can also be sent to the server from
a client program using REST (http://rest.elkstein.org/).
One such off-the-shelf client is lwp-request, the simple WWW agent that
ships with many Linux distributions (see README).
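For those who would rather write their own client than use lwp-request, the following is a minimal Java sketch of one (Java 11 or later). The host and port are those of the web demo above. The request path (HYPOTHETICAL_PATH), the HTTP method and the content type are assumptions made for illustration only; the actual values should be taken from the README.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class BartRestClient {
    // Host and port are the ones used by the web demo above.
    static final String BASE_URL = "http://localhost:8125";

    // ASSUMPTION: placeholder path, not taken from BART; check the README.
    static final String HYPOTHETICAL_PATH = "/process";

    /** Builds a POST request that carries the raw text as its body. */
    static HttpRequest buildRequest(String path, String text) {
        return HttpRequest.newBuilder(URI.create(BASE_URL + path))
                .header("Content-Type", "text/plain; charset=UTF-8")
                .POST(HttpRequest.BodyPublishers.ofString(text, StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) throws Exception {
        HttpRequest request = buildRequest(HYPOTHETICAL_PATH, "John saw Mary. He waved at her.");
        System.out.println(request.method() + " " + request.uri());

        // Only contact the server when asked to, so the sketch also runs without BART.
        if (args.length > 0 && args[0].equals("--send")) {
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
            System.out.println(response.body());
        }
    }
}
```

Run without arguments, the program only prints the request it would send; with --send (and the BART server running) it also prints the status code and the response body.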
C A more scientific experiment (20+ mins)
It is important to bear in mind that BART is a toolkit that enables Machine-
Learning-based research on coreference, which is currently the most widely
adopted approach to tackle that problem in the scientific community. Hence,
there are two aspects of a typical research experiment with BART that are
worth explaining in more detail: the data (i.e., the corpus) and the features.
C.1 Corpus
In order to do supervised machine learning, one needs labeled data (i.e.,
a corpus), which is then used to train and test new models. Everything
to do with data is configurable in BART and is set in the file
config/config.properties (open and skim through this file).
For the sake of this experiment, let us use the ‘sample’ corpus available
within the release (see ./sample/english-preprocessed/train/*.mmax
and ./sample/english-preprocessed/test/*.mmax – see below for how
to open *.mmax documents).
C.2 Features
A machine learning scheme is defined by a model (i.e., classifier) and a set of
features (i.e., extractors), both of which are specified in the file presets/soonbaseline.xml.
An example of how to specify a classifier is the following:

In the above example ‘type’ refers to the library/toolkit implementing
the ‘learner’, for instance, ‘weka’ is a popular open-source library imple-
menting a number of classifiers.
As for the features, each is set in its own XML tag.
For example,

As an exercise, open the file presets/soonbaseline.xml and comment
out some features (i.e., enclose the tag of each such feature within <!-- and -->).
For instance, a very basic coreference algorithm can be based on
string matches only: comment out all features except ‘FE_StringMatch’
and run a train-test experiment (see next paragraph).
Once the corresponding configuration files have been updated to tell BART
where to find the data and what features and model to use, the following
command runs a full train-and-test experiment:
set CLASSPATH=.;dist/BART.jar;libs/*
java -Xmx1024m -Delkfed.corpus=sample elkfed.main.XMLExperiment presets/soonbaseline.xml
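To get an intuition for what a string-match-only system can and cannot do, here is a small self-contained Java sketch. It is not BART's FE_StringMatch implementation and involves no machine learning: it simply puts mentions with the same normalised string into the same chain. The class name, the normalisation rule (lower-casing and dropping a leading article) and the example mentions are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class StringMatchBaseline {
    /** Lower-cases a mention and drops a leading English article. */
    static String normalize(String mention) {
        String s = mention.trim().toLowerCase(Locale.ROOT);
        return s.replaceFirst("^(the|a|an)\\s+", "");
    }

    /** Groups mention indices into chains: equal normalised strings corefer. */
    static List<List<Integer>> resolve(List<String> mentions) {
        Map<String, List<Integer>> chains = new LinkedHashMap<>();
        for (int i = 0; i < mentions.size(); i++) {
            chains.computeIfAbsent(normalize(mentions.get(i)), k -> new ArrayList<>()).add(i);
        }
        return new ArrayList<>(chains.values());
    }

    public static void main(String[] args) {
        List<String> mentions = List.of("The president", "Mary", "the president", "she");
        System.out.println(resolve(mentions)); // prints [[0, 2], [1], [3]]
    }
}
```

The two occurrences of "the president" end up in one chain, but the pronoun "she" stays on its own. This is the kind of case the other features in presets/soonbaseline.xml are there to handle.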
Finally, it is worth mentioning that BART expects input and produces
output in the MMAX format, which is a type of stand-off XML. There is
a tool to load, visualise and annotate documents in this format:
http://mmax2.sourceforge.net/.
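"Stand-off" means that annotations are stored separately from the text and point to it by token identifier rather than being wrapped around the words. The Java sketch below illustrates the idea. It assumes token identifiers of the form word_N (numbered from 1) and spans written as word_2..word_4, with commas separating discontinuous parts; check these conventions against the sample *.mmax data before relying on them. The class and method names are hypothetical and not part of BART or MMAX2.

```java
import java.util.ArrayList;
import java.util.List;

public class StandoffSpan {
    /** Extracts N from an identifier of the (assumed) form "word_N". */
    static int tokenNumber(String id) {
        String trimmed = id.trim();
        return Integer.parseInt(trimmed.substring(trimmed.indexOf('_') + 1));
    }

    /** Expands a span such as "word_2..word_4,word_7" into token numbers. */
    static List<Integer> expand(String span) {
        List<Integer> numbers = new ArrayList<>();
        for (String part : span.split(",")) {
            String[] ends = part.split("\\.\\.");
            int first = tokenNumber(ends[0]);
            int last = tokenNumber(ends[ends.length - 1]); // same as first for a single token
            for (int n = first; n <= last; n++) {
                numbers.add(n);
            }
        }
        return numbers;
    }

    /** Looks up the surface text that a span points to. */
    static String text(String span, List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        for (int n : expand(span)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(tokens.get(n - 1)); // token numbers start at 1
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("The", "old", "man", "smiled", ".");
        System.out.println(text("word_1..word_3", tokens)); // prints The old man
    }
}
```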
D Advanced Case Study (∼60 mins):
Extending BART to a New Language
Note: This section is optional, as it requires a background in the Java
programming language. If you lack such a background, you are encouraged to team
up with someone who has one and go through the section together. Conversely,
if you feel confident with Java, do look around
for someone who may need a partner to go through this section.
Extending BART to a new language generally involves three things:
1. developing a language plug-in in BART,
2. annotating a corpus for coreferences in the new language (so that
BART can be tested on it) and
3. developing an NLP pipeline capable of producing the minimal input
required by BART (so that unrestricted text in the new language can
be processed)
In this case study we will focus only on the first point, language plug-in,
but keep in mind that for a proper experiment to test our language plug-in
we would also need the latter two. As a side note, preparing and annotating
data (point 2 above) can involve a substantial amount of work and, indeed,
it is an open area of research; see for example
https://anawiki.essex.ac.uk/phrasedetectives/ for one creative attempt at turning the problem
of data annotation into a game.
The first thing to do is to load BART sources into a source code edi-
tor/IDE (e.g., Eclipse). Then have a look at the way the English plug-in is
implemented. There are two relevant Java classes:
elkfed.lang.EnglishLanguagePlugin and
elkfed.lang.EnglishLinguisticConstants
(see source files in src/elkfed/lang/). The latter class is just a long list of
linguistic constants; hence, the same list can be reused for a new language by
changing only the values of those constants. The former class contains logic
about various low-level language-dependent heuristics needed for coreference
resolution. These two classes constitute a language plug-in in BART.
In order to create a new language plug-in, one can make copies of the
aforementioned two classes, rename them to reflect the new language (for in-
stance, SpanishLanguagePlugin and SpanishLinguisticConstants) and
override all constant values accordingly. Once these two classes have been
adapted to the new language, the language plug-in should be added as an
option in the configuration-reading class elkfed.config.ConfigProperties
(file ConfigProperties.java, lines 180 – 196) as follows (see the two lines
between the /****/ comments):
public LanguagePlugin getLanguagePlugin() {
    // The plug-in is created once, on first use, from the corpus property "language".
    if (langPlugin == null) {
        String lang = getCorpusProperty("language", "english").toLowerCase();
        if (lang.startsWith("eng")) {
            langPlugin = new EnglishLanguagePlugin();
        } else if (lang.startsWith("ita")) {
            langPlugin = new ItalianLanguagePlugin();
        } else if (lang.startsWith("deu")) {
            langPlugin = new GermanLanguagePlugin();
        /******* NEW LANGUAGE PLUG-IN REFERENCE ******/
        } else if (lang.startsWith("esp")) {
            langPlugin = new SpanishLanguagePlugin();
        /*********************************************/
        } else if (lang.equals("generic")) {
            langPlugin = new GenericLanguagePlugin();
        } else {
            throw new UnsupportedOperationException("No LanguagePlugin for " + lang);
        }
    }
    return langPlugin;
}
After modifying the class elkfed.config.ConfigProperties, the
new language plug-in can then be referenced in the config files as explained
in the previous section.
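Note that the method above selects a plug-in by the prefix of the configured language value, so the value written in the config files has to start with the prefix chosen in the code. The stand-alone Java sketch below mirrors that logic with plain strings in place of the real plug-in objects, so that it compiles without BART; the class LanguageDispatchSketch and the method pluginFor are hypothetical names used only for this illustration.

```java
import java.util.Locale;

public class LanguageDispatchSketch {
    /** Returns the name of the plug-in class that a configured value would select. */
    static String pluginFor(String configured) {
        String lang = configured.toLowerCase(Locale.ROOT);
        if (lang.startsWith("eng")) return "EnglishLanguagePlugin";
        if (lang.startsWith("ita")) return "ItalianLanguagePlugin";
        if (lang.startsWith("deu")) return "GermanLanguagePlugin";
        if (lang.startsWith("esp")) return "SpanishLanguagePlugin";
        if (lang.equals("generic")) return "GenericLanguagePlugin";
        throw new UnsupportedOperationException("No LanguagePlugin for " + lang);
    }

    public static void main(String[] args) {
        for (String value : new String[] {"English", "esp", "Espanol", "spanish"}) {
            try {
                System.out.println(value + " -> " + pluginFor(value));
            } catch (UnsupportedOperationException e) {
                System.out.println(value + " -> " + e.getMessage());
            }
        }
    }
}
```

With the prefix "esp", the values "esp" and "Espanol" select the Spanish plug-in, whereas "spanish" is rejected. Whichever prefix you pick for your own language, use a matching value in the config files.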
TASK: Your task is to pick a language of your choice for which there
is no language plug-in in BART already (ideally your mother tongue) and
to adapt the classes
elkfed.lang.EnglishLanguagePlugin and
elkfed.lang.EnglishLinguisticConstants
to that language. Once finished, send your code to malexa@essex.ac.uk;
if suitable, it may be included in future BART releases with due
acknowledgement.