ELIE is a tool for adaptive information extraction from text. It also provides a number of other text processing tools, e.g. POS tagging, chunking, gazetteer lookup and stemming.
This work is open source and is licensed under a Creative Commons License.
Download version 0.4.1 or view on GitHub
If you want to use the preprocessor, you will need to download and install Brilltag.
If you would like to be kept informed of future releases of this system, or have any comments or queries, send me an email.
References
Finn, A. & Kushmerick, N. (2004). Multi-level Boundary Classification for Information Extraction. In Proc. European Conference on Machine Learning (Pisa).
Finn, A. & Kushmerick, N. (2004). Information Extraction by Convergent Boundary Classification. In Proc. AAAI-04 Workshop on Adaptive Text Extraction and Mining (San Jose).
Using Elie
Elie is a tool for adaptive information extraction. It also provides a number of other text processing tools, e.g. POS tagging, chunking, gazetteer lookup and stemming. It is written in Python.
1.1 Installation
Requirements:
* Python 2.1 or higher
* Java 2 or higher
* Weka (included in distribution)
* Brilltag (if you intend to use datasets other than
those provided)
Unzip the Elie archive. Edit the basedir, BRILLTAGPATH and java variables in the file config.py to match your own system. Add $ELIEHOME/lib/weka.jar to your Java classpath.
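For example, after unzipping you might set the following in config.py (the paths shown are illustrative; use the locations on your own system):

basedir = '/home/me/Elie'
BRILLTAGPATH = '/usr/Brilltag/Bin_and_Data/tagger'
java = 'java -mx1900000000 '

and, in a Bourne-style shell, add weka.jar to your classpath with:

export CLASSPATH=$CLASSPATH:$ELIEHOME/lib/weka.jar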
1.2 Usage
Elie contains the following executable files:
evaluation.py - the main way to run Elie.
scorer.py - calculates performance measures from Elie logs.
extractor.py - performs basic learning and extraction.
preprocessCorpus.py - preprocesses a corpus of text files.
tagging.py - performs POS tagging, chunking etc. on a text file.
You can execute these files without any arguments to
get usage information.
1.2.1 Input Format
Documents should be stored in text files, one document per file. Fields should be marked using the syntax <field> ... </field>.
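For example, a seminar announcement using the fields that appear later in this manual might be annotated like this (the text itself is illustrative):

The talk by <speaker>Jane Smith</speaker> starts at <stime>3:00 pm</stime> and ends at <etime>4:00 pm</etime> in <location>Room 123</location>.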
1.2.2 Preprocessing
This stage adds tokenization, orthographic, POS,
chunking and gazetteer information to the input files
and stores it using Elie's own format. This stage only
needs to be done once for each document collection.
Running

preprocessCorpus.py datasetDirectory

will create a new directory called datasetDirectory.preprocessed, which contains all the files in Elie's internal format. Note that the input files shouldn't contain any unusual control characters, and for every <field> there must be a corresponding </field>.
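As a quick sanity check before preprocessing, a short script along the following lines (illustrative, not part of Elie) can verify that every opening tag in a corpus has a matching closing tag:

import re
import sys

# Illustrative helper, not part of Elie: report any document where the
# number of <field> tags differs from the number of </field> tags.
def check_tags(path, fields):
    text = open(path).read()
    for field in fields:
        opens = len(re.findall('<%s>' % field, text))
        closes = len(re.findall('</%s>' % field, text))
        if opens != closes:
            print('%s: %d <%s> tags but %d </%s> tags'
                  % (path, opens, field, closes, field))

for path in sys.argv[1:]:
    check_tags(path, ['speaker', 'stime', 'etime', 'location'])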
1.2.3 Running Elie
The recommended way to run Elie is using the file
evaluation.py. It takes the following parameters.
-f field
A list of the fields to be extracted, surrounded by quotes, e.g. "speaker stime etime location".
-t trainCorpusDirectory
The directory that contains the pre-processed corpus.
-D dataDirectory
The directory to save Elie's output and temporary files in.
[-T testCorpusDirectory]
Optionally specify a directory that contains the
pre-processed test corpus. If no test corpus is
specified Elie will do a random split of the training
corpus.
[-s splitfilebase]
Specify a set of pre-defined splits for the training data.
If -t and -T are set, then Elie will train on
trainCorpusDirectory and test on testCorpusDirectory.
Otherwise it will do repeated random splits on
trainCorpusDirectory. Other options include:
-p train_proportion
For a random split experiment set the proportion of the
data to use for training. The default value is 0.5.
-n number_of_trials
For a random split experiment set the number of trials.
The default value is 10.
-v version info
-h help
The corpora directories should contain only preprocessed files, i.e. those created by preprocessCorpus.py. The dataDirectory is where Elie will store all its intermediate and output files. The splitfilebase argument can be used for predefined splits.
1.3 Output
The detail of Elie's printed output is controlled using
the parameter config.verbosity.
Elie produces several logfiles that can be used by the bwi-scorer or Elie's own scorer (scorer.py). The logfile names have the form name.field.elie.number.level.log. Split file names have the form elie.field.number.split. Each split file lists the name of each training file, one per line, followed by a separator, followed by the name of each test file, one per line.
These are located in the specified dataDirectory. For a
random split experiment Elie will produce a split file
for each iteration. Each split file lists the files
used for training and testing. To use pre-defined
splits, pass the base of the splitfiles using the -s option.
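For example, the scoring commands in section 1.5 read L1 logs matching elie.speaker.*.elie.L1.log, so a concrete log name would be elie.speaker.3.elie.L1.log, with elie.speaker.3.split as the corresponding split file for that trial.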
1.4 Configuration
The file config.py contains all the configuration options. In this section we describe these parameters and their default values.
basedir = '/home/aidan/IE/Elie5'
This is the full path to the directory where Elie is installed.
BRILLTAGPATH='/usr/Brilltag/Bin_and_Data/tagger'
This is the full path to the Brilltag tagger binary.
verbosity = 2
This controls the level of output that Elie produces. Higher numbers produce more output. It takes values 0 to 5.
java = 'java -mx1900000000 -oss1900000000 '
This is the command used to invoke the Java runtime. You can add any Java parameters here. It is a good idea to allocate plenty of memory to the Java interpreter.
use_psyco = 0
This can have values 0 or 1. Psyco is a just-in-time compiler that speeds up Python programs. Enabling Psyco will make Elie run faster but use considerably more memory. On large experiments this doesn't give much improvement, as most of the time is spent inside Weka.
learner = 'SMO'
This setting controls which learning algorithm is used. SMO is the default. Available options are: 'knn', 'm5', 'kstar', 'hyper', 'm5rules', 'j48', 'OneR', 'neural', 'winnow', 'LMT', 'jrip', 'SMO', 'prism', 'PART', 'ridor', 'bayes'.
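For example, to train with a decision tree instead of the default support vector machine, you would edit config.py and set (assuming, as the name suggests, that 'j48' selects Weka's J48 implementation of C4.5):

learner = 'j48'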
The punctuation, symbols, lbrackets, rbrackets, quotes,
longword, usable_tags, reserved_characters and
special_tokens parameters are constants that control
the behavior of the tokenizer and preprocessor. In
general they shouldn't be changed.
The following options control Elie's behavior. These are the only options that the user should need to adjust after installation.
window = 4
This controls the window size: relation information is encoded for this many tokens before and after the current token.
m_window = 10
This controls the length of the L2 window: how many instances before an end and after a start are used for training.
stem = 0
suffix = 0
These control whether token stems and token suffixes are used as features.
token = 1
pos = 1
types = 1
gaz = 1
chunk = 1
erc = 1
These control which feature sets to use. Set a value to 0 to disable the corresponding features.
filter_n_attributes = 5000
This controls how many attributes are used for learning. Elie keeps the top n attributes as ranked by Information Gain.
filter_threshold = 0
We can set a threshold here for attribute filtering.
E.g. setting this to 0.1 would mean that we use the top
10% of attributes as ranked by Information Gain.
undersample = 0
This controls whether to use random undersampling of instances. Setting it to 0.8 would randomly delete 80% of the negative instances.
prune_instances = 0
This controls whether to prune uninformative instances.
Setting it to 80 would prune 80% of the instances as
ranked by the informativeness of the word token.
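Putting the above together, the user-tunable portion of config.py with the default values from this section would look like the following (basedir and BRILLTAGPATH are illustrative and must match your installation):

basedir = '/home/me/Elie'
BRILLTAGPATH = '/usr/Brilltag/Bin_and_Data/tagger'
verbosity = 2
java = 'java -mx1900000000 -oss1900000000 '
use_psyco = 0
learner = 'SMO'
window = 4
m_window = 10
stem = 0
suffix = 0
token = 1
pos = 1
types = 1
gaz = 1
chunk = 1
erc = 1
filter_n_attributes = 5000
filter_threshold = 0
undersample = 0
prune_instances = 0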
1.5 Examples
Elie takes input documents in its own format, which adds the gazetteer, POS, orthographic and other features. To translate a corpus into this format we use the preprocessCorpus.py command.
preprocessCorpus.py ./train
This creates a new directory called
./train.preprocessed which contains processed versions
of all the files that were in ./train. This only needs
to be done once per corpus.
evaluation.py -t ./train.preprocessed -T ./test.preprocessed -D ./tmp -f "speaker stime etime location"
This command does a single train-test run using the
files in train.preprocessed for training and the files
in test.preprocessed for testing. The log files are
stored in ./tmp. Four fields are extracted: speaker,
stime, etime, location.
evaluation.py -t ./train.preprocessed -D ./results -n 1 -p 0.8 -f "speaker stime etime location"
This does a single random test/train split. The files
in train.preprocessed are randomly assigned to the
train or test set with 80% of them assigned to the
train set and 20% to the test set. The log files and
the split files are stored in ./results.
evaluation.py -t ./train.preprocessed -D ./tmp -s ./tmp/elie.speaker. -f "speaker stime etime location"
In this example we use the -s option to tell Elie to
use predefined train-test splits. The split files
define which files from ./train.preprocessed are
allocated to the train and test sets. The -s option
takes the base of the splitfile name. Splitfile names
end in .split and should be formatted as elie.field.splitnumber.split, so the above example matches all files of the form ./tmp/elie.speaker.*.split.
evaluation.py -t ./train.preprocessed -D ./tmp -s ./tmp/elie.speaker.[1-5] -f "speaker stime etime location"
We can also add regular expressions to the splitfile
base. The above example matches splitfiles where the
base is elie.speaker. and the split number starts with
1, 2, 3, 4 or 5.
After running the above experiment all the log files
will be stored in ./tmp. Once the experiment is
complete we can use scorer.py to examine the
performance. To view the L1 performance we issue the command:
scorer.py ./tmp/elie.speaker.*.elie.L1.log
To view the L2 performance we would use the following command:
scorer.py ./tmp/elie.speaker.*.elie.L2.log