Focused training sets to reduce noise in NER feature models
Amber McKenzie
Feature and context aggregation play a large role in current NER systems,
allowing significant opportunities for research into optimizing these features
to cater to different domains. This work strives to reduce the noise introduced
into aggregated features from disparate and generic training data in order to
allow for contextual features that more closely model the entities in the
target data. The proposed approach trains models based on only a part of the
training set that is more similar to the target domain. To this end, models are
trained for an existing NER system using the documents from the training set
that are most similar to the target document, in order to demonstrate that this
technique can be applied to improve any pre-built NER system. Initial results
show an improvement over the University of Illinois NE tagger: a weighted
average F1 score of 91.67, compared to that tagger's score of 91.32.
This research serves as a proof-of-concept for future planned work to cluster
the training documents to produce a number of more focused models from a given
training set, thereby reducing noise and extracting a more representative
feature set.
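
The selection step described above can be sketched as follows. This is only an illustrative sketch: the abstract does not state which similarity measure or cutoff is used, so TF-IDF cosine similarity and the parameter `k` are assumptions, and the function names (`tf_idf_vectors`, `cosine`, `select_focused_training_set`) are hypothetical. Training the NER model on the selected documents is left to whichever tagger is being adapted.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build sparse TF-IDF vectors (term -> weight) for tokenized documents."""
    n = len(docs)
    # Document frequency: number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF so that terms present in every document keep some weight.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_focused_training_set(training_docs, target_doc, k):
    """Return indices of the k training documents most similar to the target.

    Assumption: similarity is TF-IDF cosine; the source does not specify it.
    """
    vectors = tf_idf_vectors(training_docs + [target_doc])
    target_vec = vectors[-1]
    scores = [cosine(vec, target_vec) for vec in vectors[:-1]]
    ranked = sorted(range(len(training_docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Toy, made-up documents purely for illustration.
    training = [
        ["stock", "market", "shares"],
        ["football", "match", "goal"],
        ["market", "shares", "profit"],
    ]
    target = ["shares", "market", "profit", "rose"]
    print(select_focused_training_set(training, target, k=2))
```

The returned indices identify the subset of the training set on which a focused model would be trained for that target document; in practice `k` would need to be tuned, since too small a subset may not give the tagger enough examples.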