Semi-automated named entity annotation

Authors:
Kuzman Ganchev;Fernando Pereira;Mark Mandel;Steven Carroll;Peter White
Affiliations:
University of Pennsylvania, Philadelphia, PA;University of Pennsylvania, Philadelphia, PA;University of Pennsylvania, Philadelphia, PA;Children's Hospital of Philadelphia, Philadelphia, PA;Children's Hospital of Philadelphia, Philadelphia, PA
Venue:
LAW '07 Proceedings of the Linguistic Annotation Workshop
Year:
2007

Abstract

We investigate a way to partially automate corpus annotation for named entity recognition by requiring only binary decisions from an annotator. Our approach is based on a linear sequence model trained using a k-best MIRA learning algorithm. We ask an annotator to decide whether each mention produced by a high-recall tagger is a true mention or a false positive. We conclude that our approach can reduce the effort of extending a seed training corpus by up to 58%.
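
The binary-decision step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Mention` class, the `review_candidates` function, and the `oracle` callback are hypothetical names introduced here, the candidate list is hard-coded rather than produced by a k-best MIRA-trained tagger, and the example sentence is invented. The sketch only shows the shape of the workflow, in which every candidate mention receives a single accept/reject decision.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Mention:
    """A candidate entity mention (hypothetical structure, not from the paper)."""
    start: int   # index of the first token (inclusive)
    end: int     # index one past the last token (exclusive)
    label: str   # entity type proposed by the tagger

    def text(self, tokens: Sequence[str]) -> str:
        return " ".join(tokens[self.start:self.end])


def review_candidates(
    tokens: Sequence[str],
    candidates: Sequence[Mention],
    decide: Callable[[Sequence[str], Mention], bool],
) -> Tuple[List[Mention], List[Mention]]:
    """Ask for one binary decision per candidate; split into accepted/rejected."""
    accepted: List[Mention] = []
    rejected: List[Mention] = []
    for mention in candidates:
        (accepted if decide(tokens, mention) else rejected).append(mention)
    return accepted, rejected


# Invented example sentence; a high-recall tagger is assumed to over-generate.
tokens = "Mutations in BRCA1 were found in the Philadelphia cohort".split()
candidates = [
    Mention(2, 3, "GENE"),  # "BRCA1": a true mention
    Mention(7, 8, "GENE"),  # "Philadelphia": a false positive
]

# Stand-in for the human annotator: accept only mentions in a known gold set.
gold = {(2, 3, "GENE")}


def oracle(tokens: Sequence[str], m: Mention) -> bool:
    return (m.start, m.end, m.label) in gold


accepted, rejected = review_candidates(tokens, candidates, oracle)
print("accepted:", [m.text(tokens) for m in accepted])
print("rejected:", [m.text(tokens) for m in rejected])
```

In an interactive setting, `oracle` would be replaced by a prompt shown to the annotator. Because the annotator can only reject candidates, not add new ones, the recall of the candidate list bounds the recall of the final annotations, which is why the abstract specifies a high-recall tagger.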