Using gazetteers in discriminative information extraction

Authors:
Andrew Smith;Miles Osborne
Affiliations:
University of Edinburgh, United Kingdom;University of Edinburgh, United Kingdom
Venue:
CoNLL-X '06 Proceedings of the Tenth Conference on Computational Natural Language Learning
Year:
2006

Citing 7
Cited 4

Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Shallow parsing with conditional random fields

NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
Introduction to the CoNLL-2003 shared task: language-independent named entity recognition

CONLL '03 Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4
Language independent NER using a maximum entropy tagger

CONLL '03 Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4
Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons

CONLL '03 Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4
Logarithmic opinion pools for conditional random fields

ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
Reducing weight undertraining in structured discriminative learning

HLT-NAACL '06 Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics

Reducing weight undertraining in structured discriminative learning

HLT-NAACL '06 Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics
Partitioned logistic regression for spam filtering

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Automatic wrapper induction from hidden-web sources with domain knowledge

Proceedings of the 10th ACM workshop on Web information and data management
CCG supertags in factored statistical machine translation

StatMT '07 Proceedings of the Second Workshop on Statistical Machine Translation

Quantified Score

Hi-index	0.00

Visualization

Abstract

Much work on information extraction has successfully used gazetteers to recognise uncommon entities that cannot be reliably identified from local context alone. Approaches to such tasks often involve the use of maximum entropy-style models, where gazetteers usually appear as highly informative features in the model. Although such features can improve model accuracy, they can also introduce hidden negative effects. In this paper we describe and analyse these effects and suggest ways in which they may be overcome. In particular, we show that by quarantining gazetteer features and training them in a separate model, then decoding using a logarithmic opinion pool (Smith et al., 2005), we may achieve much higher accuracy. Finally, we suggest ways in which other features with gazetteer feature-like behaviour may be identified.