Exploring phrasal context and error correction heuristics in bootstrapping for geographic named entity annotation

Authors:
Seungwoo Lee;Gary Geunbae Lee
Affiliations:
Department of Computer Science and Engineering, Pohang University of Science and Technology, San 31, Hyoja-dong, Nam-gu, Pohang, 790-784, Republic of Korea;Department of Computer Science and Engineering, Pohang University of Science and Technology, San 31, Hyoja-dong, Nam-gu, Pohang, 790-784, Republic of Korea
Venue:
Information Systems
Year:
2007

Abstract

Geographic named entities can be classified into many sub-types that are useful for applications such as information extraction and question answering. In this paper, we present a high-performance bootstrapping algorithm with error correction heuristics and location normalization for the task of annotating geographic named entities with seven sub-types. Location normalization additionally resolves ambiguity among entities that share the same name and sub-type. In the initial stage, we annotate a raw corpus using a large set of seeds that is automatically selected from a gazetteer, so that its quality does not depend on a specific training corpus. From the initial annotation, boundary patterns reflecting phrasal context are learned and applied to the corpus again to obtain new annotations, which then pass through error correction heuristics. As the bootstrapping loop proceeds, the number of annotated instances gradually increases and the learned boundary patterns become richer and more accurate. Through experiments, we explore inter- and intra-phrasal context, which reflects the syntactic constraints of a named entity, and several kinds of heuristic knowledge for correcting annotation errors introduced by incomplete boundary patterns. The experiments show the effect of these strategies on the learning curve. Applied to a newspaper corpus, our bootstrapping approach achieved an F1 score of 89, and the proposed location normalization method achieved 95% accuracy at the instance level.
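
The loop described in the abstract (seed annotation, boundary pattern learning, re-annotation, error correction) can be illustrated with a toy sketch. This is not the authors' implementation. Every function name, the one-word left/right context used as a "boundary pattern", the thresholds, the capitalization filter, and the consistency check used as an error correction heuristic are assumptions chosen only to make the control flow concrete. Location normalization is not modeled.

```python
from collections import Counter, defaultdict


def context(tokens, i):
    """Hypothetical boundary pattern: the words just left and right of token i."""
    left = tokens[i - 1] if i > 0 else "<s>"
    right = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return (left, right)


def annotate_with_seeds(sentences, seeds):
    """Initial stage: label every token found in the seed gazetteer."""
    return {
        (s, t): seeds[tok]
        for s, tokens in enumerate(sentences)
        for t, tok in enumerate(tokens)
        if tok in seeds
    }


def learn_boundary_patterns(sentences, annotations, min_count=2, min_precision=0.8):
    """Keep contexts that co-occur often and consistently with one sub-type."""
    by_context = defaultdict(Counter)
    for (s, t), label in annotations.items():
        by_context[context(sentences[s], t)][label] += 1
    patterns = {}
    for ctx, counts in by_context.items():
        label, hits = counts.most_common(1)[0]
        if hits >= min_count and hits / sum(counts.values()) >= min_precision:
            patterns[ctx] = label
    return patterns


def apply_patterns(sentences, patterns, annotations):
    """Propose labels for unannotated, capitalized tokens whose context matches."""
    candidates = {}
    for s, tokens in enumerate(sentences):
        for t, tok in enumerate(tokens):
            if (s, t) in annotations or not tok[:1].isupper():
                continue
            label = patterns.get(context(tokens, t))
            if label is not None:
                candidates[(s, t)] = label
    return candidates


def correct_errors(sentences, candidates, annotations):
    """Stand-in heuristic: reject a candidate that contradicts an existing
    label for the same surface form elsewhere in the corpus."""
    known = {}
    for (s, t), label in annotations.items():
        known.setdefault(sentences[s][t], label)
    return {
        (s, t): label
        for (s, t), label in candidates.items()
        if known.get(sentences[s][t], label) == label
    }


def bootstrap(sentences, seeds, max_iters=5):
    """Repeat learn -> apply -> correct until no new annotations are accepted."""
    annotations = annotate_with_seeds(sentences, seeds)
    for _ in range(max_iters):
        patterns = learn_boundary_patterns(sentences, annotations)
        candidates = apply_patterns(sentences, patterns, annotations)
        accepted = correct_errors(sentences, candidates, annotations)
        if not accepted:
            break
        annotations.update(accepted)
    return annotations


# Made-up example data; the sub-type names are illustrative only.
SENTENCES = [
    ["in", "Seoul", "city"],
    ["in", "Busan", "city"],
    ["in", "Pohang", "city"],
    ["near", "Han", "river"],
    ["near", "Nile", "river"],
    ["near", "Amazon", "river"],
]
SEEDS = {"Seoul": "CITY", "Busan": "CITY", "Han": "RIVER", "Nile": "RIVER"}
result = bootstrap(SENTENCES, SEEDS)
print(sorted(result.items()))
```

In this toy corpus, "Pohang" and "Amazon" are not seeds and receive labels only through the learned contexts. A real system would use richer phrasal patterns and multi-token entity spans than this single-word window.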