Improved Unsupervised Name Discrimination with Very Wide Bigrams and Automatic Cluster Stopping

Authors:
Ted Pedersen
Affiliations:
University of Minnesota, Duluth, USA MN 55812
Venue:
CICLing '09 Proceedings of the 10th International Conference on Computational Linguistics and Intelligent Text Processing
Year:
2009

Citing 8
Cited 0

Entity-based cross-document coreferencing using the Vector Space Model

COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 1
One sense per discourse

HLT '91 Proceedings of the workshop on Speech and Natural Language
Selecting the "right" number of senses based on clustering criterion functions

EACL '06 Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Posters & Demonstrations
Evaluation of utility of LSA for word sense discrimination

NAACL-Short '06 Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers
The SemEval-2007 WePS evaluation: establishing a benchmark for the web people search task

SemEval '07 Proceedings of the 4th International Workshop on Semantic Evaluations
Significant lexical relationships

AAAI'96 Proceedings of the thirteenth national conference on Artificial intelligence - Volume 1
An unsupervised language independent method of name discrimination using second order co-occurrence features

CICLing'06 Proceedings of the 7th international conference on Computational Linguistics and Intelligent Text Processing
Name discrimination by clustering similar contexts

CICLing'05 Proceedings of the 6th international conference on Computational Linguistics and Intelligent Text Processing

Quantified Score

Hi-index	0.00

Visualization

Abstract

We cast name discrimination as a problem in clustering short contexts. Each occurrence of an ambiguous name is treated independently, and represented using second---order context vectors. We calibrate our approach using a manually annotated collection of five ambiguous names from the Web, and then apply the learned parameter settings to three held-out sets of pseudo-name data that have been reported on in previous publications. We find that significant improvements in the accuracy of name discrimination can be achieved by using very wide bigrams, which are ordered pairs of words with up to 48 intervening words between them. We also show that recent developments in automatic cluster stopping can be used to predict the number of underlying identities without any significant loss of accuracy as compared to previous approaches which have set these values manually.