Word clustering and disambiguation based on co-occurrence data

Authors:
Hang Li
Affiliations:
Theory NEC Laboratory, Real World Computing Partnership, c/o Internet Systems Research Laboratories, NEC Corporation, 4-1-1 Miyazaki, Miyamae-ku, Kawasaki 216-8555, Japan; e-mail: hangli@mi ...
Venue:
Natural Language Engineering
Year:
2002

Citing 25

A Learning Criterion for Stochastic Rules. Machine Learning - Computational learning theory
Class-based n-gram models of natural language. Computational Linguistics
Selection and information: a class-based approach to lexical relationships
Training and scaling preference functions for disambiguation. Computational Linguistics
Improving statistical language model performance with automatically generated word hierarchies. Computational Linguistics
Explorations in Automatic Thesaurus Discovery
Stochastic Complexity in Statistical Inquiry Theory
Inducing Probabilistic Grammars by Bayesian Model Merging. ICGI '94 Proceedings of the Second International Colloquium on Grammatical Inference and Applications
Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics - Special issue on using large corpora: II
Generalizing case frames using a thesaurus and the MDL principle. Computational Linguistics
Automatic learning for semantic collocation. ANLC '92 Proceedings of the third conference on Applied natural language processing
Word clustering and disambiguation based on co-occurrence data. COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 2
Automatic retrieval and clustering of similar words. COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 2
Statistical models for unsupervised prepositional phrase attachment. COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 2
Structural ambiguity and lexical relations. ACL '91 Proceedings of the 29th annual meeting on Association for Computational Linguistics
Distributional clustering of English words. ACL '93 Proceedings of the 31st annual meeting on Association for Computational Linguistics
Similarity-based estimation of word cooccurrence probabilities. ACL '94 Proceedings of the 32nd annual meeting on Association for Computational Linguistics
Noun classification from predicate-argument structures. ACL '90 Proceedings of the 28th annual meeting on Association for Computational Linguistics
Generalizing automatically generated selectional patterns. COLING '94 Proceedings of the 15th conference on Computational linguistics - Volume 2
A rule-based approach to prepositional phrase attachment disambiguation. COLING '94 Proceedings of the 15th conference on Computational linguistics - Volume 2
Clustering words with the MDL principle. COLING '96 Proceedings of the 16th conference on Computational linguistics - Volume 1
Inducing a semantically annotated lexicon via EM-based clustering. ACL '99 Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics
An unsupervised approach to prepositional phrase attachment using contextually similar words. ACL '00 Proceedings of the 38th Annual Meeting on Association for Computational Linguistics
A maximum entropy model for prepositional phrase attachment. HLT '94 Proceedings of the workshop on Human Language Technology
Automatic thesaurus construction based on grammatical relations. IJCAI'95 Proceedings of the 14th international joint conference on Artificial intelligence - Volume 2

Cited 10

Information retrieval based on context distance and morphology. Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Co-occurrence Retrieval: A Flexible Framework for Lexical Distributional Similarity. Computational Linguistics
An efficient clustering algorithm for class-based language models. CONLL '03 Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4
A generalized framework for revealing analogous themes across related topics. HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
Word Clustering for Collocation-Based Word Sense Disambiguation. CICLing '07 Proceedings of the 8th International Conference on Computational Linguistics and Intelligent Text Processing
A word clustering approach for language model-based sentence retrieval in question answering systems. Proceedings of the 18th ACM conference on Information and knowledge management
Spectral clustering for Chinese word. FSKD'09 Proceedings of the 6th international conference on Fuzzy systems and knowledge discovery - Volume 1
Long distance bigram models applied to word clustering. Pattern Recognition
A nearest-neighbor method for resolving PP-Attachment ambiguity. IJCNLP'04 Proceedings of the First international joint conference on Natural Language Processing
A novel neighborhood based document smoothing model for information retrieval. Information Retrieval

Abstract

We address the problem of clustering words (or constructing a thesaurus) based on co-occurrence data, and conducting syntactic disambiguation by using the acquired word classes. We view the clustering problem as that of estimating a class-based probability distribution specifying the joint probabilities of word pairs. We propose an efficient algorithm based on the Minimum Description Length (MDL) principle for estimating such a probability model. Our clustering method is a natural extension of that proposed in Brown, Della Pietra, deSouza, Lai and Mercer (1992). We next propose a syntactic disambiguation method which combines the use of automatically constructed word classes and that of a hand-made thesaurus. The overall disambiguation accuracy achieved by our method is 88.2%, which compares favorably against the accuracies obtained by the state-of-the-art disambiguation methods.
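
The abstract does not spell out the model or the algorithm, so the following is only a minimal sketch of how an MDL score for a class-based joint distribution over word pairs might be computed. It assumes a hard clustering (each word belongs to exactly one class), the factorization P(n, v) = P(C_n, C_v) P(n | C_n) P(v | C_v), maximum-likelihood parameter estimates, and the usual (k/2) log N cost for k free parameters. The function name `description_length`, the toy data, and the class labels are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def description_length(pairs, noun_class, verb_class):
    """Two-part code length in bits: data cost plus parameter cost.

    pairs      -- list of (noun, verb) co-occurrence tokens
    noun_class -- dict mapping each noun to its class label (hard clustering)
    verb_class -- dict mapping each verb to its class label (hard clustering)
    """
    n = len(pairs)
    noun_counts = Counter(x for x, _ in pairs)
    verb_counts = Counter(y for _, y in pairs)
    noun_class_counts = Counter(noun_class[x] for x, _ in pairs)
    verb_class_counts = Counter(verb_class[y] for _, y in pairs)
    class_pair_counts = Counter(
        (noun_class[x], verb_class[y]) for x, y in pairs
    )

    # Data cost: negative log-likelihood under the assumed factorization
    # P(n, v) = P(Cn, Cv) * P(n | Cn) * P(v | Cv), with ML estimates.
    data_bits = 0.0
    for x, y in pairs:
        cx, cy = noun_class[x], verb_class[y]
        p = (
            (class_pair_counts[(cx, cy)] / n)
            * (noun_counts[x] / noun_class_counts[cx])
            * (verb_counts[y] / verb_class_counts[cy])
        )
        data_bits -= math.log2(p)

    # Model cost: (k / 2) * log2(n), where k counts free parameters
    # (assumption: joint class table plus within-class word probabilities).
    kn = len(set(noun_class.values()))
    kv = len(set(verb_class.values()))
    k = (kn * kv - 1) + (len(noun_class) - kn) + (len(verb_class) - kv)
    model_bits = 0.5 * k * math.log2(n)

    return data_bits + model_bits


# Toy comparison: every noun in its own class vs. two merged noun classes.
pairs = [
    ("wine", "drink"), ("beer", "drink"), ("wine", "drink"),
    ("bread", "eat"), ("rice", "eat"), ("bread", "eat"),
]
verbs = {"drink": "drink", "eat": "eat"}
singletons = {w: w for w in ("wine", "beer", "bread", "rice")}
merged = {"wine": "BEVERAGE", "beer": "BEVERAGE",
          "bread": "FOOD", "rice": "FOOD"}

print(description_length(pairs, singletons, verbs))
print(description_length(pairs, merged, verbs))
```

On this toy data the merged clustering fits the pairs exactly as well as the singleton clustering but needs fewer parameters, so it receives the shorter total description length. An MDL-based clustering procedure would prefer it for that reason; how the paper searches over clusterings efficiently is not described in the abstract and is not modeled here.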