Word clustering and disambiguation based on co-occurrence data

Authors:
Hang Li;Naoki Abe
Affiliations:
Theory NEC Laboratory, Real World Computing Partnership, NEC, Kawasaki, Japan;Theory NEC Laboratory, Real World Computing Partnership, NEC, Kawasaki, Japan
Venue:
COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 2
Year:
1998

Citing 7
Cited 23

Class-based n-gram models of natural language. Computational Linguistics.
Stochastic Complexity in Statistical Inquiry Theory.
Generalizing case frames using a thesaurus and the MDL principle. Computational Linguistics.
Distributional clustering of English words. ACL '93 Proceedings of the 31st annual meeting on Association for Computational Linguistics.
A rule-based approach to prepositional phrase attachment disambiguation. COLING '94 Proceedings of the 15th conference on Computational linguistics - Volume 2.
Clustering words with the MDL principle. COLING '96 Proceedings of the 16th conference on Computational linguistics - Volume 1.
Automatic thesaurus construction based on grammatical relations. IJCAI'95 Proceedings of the 14th international joint conference on Artificial intelligence - Volume 2.

Abstract

We address the problem of clustering words (or constructing a thesaurus) based on co-occurrence data, and of using the acquired word classes to improve the accuracy of syntactic disambiguation. We view this problem as that of estimating a joint probability distribution over word pairs, such as noun-verb pairs. We propose an efficient algorithm based on the Minimum Description Length (MDL) principle for estimating such a distribution. Our method is a natural extension of those proposed by Brown et al. (1992) and Li and Abe (1996), and overcomes their drawbacks while retaining their advantages. We then combine this clustering method with the disambiguation method of Li and Abe (1995) to derive a disambiguation method that makes use of both automatically constructed thesauruses and a hand-made thesaurus. The overall disambiguation accuracy achieved by our method is 85.2%, which compares favorably with the accuracy (82.4%) obtained by the state-of-the-art disambiguation method of Brill and Resnik (1994).
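
The idea of choosing word classes by description length can be illustrated with a small Python sketch. The abstract gives no formulas, so everything specific here is an assumption made for illustration: the hard-clustering form P(n, v) = P(Cn, Cv) P(n | Cn) P(v | Cv), the parameter-count penalty (k/2) log2 m, the restriction to merging noun classes only, and the function names `description_length` and `best_noun_merge`. It is not the authors' algorithm, only one simple instance of greedy MDL-guided merging on toy noun-verb counts.

```python
import math
from collections import Counter
from itertools import combinations


def description_length(counts, noun_class, verb_class):
    """Bits needed to encode the data plus the model (assumed hard-clustering form)."""
    m = sum(counts.values())
    pair_freq, noun_freq, verb_freq = Counter(), Counter(), Counter()
    ncls_freq, vcls_freq = Counter(), Counter()
    for (n, v), c in counts.items():
        pair_freq[(noun_class[n], verb_class[v])] += c
        noun_freq[n] += c
        verb_freq[v] += c
        ncls_freq[noun_class[n]] += c
        vcls_freq[verb_class[v]] += c

    # Data term: -sum of log2 P(n, v) under maximum-likelihood estimates,
    # with P(n, v) = P(Cn, Cv) * P(n | Cn) * P(v | Cv).
    data_bits = 0.0
    for (n, v), c in counts.items():
        cn, cv = noun_class[n], verb_class[v]
        p = ((pair_freq[(cn, cv)] / m)
             * (noun_freq[n] / ncls_freq[cn])
             * (verb_freq[v] / vcls_freq[cv]))
        data_bits -= c * math.log2(p)

    # Model term: (k / 2) * log2(m), where k counts free parameters
    # (class-pair probabilities plus within-class word probabilities).
    kn = len(set(noun_class.values()))
    kv = len(set(verb_class.values()))
    k = (kn * kv - 1) + (len(noun_class) - kn) + (len(verb_class) - kv)
    return data_bits + 0.5 * k * math.log2(m)


def best_noun_merge(counts, noun_class, verb_class):
    """Return (description length, merged assignment) for the best improving
    merge of two noun classes, or (current length, None) if none improves."""
    best = (description_length(counts, noun_class, verb_class), None)
    for a, b in combinations(sorted(set(noun_class.values())), 2):
        merged = {w: (a if c == b else c) for w, c in noun_class.items()}
        dl = description_length(counts, merged, verb_class)
        if dl < best[0]:
            best = (dl, merged)
    return best


# Toy co-occurrence counts (invented for illustration only).
counts = {("wine", "drink"): 4, ("beer", "drink"): 4,
          ("bread", "eat"): 4, ("rice", "eat"): 4}
noun_class = {n: n for n, _ in counts}   # start with one class per noun
verb_class = {v: v for _, v in counts}   # verbs are left unclustered here

start_dl = description_length(counts, noun_class, verb_class)
dl = start_dl
while True:
    new_dl, merged = best_noun_merge(counts, noun_class, verb_class)
    if merged is None:                   # no merge shortens the description
        break
    dl, noun_class = new_dl, merged

print(start_dl, dl, noun_class)
```

On this toy input the loop stops after grouping "wine" with "beer" and "bread" with "rice": merging nouns that co-occur with the same verb saves model bits without costing data bits, whereas merging all four nouns would lengthen the data encoding by more than it saves.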