Clustering words with the MDL principle

Authors:
Hang Li; Naoki Abe
Affiliations:
Theory NEC Laboratory, RWCP, NEC, Kawasaki, Japan; Theory NEC Laboratory, RWCP, NEC, Kawasaki, Japan
Venue:
COLING '96 Proceedings of the 16th conference on Computational linguistics - Volume 1
Year:
1996

Abstract

We address the problem of automatically constructing a thesaurus by clustering words based on corpus data. We view this problem as that of estimating a joint distribution over the Cartesian product of a partition of a set of nouns and a partition of a set of verbs, and propose a learning algorithm based on the Minimum Description Length (MDL) Principle for such estimation. We empirically compared our MDL-based method against the Maximum Likelihood Estimator (MLE) in word clustering and found that the former outperforms the latter. We also evaluated the method by conducting PP-attachment (prepositional phrase attachment) disambiguation experiments using an automatically constructed thesaurus. Our experimental results indicate that such a thesaurus can be used to improve accuracy in disambiguation.
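
The contrast between MDL and maximum likelihood described in the abstract can be illustrated with a small sketch. It is not the authors' algorithm. It assumes a hard clustering model in which the probability of a class pair is spread uniformly over its noun-verb pairs, and it uses the common two-part approximation in which the model cost is (k/2) log2 N bits for k free parameters and N observations. The function name `description_length`, the toy counts, and the candidate partitions are all hypothetical.

```python
import math
from collections import Counter


def description_length(counts, noun_classes, verb_classes):
    """Return (data_bits, model_bits) for one pair of partitions.

    counts: dict mapping (noun, verb) to an observed co-occurrence count.
    noun_classes, verb_classes: lists of sets that partition the vocabularies.
    Assumption: P(n, v) = P(C_n, C_v) / (|C_n| * |C_v|), with P(C_n, C_v)
    estimated by relative frequency.
    """
    total = sum(counts.values())
    noun_to_class = {n: i for i, cls in enumerate(noun_classes) for n in cls}
    verb_to_class = {v: j for j, cls in enumerate(verb_classes) for v in cls}

    # Aggregate the word-pair counts into class-pair cells.
    cell_counts = Counter()
    for (noun, verb), c in counts.items():
        cell_counts[(noun_to_class[noun], verb_to_class[verb])] += c

    # Data description length: negative log-likelihood in bits.
    data_bits = 0.0
    for (i, j), c in cell_counts.items():
        if c == 0:
            continue
        p_pair = (c / total) / (len(noun_classes[i]) * len(verb_classes[j]))
        data_bits -= c * math.log2(p_pair)

    # Model description length: (k / 2) * log2(N), an assumed approximation.
    k = len(noun_classes) * len(verb_classes) - 1
    model_bits = 0.5 * k * math.log2(total)
    return data_bits, model_bits


# Toy co-occurrence data (hypothetical).
counts = {
    ("wine", "drink"): 5,
    ("beer", "drink"): 3,
    ("bread", "eat"): 5,
    ("rice", "eat"): 3,
}
verbs = [{"drink"}, {"eat"}]
candidates = {
    "singletons": ([{"wine"}, {"beer"}, {"bread"}, {"rice"}], verbs),
    "two_classes": ([{"wine", "beer"}, {"bread", "rice"}], verbs),
}

scores = {
    name: description_length(counts, nouns, vs)
    for name, (nouns, vs) in candidates.items()
}
best_by_mle = min(scores, key=lambda name: scores[name][0])
best_by_mdl = min(scores, key=lambda name: sum(scores[name]))
print("MLE prefers:", best_by_mle)
print("MDL prefers:", best_by_mdl)
```

On this toy data the likelihood term alone favors the finest partition, since more classes can only fit the counts at least as well, while the total description length favors the coarser partition that groups the drinkable and edible nouns. This is the kind of trade-off between fit and model complexity that an MDL criterion is meant to capture.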