Hierarchical clustering of words

Authors:
Akira Ushioda
Affiliations:
ATR Interpreting Telecommunications Research Laboratories, Kyoto, Japan
Venue:
COLING '96 Proceedings of the 16th conference on Computational linguistics - Volume 2
Year:
1996



Abstract

This paper describes a data-driven method for hierarchical clustering of words in which a large vocabulary of English words is clustered bottom-up, with respect to corpora ranging in size from 5 to 50 million words, using a greedy algorithm that tries to minimize average loss of mutual information of adjacent classes. The resulting hierarchical clusters of words are then naturally transformed to a bit-string representation of (i.e., word bits for) all the words in the vocabulary. Introducing word bits into the ATR Decision-Tree POS Tagger is shown to significantly reduce the tagging error rate. Portability of word bits from one domain to another is also discussed.
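
The two ideas in the abstract, greedy bottom-up merging guided by mutual information of adjacent classes and reading bit strings off the resulting tree, can be illustrated with a toy sketch. This is not the paper's implementation: the function names (`average_mutual_information`, `greedy_merge_tree`, `word_bits`), the bigram-count input format, and the convention of 0 for a left branch and 1 for a right branch are all assumptions made for illustration. It recomputes the mutual information from scratch for every candidate merge, which is only feasible for a handful of words.

```python
import math
from collections import Counter
from itertools import combinations

def average_mutual_information(bigrams, assign):
    """AMI of adjacent classes, from word bigram counts and a word -> class map."""
    joint, left, right = Counter(), Counter(), Counter()
    total = 0
    for (w1, w2), n in bigrams.items():
        c1, c2 = assign[w1], assign[w2]
        joint[c1, c2] += n
        left[c1] += n
        right[c2] += n
        total += n
    return sum(
        (n / total) * math.log2(n * total / (left[c1] * right[c2]))
        for (c1, c2), n in joint.items()
    )

def greedy_merge_tree(bigrams):
    """Merge classes bottom-up; each step takes the pair whose merge loses the least AMI."""
    words = sorted({w for pair in bigrams for w in pair})
    assign = {w: i for i, w in enumerate(words)}  # every word starts in its own class
    trees = {i: w for i, w in enumerate(words)}   # class id -> subtree (a leaf is a word)
    while len(trees) > 1:
        best = None
        for a, b in combinations(sorted(trees), 2):
            # Hypothetically fold class b into class a and score the result.
            trial = {w: (a if c == b else c) for w, c in assign.items()}
            score = average_mutual_information(bigrams, trial)
            if best is None or score > best[0]:
                best = (score, a, b)
        _, a, b = best
        assign = {w: (a if c == b else c) for w, c in assign.items()}
        trees[a] = (trees[a], trees.pop(b))       # record the merge as a binary node
    return next(iter(trees.values()))

def word_bits(tree, prefix=""):
    """Read bit strings off a binary tree: 0 for the left branch, 1 for the right."""
    if isinstance(tree, str):
        return {tree: prefix}
    left, right = tree
    return {**word_bits(left, prefix + "0"), **word_bits(right, prefix + "1")}

# Toy bigram counts (invented for illustration, not data from the paper).
toy_bigrams = {
    ("the", "cat"): 3, ("the", "dog"): 3,
    ("a", "cat"): 2, ("a", "dog"): 2,
    ("cat", "sleeps"): 2, ("dog", "sleeps"): 2,
}
toy_bits = word_bits(greedy_merge_tree(toy_bigrams))
```

In such a representation, words that share a long bit-string prefix sit in the same subtree, so a prefix of a given length acts as a coarser or finer word class. That is what would make the bits usable as features in a decision-tree tagger of the kind the abstract mentions.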