Clustering by committee

Authors:
Dekang Lin;Patrick Andre Pantel
Affiliations:
-;-
Venue:
Clustering by committee
Year:
2003

Abstract

Text contains a wealth of knowledge about who we are, what we know, how we think, and how we communicate. We are just beginning to tap into the information available in the tales we read to our children, the narratives that capture our thoughts, and the stories that shape our world. In this work, we present some recent advances in automatically acquiring knowledge from text. We propose a general-purpose clustering algorithm called CBC (Clustering By Committee), with which we organize documents according to topics and discover concepts and word senses. We explore the value of these systems by experimenting with two novel evaluation methodologies that attempt to define what a word sense is and to define the quality of a particular clustering.

CBC addresses the general goal of clustering, which is to group data elements such that the intra-group similarities are high and the inter-group similarities are low. Using sets of representative elements called committees, CBC attempts to discover cluster centroids that unambiguously describe the members of a possible class. We show that CBC outperforms several common clustering algorithms in document clustering and concept discovery tasks.

Document clustering is practical in many information retrieval tasks such as document browsing and the organization and viewing of retrieval results. Broad-coverage lexical resources such as WordNet are extremely useful but are mostly hand generated. They often include many rare senses while missing domain-specific senses. Automatically generating such resources is useful for many applications such as word sense disambiguation, question answering, and ontology construction. Sample concepts discovered by CBC include baking ingredients, symptoms, academic departments, Impressionists, Canadian provinces, musical instruments, and emotions.

We present two novel evaluation methodologies. The first is based on the editing distance between output clusters and a manually constructed answer key; it measures how much work is necessary to convert one into the other. The second, for the word sense discovery system, measures the precision and recall of discovered senses. Using WordNet, we formulate what constitutes a correct sense of a word.
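
The clustering goal and the role of committees described in the abstract can be illustrated with a small sketch. It is not the CBC algorithm itself: the committees are supplied by hand rather than discovered, and the feature vectors, the cosine similarity measure, and every function name (`cosine`, `centroid`, `assign`, `intra_inter`) are assumptions made for illustration only. The sketch averages each committee into a centroid, assigns every element to its most similar centroid, and then compares the mean intra-group and inter-group similarity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def assign(elements, committees):
    """Assign each element to the committee whose centroid it is most similar to."""
    centroids = {
        name: centroid([elements[m] for m in members])
        for name, members in committees.items()
    }
    clusters = {name: [] for name in committees}
    for elem, vec in elements.items():
        best = max(centroids, key=lambda name: cosine(vec, centroids[name]))
        clusters[best].append(elem)
    return clusters


def intra_inter(elements, clusters):
    """Mean pairwise similarity within clusters and across clusters."""
    label = {e: name for name, members in clusters.items() for e in members}
    names = sorted(label)
    intra, inter = [], []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            sim = cosine(elements[a], elements[b])
            (intra if label[a] == label[b] else inter).append(sim)

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return mean(intra), mean(inter)


# Hypothetical toy feature vectors; the values are invented for illustration.
elements = {
    "flour": [5, 1, 0],
    "sugar": [4, 2, 0],
    "yeast": [3, 0, 1],
    "piano": [0, 1, 5],
    "violin": [1, 0, 4],
    "flute": [0, 2, 3],
}

# Hand-picked committees; CBC would discover these automatically.
committees = {
    "baking ingredients": ["flour", "sugar"],
    "musical instruments": ["piano", "violin"],
}

clusters = assign(elements, committees)
intra, inter = intra_inter(elements, clusters)
print(clusters)
print(f"mean intra-group similarity: {intra:.3f}")
print(f"mean inter-group similarity: {inter:.3f}")
```

In this toy setting the non-committee elements ("yeast" and "flute") join the cluster whose committee centroid they most resemble, and the mean intra-group similarity exceeds the mean inter-group similarity, which is the property the abstract names as the goal of clustering.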