Text contains a wealth of knowledge about who we are, what we know, how we think, and how we communicate. We are just beginning to tap into the information that is available in the tales we read to our children, the narratives that capture our thoughts, and the stories that shape our world. In this work, we present some recent advances in automatically acquiring knowledge from text. We propose a general-purpose clustering algorithm called CBC (Clustering By Committee), which we use to organize documents by topic and to discover concepts and word senses. We explore the value of these systems by experimenting with two novel evaluation methodologies that attempt to define what a word sense is and to define the quality of a particular clustering.

CBC addresses the general goal of clustering, which is to group data elements such that the intra-group similarities are high and the inter-group similarities are low. Using sets of representative elements called committees, CBC attempts to discover cluster centroids that unambiguously describe the members of a possible class. CBC will be shown to outperform several common clustering algorithms in document clustering and concept discovery tasks.

Document clustering is practical in many information retrieval tasks such as document browsing and the organization and viewing of retrieval results. Broad-coverage lexical resources such as WordNet are extremely useful but are mostly hand generated. They often include many rare senses while missing domain-specific senses. Generating such resources automatically is useful for many applications such as word sense disambiguation, question answering, and ontology construction. Sample concepts discovered by CBC include baking ingredients, symptoms, academic departments, Impressionists, Canadian provinces, musical instruments, and emotions.

We present two novel evaluation methodologies. The first is based on the editing distance between output clusters and a manually constructed answer key; it measures how much work is necessary to convert one into the other. The second, for the word sense discovery system, measures the precision and recall of discovered senses. Using WordNet, we formulate what constitutes a correct sense of a word.
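As a rough illustration of the precision and recall measurement described above, the following is a minimal sketch, not the evaluation procedure used in this work. It assumes that each discovered sense has already been mapped to a gold-standard sense identifier (for example, a WordNet sense), that a discovered sense counts as correct when its identifier appears in the gold set for that word, and that scores are micro-averaged over all words. The function name `sense_precision_recall` and the toy sense labels are hypothetical.

```python
def sense_precision_recall(discovered, gold):
    """Micro-averaged precision and recall of discovered word senses.

    discovered, gold: dict mapping a word to a set of sense identifiers.
    Assumption: a discovered sense is correct if its identifier is in
    the gold set for the same word. How senses are mapped to identifiers
    is outside the scope of this sketch.
    """
    correct = 0        # discovered senses that are also in the gold set
    n_discovered = 0   # all discovered senses
    n_gold = 0         # all gold senses

    for word in set(discovered) | set(gold):
        found = discovered.get(word, set())
        expected = gold.get(word, set())
        correct += len(found & expected)
        n_discovered += len(found)
        n_gold += len(expected)

    precision = correct / n_discovered if n_discovered else 0.0
    recall = correct / n_gold if n_gold else 0.0
    return precision, recall


# Toy data, invented purely for illustration.
discovered = {
    "bass": {"fish", "instrument"},
    "plant": {"factory", "spy"},
}
gold = {
    "bass": {"fish", "instrument", "voice"},
    "plant": {"factory", "organism"},
}

precision, recall = sense_precision_recall(discovered, gold)
print(precision, recall)  # 0.75 0.6
```

In this toy case, three of the four discovered senses are in the gold set (precision 0.75), and three of the five gold senses were discovered (recall 0.6). Averaging per word instead of over all senses would be an equally plausible design choice.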