Document clustering with committees

Authors:
Patrick Pantel; Dekang Lin
Affiliations:
University of Alberta, Edmonton, Alberta, Canada; University of Alberta, Edmonton, Alberta, Canada
Venue:
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2002

Abstract

Document clustering is useful in many information retrieval tasks, including document browsing, organizing and viewing retrieval results, and generating Yahoo-like hierarchies of documents. The general goal of clustering is to group data elements so that intra-group similarities are high and inter-group similarities are low. We present a clustering algorithm called CBC (Clustering By Committee) that is shown to produce higher-quality clusters in document clustering tasks than several well-known clustering algorithms. It first discovers a set of tight clusters (high intra-group similarity), called committees, that are well scattered in the similarity space (low inter-group similarity). The union of the committees is only a subset of all elements. The algorithm then assigns each element to its most similar committee. Evaluating cluster quality has always been difficult. We present a new evaluation methodology based on the editing distance between output clusters and manually constructed classes (the answer key). This evaluation measure is more intuitive and easier to interpret than previous evaluation measures.
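
The final step described in the abstract, assigning each element to its most similar committee, can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the committees have already been discovered, that documents are sparse term-weight vectors, that a committee is represented by the centroid of its members, and that similarity is cosine similarity. The function names (`cosine`, `centroid`, `assign_to_committees`) and the toy data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Average a non-empty list of sparse vectors."""
    total = {}
    for vec in vectors:
        for term, w in vec.items():
            total[term] = total.get(term, 0.0) + w
    return {term: w / len(vectors) for term, w in total.items()}


def assign_to_committees(docs, committees):
    """Assign every document to its most similar committee.

    docs: dict mapping document id -> sparse vector.
    committees: dict mapping committee name -> list of member document ids
        (assumed to be given; committee discovery is not sketched here).
    """
    # Assumption: a committee is represented by the centroid of its members.
    centroids = {
        name: centroid([docs[d] for d in members])
        for name, members in committees.items()
    }
    clusters = {name: [] for name in committees}
    for doc_id, vec in docs.items():
        best = max(centroids, key=lambda name: cosine(vec, centroids[name]))
        clusters[best].append(doc_id)
    return clusters


# Toy data (hypothetical): six documents, two hand-picked committees.
docs = {
    "d1": {"goal": 2.0, "team": 1.0},
    "d2": {"team": 2.0, "match": 1.0},
    "d3": {"match": 1.0, "goal": 1.0},
    "d4": {"stock": 2.0, "market": 1.0},
    "d5": {"market": 2.0, "bank": 1.0},
    "d6": {"bank": 1.0, "stock": 1.0},
}
committees = {"sports": ["d1", "d2"], "finance": ["d4", "d5"]}
clusters = assign_to_committees(docs, committees)
```

In this toy run the committees cover only four of the six documents, mirroring the statement that the union of the committees is a subset of all elements; the remaining documents (`d3`, `d6`) are placed by similarity to the committee centroids.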