Document clustering using small world communities

Authors:
Brant W. Chee; Bruce Schatz
Affiliations:
University of Illinois at Urbana-Champaign, Urbana, IL; University of Illinois at Urbana-Champaign, Urbana, IL
Venue:
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Year:
2007

Abstract

Words in natural language documents exhibit a small world network structure, and the physics community has developed an extensive supply of algorithms for extracting community structure from such networks. We present a novel method for semantically clustering a large collection of documents using small world communities. The method combines modified physics algorithms with traditional information retrieval techniques in three steps: a term network is generated from the document collection, the terms are clustered into small world communities, and the resulting semantic term clusters are used to generate overlapping document clusters. The algorithm combines the speed of single-link clustering with the quality of complete-link clustering. Clustering takes place in nearly real time, and expert users judged the results to be coherent. The algorithm thus occupies a middle ground between speed and quality in document clustering.
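
The three steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not specify the modified physics algorithms, so connected components of a thresholded co-occurrence graph stand in for small world community detection. The function names, the thresholds `min_cooccurrence` and `min_shared_terms`, and the toy documents are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_term_network(docs, min_cooccurrence=2):
    """Link two terms when they co-occur in at least `min_cooccurrence` documents."""
    counts = defaultdict(int)
    for text in docs.values():
        terms = sorted(set(text.lower().split()))
        for a, b in combinations(terms, 2):
            counts[(a, b)] += 1
    graph = defaultdict(set)
    for (a, b), n in counts.items():
        if n >= min_cooccurrence:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def term_communities(graph):
    """Stand-in for community detection (assumption): connected components."""
    seen, communities = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            term = stack.pop()
            if term in component:
                continue
            component.add(term)
            stack.extend(graph[term] - component)
        seen |= component
        communities.append(component)
    return communities


def document_clusters(docs, communities, min_shared_terms=2):
    """Put a document in every cluster whose term community it shares enough terms with."""
    clusters = []
    for community in communities:
        members = {
            doc_id
            for doc_id, text in docs.items()
            if len(community & set(text.lower().split())) >= min_shared_terms
        }
        clusters.append(members)
    return clusters


# Hypothetical toy collection, for illustration only.
docs = {
    "d1": "gene protein expression",
    "d2": "gene protein sequence",
    "d3": "network graph community",
    "d4": "network graph gene protein",
}
graph = build_term_network(docs)
communities = term_communities(graph)
clusters = document_clusters(docs, communities)
```

In this toy collection the term network splits into two communities, and document `d4` shares two terms with each, so it lands in both document clusters. That is the kind of overlap the abstract describes; a real implementation would replace `term_communities` with an actual community-extraction algorithm.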