A Very Fast Method for Clustering Big Text Datasets

Authors:
Frank Lin;William W. Cohen
Affiliations:
Carnegie Mellon University, USA, email: {frank,wcohen}@cs.cmu.edu;Carnegie Mellon University, USA, email: {frank,wcohen}@cs.cmu.edu
Venue:
Proceedings of the 2010 conference on ECAI 2010: 19th European Conference on Artificial Intelligence
Year:
2010

Abstract

Large-scale text datasets have long eluded a family of particularly elegant and effective clustering methods that exploit pair-wise similarities between data points. The obstacle is the prohibitive cost, in both time and space, of operating on a similarity matrix: the state of the art is at best quadratic in time and in space. We present an extremely fast and simple method that also uses all pair-wise similarities between data points, and show through experiments that it matches previous methods in clustering accuracy while running in linear time and space, without sampling data points or sparsifying the similarity matrix.
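
The abstract does not spell out how all pair-wise similarities can be used at linear cost. One common way this can happen, shown here only as an illustrative sketch and not as the authors' confirmed method, is to never build the n x n similarity matrix at all. Assuming the similarity is an inner product of row-normalized document feature vectors (so S = F F^T, with F a hypothetical document-by-term matrix), any product S v can be computed as F (F^T v), which costs time proportional to the number of nonzeros in F. The function names and the toy data below are assumptions made for illustration.

```python
import numpy as np

def implicit_similarity_matvec(F, v):
    """Return (F @ F.T) @ v without ever forming the n x n matrix F @ F.T."""
    return F @ (F.T @ v)

def normalized_power_iteration(F, num_iters=10, seed=0):
    """Repeatedly apply the row-normalized implicit similarity matrix to a vector.

    Illustrative only: the number of iterations and the L1 rescaling
    are assumptions, not settings taken from the paper.
    """
    rng = np.random.default_rng(seed)
    n = F.shape[0]
    # Row sums of the implicit similarity matrix: degree = F (F^T 1).
    degree = implicit_similarity_matvec(F, np.ones(n))
    v = rng.random(n)
    v /= np.abs(v).sum()
    for _ in range(num_iters):
        v = implicit_similarity_matvec(F, v) / degree
        v /= np.abs(v).sum()  # rescale so the values stay in a stable range
    return v

# Toy "documents" over four terms (made-up data), rows scaled to unit length
# so that F @ F.T holds cosine similarities.
F = np.array([[1, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 2, 1]], dtype=float)
F /= np.linalg.norm(F, axis=1, keepdims=True)

v = np.array([0.1, 0.2, 0.3, 0.4])
explicit = (F @ F.T) @ v                      # quadratic: materializes all pairs
implicit = implicit_similarity_matvec(F, v)   # linear in the size of F
assert np.allclose(explicit, implicit)
```

The dense array keeps the sketch short; with a sparse document-term matrix the same two products avoid the quadratic time and memory of the explicit form, since only F and a few length-n vectors are stored.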