A matrix density based algorithm to hierarchically co-cluster documents and words

Authors:
Bhushan Mandhani;Sachindra Joshi;Krishna Kummamuru
Affiliations:
Indian Institute of Technology, Bombay, India;IBM India Research Lab, New Delhi, India;IBM India Research Lab, New Delhi, India
Venue:
WWW '03 Proceedings of the 12th international conference on World Wide Web
Year:
2003

Abstract

This paper proposes an algorithm to hierarchically cluster documents. Each cluster consists of a cluster of documents and an associated cluster of words, and is thus a document-word co-cluster. Note that the vector model for documents yields the document-word matrix, of which every co-cluster is a submatrix. One would intuitively expect a submatrix made up of high values to be a good document cluster, with the corresponding word cluster containing its most distinctive features. Our algorithm exploits this intuition: we define matrix density and use density considerations throughout the algorithm.

The algorithm is partitional-agglomerative. The partitioning step identifies dense submatrices such that their row sets partition the row set of the complete matrix. The hierarchical agglomerative step merges the most "similar" submatrices until the required number of clusters remains (for a flat clustering) or until only the single complete matrix is left (for a hierarchical arrangement of documents). The algorithm also generates apt labels for each cluster or hierarchy node. The similarity measure used for merging exploits the fact that the clusters are co-clusters, which is a key point of difference from existing agglomerative algorithms. We refer to the proposed algorithm as RPSA (Rowset Partitioning and Submatrix Agglomeration). We have compared it as a clustering algorithm with Spherical K-Means and Spectral Graph Partitioning, and have also evaluated some hierarchies generated by the algorithm.
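The agglomerative step described above can be illustrated with a small sketch. The abstract does not give the definition of matrix density or of the similarity measure, so the code below makes two assumptions that may differ from the paper: density is taken to be the mean entry of a submatrix, and the similarity of two co-clusters is taken to be the density of their union. The names `density`, `merge_score`, and `agglomerate` are hypothetical, and this is not RPSA itself, only a toy instance of density-driven merging of co-clusters.

```python
def density(matrix, rows, cols):
    """Mean entry of the submatrix rows x cols (assumed definition of density)."""
    if not rows or not cols:
        return 0.0
    total = sum(matrix[r][c] for r in rows for c in cols)
    return total / (len(rows) * len(cols))


def union(a, b):
    """Union of two co-clusters, each given as (row indices, column indices)."""
    return (sorted(set(a[0]) | set(b[0])), sorted(set(a[1]) | set(b[1])))


def merge_score(matrix, a, b):
    """Assumed similarity: density of the co-cluster obtained by merging a and b."""
    rows, cols = union(a, b)
    return density(matrix, rows, cols)


def agglomerate(matrix, coclusters, k):
    """Greedily merge the highest-scoring pair until k co-clusters remain."""
    clusters = [(list(r), list(c)) for r, c in coclusters]
    while len(clusters) > max(k, 1):
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = merge_score(matrix, clusters[i], clusters[j])
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        merged = union(clusters[i], clusters[j])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters


# Toy document-word matrix: documents 0-1 use words 0-1, documents 2-3 use words 2-3.
M = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-picked initial co-clusters whose row sets partition the rows of M.
initial = [([0], [0, 1]), ([1], [0, 1]), ([2], [2, 3]), ([3], [2, 3])]

flat = agglomerate(M, initial, 2)
```

On this toy matrix the two dense blocks are recovered as the two final co-clusters. Calling `agglomerate(M, initial, 1)` instead merges down to the single complete matrix, which corresponds to the root of a hierarchy.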