Text document clustering based on neighbors

Authors:
Congnan Luo; Yanjun Li; Soon M. Chung
Affiliations:
Teradata Corporation, San Diego, CA 92127, USA; Department of Computer and Information Science, Fordham University, Bronx, NY 10458, USA; Department of Computer Science and Engineering, Wright State University, Dayton, OH 45435, USA
Venue:
Data & Knowledge Engineering
Year:
2009

Abstract

Clustering is a powerful data mining technique for topic discovery from text documents. Partitional clustering algorithms, such as the k-means family, are reported to perform well on document clustering. They treat clustering as an optimization problem: documents are grouped into k clusters so that a particular criterion function is minimized or maximized. The criterion function usually measures the similarity between two documents with the cosine function, but this may not work well when the clusters are not well separated. To address this problem, we applied the concepts of neighbors and link, introduced in [S. Guha, R. Rastogi, K. Shim, ROCK: a robust clustering algorithm for categorical attributes, Information Systems 25 (5) (2000) 345-366], to document clustering. If two documents are similar enough, they are considered neighbors of each other, and the link between two documents is the number of their common neighbors. Rather than considering only pairwise similarity, neighbors and link bring global information into the measurement of the closeness of two documents. In this paper, we propose to use neighbors and link in the k-means family of algorithms in three ways: a new method for selecting initial cluster centroids based on the ranks of candidate documents; a new similarity measure that combines the cosine and link functions; and a new heuristic function for selecting a cluster to split based on the neighbors of the cluster centroids. Our experimental results on real-life data sets demonstrate that the proposed methods can significantly improve the accuracy of document clustering without increasing the execution time much.
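
The neighbor and link concepts described in the abstract can be illustrated with a minimal Python sketch. It assumes documents are given as equal-length term-weight vectors and that two documents are neighbors when their cosine similarity reaches a threshold. The function names, the threshold `theta`, the weight `alpha`, the normalization by `max_link`, and the choice to count a document as its own neighbor are all assumptions made for illustration; the abstract does not specify the exact form of the combined measure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def neighbor_matrix(docs, theta):
    """m[i][j] is 1 when documents i and j are neighbors, else 0.

    Assumption: "similar enough" means cosine >= theta, so a non-zero
    document is also its own neighbor.
    """
    n = len(docs)
    return [
        [1 if cosine(docs[i], docs[j]) >= theta else 0 for j in range(n)]
        for i in range(n)
    ]


def link_matrix(m):
    """link[i][j] is the number of common neighbors of documents i and j.

    This equals entry (i, j) of the product of the neighbor matrix
    with itself.
    """
    n = len(m)
    return [
        [sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def combined_similarity(cos_ij, link_ij, max_link, alpha):
    """Hypothetical weighted mix of a normalized link count and cosine.

    alpha in [0, 1] controls how much weight the link term receives.
    """
    link_term = link_ij / max_link if max_link > 0 else 0.0
    return alpha * link_term + (1.0 - alpha) * cos_ij


# Toy corpus over three terms; document 3 shares no terms with the others.
docs = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 1]]
m = neighbor_matrix(docs, theta=0.5)
link = link_matrix(m)
max_link = max(max(row) for row in link)
score = combined_similarity(cosine(docs[0], docs[2]), link[0][2], max_link, alpha=0.5)
```

In this toy corpus, documents 0 and 2 share no terms, so their cosine similarity is zero, yet both are neighbors of document 1 and therefore have a link of 1. This is the sense in which the link carries information beyond the pairwise comparison.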