An effective web document clustering algorithm based on bisection and merge

Authors:
Ingyu Lee;Byung-Won On
Affiliations:
Sorrell College of Business, Troy University, Troy, USA;Advanced Digital Sciences Center, Singapore, Singapore
Venue:
Artificial Intelligence Review
Year:
2011

Citing 13
Cited 0

Partitioning sparse matrices with eigenvectors of graphs

SIAM Journal on Matrix Analysis and Applications
Matrix computations (3rd ed.)

Matrix computations (3rd ed.)
A parallel algorithm for multilevel graph partitioning and sparse matrix ordering

Journal of Parallel and Distributed Computing
Scientific Computing

Scientific Computing
A linear-time heuristic for improving network partitions

DAC '82 Proceedings of the 19th Design Automation Conference
Two supervised learning approaches for name disambiguation in author citations

Proceedings of the 4th ACM/IEEE-CS joint conference on Digital libraries
Disambiguating Web appearances of people in a social network

WWW '05 Proceedings of the 14th international conference on World Wide Web
Name disambiguation in author citations using a K-way spectral clustering method

Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries
Effective and scalable solutions for mixed and split citation problems in digital libraries

Proceedings of the 2nd international workshop on Information quality in information systems
Contextual search and name disambiguation in email using graphs

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
A divide-and-merge methodology for clustering

ACM Transactions on Database Systems (TODS)
Weighted Graph Cuts without Eigenvectors A Multilevel Approach

IEEE Transactions on Pattern Analysis and Machine Intelligence
PSNUS: web people name disambiguation by simple clustering with rich features

SemEval '07 Proceedings of the 4th International Workshop on Semantic Evaluations

Quantified Score

Hi-index	0.00

Visualization

Abstract

To cluster web documents, all of which have the same name entities, we attempted to use existing clustering algorithms such as K-means and spectral clustering. Unexpectedly, it turned out that these algorithms are not effective to cluster web documents. According to our intensive investigation, we found that clustering such web pages is more complicated because (1) the number of clusters (known as ground truth) is larger than two or three clusters as in general clustering problems and (2) clusters in the data set have extremely skewed distributions of cluster sizes. To overcome the aforementioned problem, in this paper, we propose an effective clustering algorithm to boost up the accuracy of K-means and spectral clustering algorithms. In particular, to deal with skewed distributions of cluster sizes, our algorithm performs both bisection and merge steps based on normalized cuts of the similarity graph G to correctly cluster web documents. Our experimental results show that our algorithm improves the performance by approximately 56% compared to spectral bisection and 36% compared to K-means.