An efficient clustering algorithm for large-scale topical web pages

Authors:
Lei Wang;Peng Chen;Lian'en Huang
Affiliations:
Peking University, Shenzhen, China;Peking University, Shenzhen, China;Peking University, Shenzhen, China
Venue:
Proceedings of the 18th ACM conference on Information and knowledge management
Year:
2009

Abstract

Clustering topic-related web pages is recognized as a foundational step in exploiting large collections of web pages, such as those held by search engines and web archive systems, which collect and preserve billions of pages. However, the task poses serious challenges in both efficiency and accuracy. In this paper we present a novel clustering algorithm for large-scale topical web pages that achieves high efficiency together with considerably high accuracy. The algorithm uses a two-phase divide-and-conquer framework to address the efficiency problem, within which both link analysis and content analysis are used to mine the topical similarity between pages and so achieve high accuracy. A comprehensive experiment was conducted to evaluate the method in terms of effectiveness, efficiency, and quality of results.
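The abstract does not specify what the two phases are, so the following is only a minimal sketch of one way a two-phase divide-and-conquer clustering could combine link and content analysis. It is not the authors' algorithm. It assumes that the first phase splits pages into connected components of the link graph, and that the second phase clusters each component greedily by cosine similarity of term-frequency vectors. All function names, the threshold parameter, and the choice of similarity measure are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def link_partitions(pages, links):
    """Assumed phase 1: split pages into connected components of the link graph."""
    parent = {p: p for p in pages}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        if a in parent and b in parent:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for p in pages:
        groups[find(p)].append(p)
    return list(groups.values())


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def content_clusters(partition, texts, threshold):
    """Assumed phase 2: greedy single-pass clustering inside one partition."""
    clusters = []  # each entry: (vector of the first member, list of members)
    for page in partition:
        vec = Counter(texts[page].lower().split())
        for rep, members in clusters:
            if cosine(rep, vec) >= threshold:
                members.append(page)
                break
        else:
            clusters.append((vec, [page]))
    return [members for _, members in clusters]


def two_phase_cluster(texts, links, threshold=0.5):
    """Run the link-based split, then content clustering within each part."""
    result = []
    for part in link_partitions(list(texts), links):
        result.extend(content_clusters(part, texts, threshold))
    return result
```

In this sketch the efficiency gain comes from the first phase: content vectors are compared only between pages in the same link component, never across the whole collection. A page with no links to the others therefore ends up in its own cluster even if its text matches another page, which shows how strongly the result depends on the partitioning assumption.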