Clustering topic-related web pages is a foundational step in exploiting large page collections, such as those held by search engines and web archive systems, which collect and preserve billions of web pages. The task is challenging in both efficiency and accuracy. In this paper we present a novel clustering algorithm for large-scale topical web pages that achieves high efficiency together with considerably high accuracy. The algorithm uses a two-phase divide-and-conquer framework to address efficiency, and within that framework it combines link analysis and content analysis to mine the topical similarity between pages and so achieve high accuracy. We conducted a comprehensive experiment to evaluate the method in terms of effectiveness, efficiency, and result quality.
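The abstract gives no algorithmic detail, so the following is only a minimal sketch of what a two-phase divide-and-conquer clustering with a mixed link/content similarity could look like. Everything concrete in it is an assumption made for illustration and is not taken from the paper: the divide step (connected components of the link graph), the conquer step (greedy single-pass clustering against each cluster's first page), the similarity measures (cosine over term counts, Jaccard over link neighbourhoods), and the hypothetical names `alpha`, `threshold`, `divide`, `conquer`, and `cluster_pages`.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a, b):
    """Jaccard overlap between two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def combined_similarity(pages, p, q, alpha=0.5):
    """Assumed mix: alpha * content similarity + (1 - alpha) * link similarity."""
    content = cosine(pages[p]["terms"], pages[q]["terms"])
    # Link neighbourhood = the page itself plus the pages it links to.
    link = jaccard(pages[p]["links"] | {p}, pages[q]["links"] | {q})
    return alpha * content + (1 - alpha) * link


def divide(pages):
    """Phase 1 (assumed): split pages into connected components of the link graph."""
    parent = {pid: pid for pid in pages}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for pid, page in pages.items():
        for target in page["links"]:
            if target in parent:  # ignore links that leave the collection
                parent[find(pid)] = find(target)

    groups = defaultdict(list)
    for pid in pages:
        groups[find(pid)].append(pid)
    return list(groups.values())


def conquer(pages, group, threshold=0.3, alpha=0.5):
    """Phase 2 (assumed): greedy single-pass clustering inside one group."""
    clusters = []
    for pid in group:
        for cluster in clusters:
            # Compare against the cluster's first page as its representative.
            if combined_similarity(pages, pid, cluster[0], alpha) >= threshold:
                cluster.append(pid)
                break
        else:
            clusters.append([pid])
    return clusters


def cluster_pages(pages, threshold=0.3, alpha=0.5):
    """Run both phases and return a flat list of clusters (lists of page ids)."""
    result = []
    for group in divide(pages):
        result.extend(conquer(pages, group, threshold, alpha))
    return result


# Toy data, invented for illustration only.
example_pages = {
    "a": {"terms": Counter("python clustering tutorial".split()), "links": {"b"}},
    "b": {"terms": Counter("python clustering guide".split()), "links": {"a"}},
    "c": {"terms": Counter("soccer league results".split()), "links": {"d"}},
    "d": {"terms": Counter("soccer league table".split()), "links": {"c"}},
    "e": {"terms": Counter("cooking pasta recipe".split()), "links": {"a"}},
}
```

In this sketch the divide step keeps the expensive pairwise comparisons confined to small groups, which is one plausible reading of how a divide-and-conquer framework addresses efficiency. On the toy data, calling `cluster_pages(example_pages)` puts page `e` in the same link component as `a` and `b` but leaves it in its own cluster, because its content does not match theirs.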