Distributed collaborative Web document clustering using cluster keyphrase summaries

Authors:
Khaled Hammouda;Mohamed Kamel
Affiliations:
Department of Systems Design Engineering, Pattern Analysis and Machine Intelligence (PAMI) Research Group, University of Waterloo, Waterloo, Ont., Canada N2L 3G1;Department of Electrical and Computer Engineering, Pattern Analysis and Machine Intelligence (PAMI) Research Group, University of Waterloo, Waterloo, Ont., Canada N2L 3G1
Venue:
Information Fusion
Year:
2008

Abstract

For the past few decades, mainstream data clustering technologies have been fundamentally based on centralized operation: data sets were of small, manageable sizes and usually resided on one site that belonged to one organization. Today, data sets are enormous and are usually spread across distributed sites, the primary example being the Web. This has created a need for performing clustering in distributed environments. Distributed clustering solves two problems: the infeasibility of collecting data at a central site, due to technical or privacy limitations, and the intractability of traditional clustering algorithms on huge data sets. In this paper we propose a distributed collaborative clustering approach for clustering Web documents in distributed environments. We adopt a peer-to-peer model, where the main objective is to allow nodes in a network to first form independent opinions of local document clusterings, then collaborate with peers to enhance the local clusterings. Information exchanged between peers is minimized through the use of cluster summaries in the form of keyphrases extracted from the clusters. This summarized view of peer data enables nodes to selectively request merging of remote data to enhance local clusters. Initial clustering, as well as merging peer data with local clusters, uses a clustering method called similarity histogram-based clustering, which is based on keeping a tight similarity distribution within clusters. This approach achieves significant improvement in local clustering solutions without the cost of centralized clustering, while maintaining the initial local clustering structure. Results show that larger networks exhibit larger improvements, up to a 15% improvement in clustering quality, albeit with lower absolute clustering quality than smaller networks.
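
The collaboration step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every name in it (`summarize`, `tight_fraction`, `collaborate`) and every threshold value is a hypothetical choice made for illustration. Keyphrase extraction is replaced by simple bigram counting, and similarity histogram-based clustering is replaced by a crude check that the fraction of sufficiently similar document pairs in a cluster does not drop when a remote document is added.

```python
from collections import Counter
from itertools import combinations


def bigrams(text):
    """Lowercased word bigrams; a crude stand-in for real keyphrase extraction."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


def summarize(cluster, k=3):
    """Summarize a cluster as the k bigrams that occur in the most documents."""
    counts = Counter()
    for doc in cluster:
        counts.update(set(bigrams(doc)))
    # Sort by descending document frequency, then alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {phrase for phrase, _ in ranked[:k]}


def jaccard(a, b):
    """Jaccard overlap between two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def doc_similarity(d1, d2):
    """Similarity of two documents as the Jaccard overlap of their bigrams."""
    return jaccard(set(bigrams(d1)), set(bigrams(d2)))


def tight_fraction(cluster, threshold=0.2):
    """Fraction of pairwise document similarities at or above the threshold."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 1.0
    return sum(doc_similarity(a, b) >= threshold for a, b in pairs) / len(pairs)


def collaborate(local_cluster, remote_cluster, min_overlap=0.2):
    """Request remote documents only if the two summaries overlap, and keep a
    document only if the cluster's similarity distribution stays as tight."""
    if jaccard(summarize(local_cluster), summarize(remote_cluster)) < min_overlap:
        return list(local_cluster)  # summaries disagree: nothing is requested
    merged = list(local_cluster)
    for doc in remote_cluster:
        if tight_fraction(merged + [doc]) >= tight_fraction(merged):
            merged.append(doc)
    return merged


# Hypothetical toy data: one local cluster and one peer cluster.
local = ["web document clustering methods", "web document clustering results"]
remote = [
    "distributed web document clustering",
    "web document clustering survey",
    "web document of pasta recipes",
]
# The first two remote documents are merged; the pasta document is rejected
# because it would loosen the cluster's similarity distribution.
print(collaborate(local, remote))
```

In this sketch only the summaries need to cross the network before a merge is requested, which mirrors the abstract's point that cluster summaries keep the information exchanged between peers small. A peer whose summary shares no keyphrases with the local cluster leaves the local cluster unchanged.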