A probabilistic relational approach for web document clustering

Authors:
E. Fersini;E. Messina;F. Archetti
Affiliations:
Dipartimento di Informatica Sistemistica e Comunicazione, Università degli Studi di Milano-Bicocca, Italy;Dipartimento di Informatica Sistemistica e Comunicazione, Università degli Studi di Milano-Bicocca, Italy;Dipartimento di Informatica Sistemistica e Comunicazione, Università degli Studi di Milano-Bicocca, Italy and Consorzio Milano Ricerche, Via Cozzi 53, 20126 Milano, Italy
Venue:
Information Processing and Management: an International Journal
Year:
2010

Abstract

The exponential growth of information available on the World Wide Web and retrievable by search engines has created the need for efficient and effective methods of organizing relevant content. Document clustering plays an important role in this task and remains an interesting and challenging problem in web computing. In this paper we present a document clustering method that takes into account both the content and the hyperlink structure of a web page collection, where each document is viewed as a set of semantic units. We exploit this representation to determine the strength of the relation between two linked pages and to define a relational clustering algorithm based on a probabilistic graph representation. The experimental results show that the proposed approach, called RED-clustering, outperforms two of the best-known clustering algorithms, k-Means and Expectation Maximization.