Web document clustering based on web log mining

Authors:
Jian Yu;Xiaolin Lu;Yimin Yu
Affiliations:
Computer Institute, Zhejiang University of Science and Technology, Hangzhou, Zhejiang, China;Computer Institute, Zhejiang University of Science and Technology, Hangzhou, Zhejiang, China;Computer Institute, Zhejiang University of Science and Technology, Hangzhou, Zhejiang, China
Venue:
ICCOMP'06 Proceedings of the 10th WSEAS international conference on Computers
Year:
2006

Abstract

As an increasing number of users access information on the Web, there is a great opportunity to learn from Web server logs in order to cluster large numbers of Web documents. One approach is to cluster the documents using only the information in users' usage logs, not the content of the documents. A major advantage of this approach is that relevance is reflected objectively by the usage logs: frequent simultaneous visits to two seemingly unrelated documents suggest that they are in fact closely related. Our clustering algorithm, PDBSCAN (Partitioning-Based DBSCAN), is based on DBSCAN, a density-based algorithm that has proven capable of processing very large datasets. In addition, we show both analytically and experimentally that our method yields clustering results superior to those of DBSCAN.
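
The usage-based idea in the abstract can be illustrated with a minimal sketch. It is not the authors' PDBSCAN: the partitioning step is not described in this record, so the code below runs plain DBSCAN only. Several choices are assumptions made for illustration: "simultaneous visits" are read as visits within the same session, the distance between two documents is the Jaccard distance between their sets of visiting sessions, and the names `covisit_distance`, `dbscan`, `eps`, and `min_pts`, along with the sample sessions, are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def covisit_distance(sessions):
    """Jaccard distance between documents, from the sessions that visited them.

    Assumption: each session is a list of document paths taken from a log.
    """
    visitors = defaultdict(set)  # document -> ids of sessions that visited it
    for sid, session in enumerate(sessions):
        for doc in session:
            visitors[doc].add(sid)
    docs = sorted(visitors)
    dist = {d: {d: 0.0} for d in docs}
    for a, b in combinations(docs, 2):
        shared = len(visitors[a] & visitors[b])
        union = len(visitors[a] | visitors[b])
        dist[a][b] = dist[b][a] = 1.0 - shared / union
    return docs, dist


def dbscan(docs, dist, eps, min_pts):
    """Plain DBSCAN over a precomputed distance table; label -1 marks noise."""
    labels = {}
    cluster = -1
    for doc in docs:
        if doc in labels:
            continue
        neighbors = [o for o in docs if dist[doc][o] <= eps]
        if len(neighbors) < min_pts:  # the count includes the point itself
            labels[doc] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[doc] = cluster
        queue = [o for o in neighbors if o != doc]
        while queue:
            other = queue.pop()
            if labels.get(other, -1) != -1:
                continue  # already assigned to a cluster
            was_noise = other in labels
            labels[other] = cluster
            if was_noise:
                continue  # border point: joins the cluster, is not expanded
            other_neighbors = [o for o in docs if dist[other][o] <= eps]
            if len(other_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(other_neighbors)
    return labels


# Hypothetical sessions, as they might be reconstructed from a server log.
sessions = [
    ["/a", "/b", "/c"],
    ["/a", "/b"],
    ["/a", "/b", "/c"],
    ["/x", "/y"],
    ["/x", "/y"],
    ["/z"],
]
docs, dist = covisit_distance(sessions)
labels = dbscan(docs, dist, eps=0.5, min_pts=2)
```

In this toy input, documents that share most of their sessions end up in the same cluster, and a document visited only on its own is marked as noise. Document content is never inspected, which is the point the abstract makes about relying on usage logs alone.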