A method for pinpoint clustering of web pages with pseudo-clique search

Authors:
Makoto Haraguchi;Yoshiaki Okubo
Affiliations:
Division of Computer Science, Graduate School of Information Science and Technology, Hokkaido University, Sapporo, Japan;Division of Computer Science, Graduate School of Information Science and Technology, Hokkaido University, Sapporo, Japan
Venue:
Proceedings of the 2005 international conference on Federation over the Web
Year:
2005



Abstract

This paper presents a method for Pinpoint Clustering of web pages. We aim to find useful clusters of web pages that are significant in the sense that their contents are similar to those of higher-ranked pages. Because users usually pay little attention to lower-ranked pages, such pages are discarded unconditionally even if their contents are similar to those of some high-ranked pages. Our method extracts such hidden pages together with the significant higher-ranked pages as a cluster. As a result, our clusters can provide users with new, valuable information. To obtain such clusters, we first extract semantic correlations among terms by applying Singular Value Decomposition (SVD) to the term-document matrix generated from a corpus. Based on these correlations, we evaluate potential similarities among the web pages to be clustered. The set of web pages is then represented as a weighted graph G based on the similarities and the page ranks. Our clusters are found as pseudo-cliques in G, and we present an algorithm for finding the Top-N weighted pseudo-cliques. Our experimental result shows that a valuable cluster can actually be extracted by our method. We also discuss an idea for making the meanings of clusters clearer. With the help of Formal Concept Analysis, our clusters, called FC-based clusters, can be given clear meanings. Our preliminary experiments suggest that the extended method is a promising approach to finding meaningful clusters.
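
The graph step described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: here a pseudo-clique is assumed to be a vertex set whose edge density reaches a chosen bound, the search is brute force rather than branch-and-bound, and the similarity values, the integer vertex weights standing in for page ranks, and all function names and thresholds are hypothetical.

```python
from itertools import combinations


def build_edges(similarity, threshold):
    """Keep only page pairs (i, j), i < j, whose similarity reaches the threshold."""
    return {pair for pair, sim in similarity.items() if sim >= threshold}


def density(nodes, edges):
    """Fraction of node pairs in `nodes` that are connected by an edge."""
    pairs = list(combinations(nodes, 2))
    if not pairs:
        return 1.0
    return sum(1 for pair in pairs if pair in edges) / len(pairs)


def top_n_pseudo_cliques(similarity, weights, threshold, min_density, n, min_size=3):
    """Brute-force search for the n heaviest maximal density-based pseudo-cliques.

    Assumption: a pseudo-clique is any vertex set of at least `min_size`
    vertices whose edge density is at least `min_density`. Exponential in the
    number of pages, so suitable for toy inputs only.
    """
    edges = build_edges(similarity, threshold)
    vertices = range(len(weights))
    candidates = [
        nodes
        for size in range(min_size, len(weights) + 1)
        for nodes in combinations(vertices, size)
        if density(nodes, edges) >= min_density
    ]
    # Discard candidates that are strictly contained in another candidate.
    maximal = [
        c for c in candidates
        if not any(set(c) < set(other) for other in candidates)
    ]
    scored = [(sum(weights[v] for v in c), c) for c in maximal]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:n]


# Hypothetical similarities between six pages; missing pairs count as dissimilar.
similarity = {
    (0, 1): 0.9, (0, 2): 0.8, (1, 2): 0.85, (0, 3): 0.7, (2, 3): 0.75,
    (1, 3): 0.2, (3, 4): 0.65, (4, 5): 0.9, (3, 5): 0.7,
}
# Hypothetical integer weights; page 3 plays the role of a low-ranked page.
weights = [5, 2, 4, 1, 3, 3]

clusters = top_n_pseudo_cliques(similarity, weights, threshold=0.6,
                                min_density=0.8, n=2)
for weight, pages in clusters:
    print(weight, pages)
```

In this toy input, the low-weight page 3 appears in both returned clusters because it is similar to pages in each group, which mirrors the idea of surfacing hidden lower-ranked pages alongside higher-ranked ones. In the method described above, the similarities would come from the SVD-based term correlations rather than being given by hand.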