Finding significant web pages with lower ranks by pseudo-clique search

Authors:
Yoshiaki Okubo;Makoto Haraguchi;Bin Shi
Affiliations:
Division of Computer Science, Graduate School of Information Science and Technology, Hokkaido University, Sapporo, Japan;Division of Computer Science, Graduate School of Information Science and Technology, Hokkaido University, Sapporo, Japan;Division of Computer Science, Graduate School of Information Science and Technology, Hokkaido University, Sapporo, Japan
Venue:
DS'05 Proceedings of the 8th international conference on Discovery Science
Year:
2005

Abstract

In this paper, we discuss a method for finding useful clusters of web pages that are significant in the sense that their contents are similar or closely related to those of higher-ranked pages. Because users usually pay little attention to lower-ranked pages, such pages are discarded unconditionally, even when their contents are similar to those of some highly ranked pages. We try to extract such hidden pages together with significant higher-ranked pages as a cluster. To obtain such clusters, we first extract semantic correlations among terms by applying Singular Value Decomposition (SVD) to the term-document matrix generated from a corpus on a specific topic. Based on these correlations, we can evaluate potential similarities among the web pages from which we try to obtain clusters. The set of web pages is represented as a weighted graph G based on the similarities and the pages' ranks. Our clusters can be found as pseudo-cliques in G. We present an algorithm for finding Top-N weighted pseudo-cliques. Our experimental results show that valuable clusters can actually be extracted with our method.
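
The first steps described in the abstract (SVD on a term-document matrix, page similarities, and a graph over the pages) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `lsi_document_similarity` and `build_weighted_graph`, the use of cosine similarity in a rank-k latent space, the similarity threshold, and the choice to store each page's rank as a plain vertex attribute are all assumptions made for illustration. The Top-N pseudo-clique search itself is not reproduced here.

```python
import numpy as np


def lsi_document_similarity(term_doc, k):
    """Cosine similarity between documents in a rank-k latent space.

    term_doc: (n_terms, n_docs) array of term weights (assumed layout).
    k: number of singular values to keep (assumed parameter).
    """
    # Thin SVD: term_doc = U @ diag(s) @ Vt
    _, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Each row of docs_k is one document projected onto the top-k factors.
    docs_k = (np.diag(s[:k]) @ vt[:k, :]).T
    norms = np.linalg.norm(docs_k, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # avoid division by zero for empty documents
    unit = docs_k / norms
    return unit @ unit.T


def build_weighted_graph(similarity, ranks, threshold):
    """Build a graph whose vertices are pages and whose edges join similar pages.

    Assumption: an edge exists when similarity >= threshold, and each vertex
    simply carries its search-engine rank. The paper's actual weighting
    scheme is not given in the abstract.
    """
    n = similarity.shape[0]
    vertices = {i: ranks[i] for i in range(n)}
    edges = {}
    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i, j] >= threshold:
                edges[(i, j)] = float(similarity[i, j])
    return vertices, edges


if __name__ == "__main__":
    # Toy corpus: pages 0 and 1 share vocabulary, page 2 does not.
    toy = np.array([
        [1.0, 1.0, 0.0],
        [1.0, 1.0, 0.0],
        [0.0, 0.0, 1.0],
    ])
    sim = lsi_document_similarity(toy, k=2)
    # Hypothetical ranks: page 1 is ranked far below page 0.
    vertices, edges = build_weighted_graph(sim, ranks=[1, 50, 2], threshold=0.9)
    print(vertices)
    print(edges)
```

In this toy setting, a low-ranked page ends up adjacent to a high-ranked page whenever their latent-space similarity clears the threshold, which is the kind of connection a pseudo-clique search over G would then exploit.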