A Parallel Algorithm for Finding Related Pages in the Web by Using Segmented Link Structures

Authors:
Xiaoyan Shen;Junliang Chen;Xiangwu Meng;Yujie Zhang;Chuanchang Liu
Affiliations:
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China (all authors)
Venue:
PAKDD '09 Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
Year:
2009

Abstract

This paper proposes a simple but powerful algorithm, the block co-citation algorithm, which automatically finds related pages for a given web page by combining HTML segmentation with parallel hyperlink structure analysis. First, all hyperlinks in a web page are segmented into blocks according to the page's HTML structure and text style information. Second, for each page, the similarity between every pair of hyperlinks in the same block is computed from several kinds of information; the total similarity from one page to another is obtained once all web pages have been processed. For a given page u, the pages with the highest total similarity to u are selected as its related pages. Finally, the block co-citation algorithm is implemented in parallel and applied to a corpus of 37,482,913 pages sampled from a commercial search engine, demonstrating its feasibility and efficiency.
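
The steps in the abstract can be illustrated with a minimal, sequential Python sketch. It is not the authors' implementation. The abstract does not say which information feeds the pairwise similarity, so the hypothetical `pair_similarity` below simply counts each co-occurrence in a block as 1.0. The input layout (a dict from page id to a list of link blocks) and the names `block_cocitation` and `related_pages` are also assumptions made for illustration, and the segmentation step is taken as already done.

```python
from collections import defaultdict
from itertools import combinations


def pair_similarity(link_a, link_b):
    # Assumption: placeholder weight. The abstract only says similarity is
    # computed from "several" kinds of information, so each co-occurrence
    # of two links in one block counts as 1.0 here.
    return 1.0


def block_cocitation(pages):
    """Accumulate total similarity between link targets.

    pages: dict mapping a page id to a list of blocks, where each block
    is a list of target URLs found in that segment of the page.
    """
    total = defaultdict(lambda: defaultdict(float))
    for blocks in pages.values():
        for block in blocks:
            # Deduplicate links within a block and fix an order.
            links = sorted(set(block))
            for a, b in combinations(links, 2):
                s = pair_similarity(a, b)
                total[a][b] += s
                total[b][a] += s
    return total


def related_pages(total, u, k=3):
    # Rank candidates by descending total similarity; break ties by name.
    scores = total.get(u, {})
    return sorted(scores, key=lambda v: (-scores[v], v))[:k]


# Toy corpus: three pages whose hyperlinks are already segmented into blocks.
pages = {
    "p1": [["a", "b", "c"], ["d", "e"]],
    "p2": [["a", "b"], ["c", "d"]],
    "p3": [["a", "b", "d"]],
}
total = block_cocitation(pages)
print(related_pages(total, "a", k=2))
```

In this toy corpus, "a" and "b" share a block on all three pages, so "b" ranks first for "a". Because each page contributes its pair scores independently, the per-page loop is the natural part to distribute, although how the paper actually parallelizes the computation is not stated in the abstract.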