Research on new algorithm of topic-oriented crawler and duplicated web pages detection

  • Authors:
  • Yong-Heng Zhang; Feng Zhang

  • Affiliations:
  • School of Information Engineering, Yulin University, Yulin, China; School of Information Engineering, Yulin University, Yulin, China, and School of Automation, Northwestern Polytechnical University, Xi'an, China

  • Venue:
  • ICIC'12: Proceedings of the 8th International Conference on Intelligent Computing Theories and Applications
  • Year:
  • 2012

Abstract

To improve the retrieval efficiency and performance of large-scale information retrieval systems, this paper analyzes existing algorithms for Web search and duplicated Web page detection. These algorithms have drawbacks in precision and efficiency because they are general-purpose rather than specialized. Through an analysis of crawlers and duplicated pages, the paper addresses two issues of topic-oriented Web crawling and near-replica detection: how to define the topic, and how to eliminate duplicate pages. The crawler aims to visit only topic-oriented pages and collects a large number of hyperlinks that point to topic-oriented pages. The proposed crawling and Web page detection method is novel and is based on the semi-structured features of the website and on content information. Experimental results show that it performs better than existing algorithms proposed in the literature.
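
The abstract does not spell out how near-replicas are detected. As a rough illustration of the general problem only, the sketch below uses a standard technique, word shingling with Jaccard similarity, which is not claimed to be the authors' method. The function names, the shingle size `k`, and the `threshold` value are all assumptions made for this example.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in a page's text (k is an assumed default)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Too short for a full shingle: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(page_a, page_b, threshold=0.8):
    """Flag two pages as near-replicas when shingle overlap reaches the threshold.

    The threshold of 0.8 is an illustrative assumption, not a value from the paper.
    """
    return jaccard(shingles(page_a), shingles(page_b)) >= threshold
```

A crawler could apply such a check to each newly fetched page against pages already stored and discard those flagged as near-replicas. In practice the threshold would need tuning on real data.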