Research on new algorithm of topic-oriented crawler and duplicated web pages detection

Authors:
Yong-Heng Zhang;Feng Zhang
Affiliations:
School of Information Engineering, Yulin University, Yulin, China;School of Information Engineering, Yulin University, Yulin, China, School of Automation, Northwestern Polytechnical University, Xi'an, China
Venue:
ICIC'12 Proceedings of the 8th international conference on Intelligent Computing Theories and Applications
Year:
2012

Abstract

To improve the retrieval efficiency and performance of large-scale information retrieval systems, this paper analyzes existing algorithms for Web search and duplicated Web page detection. These algorithms have drawbacks in precision and efficiency because they are general-purpose rather than specialized. Building on an analysis of crawlers and duplicated pages, the paper addresses two issues in topic-oriented Web crawling and near-replica detection: how to define the topic, and how to eliminate duplicate pages. The crawler aims to visit only topic-oriented pages and to collect a large number of hyperlinks that point to such pages. The proposed crawling and duplicate-detection method is novel and is based on the semi-structured features of websites and on content information. Experimental results show that it performs better than the existing algorithms proposed in the literature.
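The abstract does not describe how duplicate pages are detected, so the following is only a generic illustration of near-duplicate detection, not the authors' method. It is a minimal sketch that compares two pages by the Jaccard similarity of their word shingles. The function names, the shingle length `k`, and the `threshold` value are all assumptions chosen for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (as tuples) found in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Text shorter than k words: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, k=3, threshold=0.8):
    """Flag two page texts as near-duplicates (threshold is an assumed value)."""
    return jaccard(shingles(text_a, k), shingles(text_b, k)) >= threshold
```

In a crawler, a check of this kind could be run on the extracted text of each fetched page against pages already stored, although pairwise comparison does not scale to large collections without some form of hashing or indexing.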