To improve the retrieval efficiency and performance of large-scale information retrieval systems, we analyzed existing algorithms for Web search and duplicate Web page detection. These algorithms suffer drawbacks in precision and efficiency because they are general-purpose rather than topic-specific. In this paper, through an analysis of crawling and duplicated pages, we address two issues in topic-oriented Web crawling and near-replica detection: first, how to define the topic; second, how to eliminate duplicate pages. The crawler aims to visit only topic-relevant pages and collects a large set of hyperlinks that point to such pages. The proposed crawling and duplicate-detection method is novel in that it exploits both the semi-structured features of the website and its content information. Experimental results show that it outperforms the existing algorithms proposed in the literature.
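The abstract does not specify the authors' near-replica detection algorithm. As an illustrative sketch only, a standard content-based check of the kind such systems build on can be written with word shingles and Jaccard similarity; the function names, shingle size `k`, and `threshold` below are assumptions, not the paper's method:

```python
import re

def shingles(text, k=4):
    """Split text into a set of lowercase word k-grams (shingles)."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets (1.0 for two empty sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_near_duplicate(page1, page2, k=4, threshold=0.9):
    """Flag two pages as near-replicas when their shingle overlap is high.

    The threshold is a tunable assumption; real systems typically hash
    shingles (e.g. MinHash/SimHash) to scale to large collections.
    """
    return jaccard(shingles(page1, k), shingles(page2, k)) >= threshold
```

A full system would combine a content signal like this with the site's semi-structured features (URL patterns, page templates), which is the direction the abstract indicates.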