Parallel crawler architecture and web page change detection

Authors:
Divakar Yadav; A. K. Sharma; J. P. Gupta
Affiliations:
Computer Science & Information Technology, Jaypee Institute of Information Technology University, Noida, India (all authors)
Venue:
WSEAS Transactions on Computers
Year:
2008

Abstract

In this paper, we put forward a technique for parallel crawling of the web. The World Wide Web is growing at a phenomenal rate. It has enabled a publishing explosion of useful online information, with the unfortunate side effect of information overload. As of February 2007, the size of the web stood at around 29 billion pages. One of the most important uses of crawling is to index web pages and keep them up to date, so that search engines can later serve end-user queries. The paper puts forward an architecture built along the lines of a client-server architecture. It discusses a fresh approach to crawling the web in parallel using multiple machines and also addresses the basic issues of crawling. A major part of the web is dynamic, and hence a need arises to constantly update changed web pages. We use a three-step algorithm for page refreshment, which checks whether the structure of a web page has changed, whether its text content has been altered, and whether an image has changed. For the server, we discuss a unique method for distributing URLs to client machines after determining their priority index. We also discuss a minor variation on the method of prioritizing URLs by forward link count, intended to account for frequency of update.
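
The three-step check described in the abstract (structure, then text, then images) can be illustrated with a minimal sketch. The abstract does not give the algorithm's details, so everything below is an assumption made for illustration: the names page_signature and detect_change are hypothetical, the structure is approximated by the sequence of opening tags, and images are compared by their src attribute rather than by image content.

```python
import hashlib
from html.parser import HTMLParser


class _PageSignature(HTMLParser):
    """Collects the tag sequence, text fragments, and image sources of a page."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "img":
            # Assumption: an image is identified by its src attribute only.
            self.images.append(dict(attrs).get("src") or "")

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text.append(stripped)


def _digest(parts):
    """Hash a list of strings into one fixed-size fingerprint."""
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


def page_signature(html):
    """Return one digest per step: structure, text, images (hypothetical helper)."""
    parser = _PageSignature()
    parser.feed(html)
    parser.close()
    return {
        "structure": _digest(parser.tags),
        "text": _digest(parser.text),
        "images": _digest(parser.images),
    }


def detect_change(old_html, new_html):
    """Run the three checks in order; return the first step that differs, else None."""
    old, new = page_signature(old_html), page_signature(new_html)
    for step in ("structure", "text", "images"):
        if old[step] != new[step]:
            return step
    return None
```

In this sketch a crawler would store only the three digests per page and refresh a page when detect_change reports a difference. Comparing digests instead of full pages is a design choice of the sketch, not something stated in the abstract.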