The traditional crawlers used by search engines to build their collections of Web pages frequently gather unmodified pages that already exist in those collections. This creates unnecessary Internet traffic and wastes search engine resources during page collection and indexing. Traditional crawlers are also generally unable to collect dynamic pages, causing them to miss valuable information, and they cannot easily detect deleted pages, leaving search engine collections outdated. To address these issues, we propose a new Web services paradigm for co-operative Website/crawler interaction that exploits the information present in the Web logs and the file system. Our system supports a querying mechanism in which the crawler issues queries to the Web service on the Website and then collects pages based on the information returned in response. We present experimental results demonstrating that, compared with traditional crawlers, this approach saves bandwidth, produces more complete Web page collections, and keeps collections informed of deleted pages. We also experimentally compare the relative merits of using only Web logs, only file system information, and combinations of the two as information sources for the Web service.
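The query mechanism described above can be sketched in a few lines. The sketch below is purely illustrative: the function names, data layouts, and timestamps are assumptions, not the paper's actual interface. It models the Website-side Web service as a function that consults a file-system snapshot to report pages modified since the crawler's last visit, and the crawler side as a routine that re-fetches only those pages and purges pages no longer present on the server.

```python
# Hypothetical file-system snapshot on the Website:
# path -> last-modified timestamp (Unix seconds). Illustrative values only.
file_system = {
    "/index.html": 1000,
    "/news.html": 1500,
    "/old.html": 800,
}

# The crawler's existing collection from a previous crawl:
# path -> timestamp at which that copy was taken.
crawler_collection = {
    "/index.html": 1200,
    "/news.html": 1200,
    "/gone.html": 1200,  # page since deleted from the file system
}

def changed_pages(since: int) -> dict:
    """Website-side Web service (hypothetical): answer a crawler query by
    reporting pages modified after `since` and the full set of live paths."""
    modified = [path for path, mtime in file_system.items() if mtime > since]
    return {"modified": sorted(modified), "all_paths": sorted(file_system)}

def incremental_crawl(collection: dict, since: int):
    """Crawler side: query the service, then fetch only modified pages
    and identify pages deleted from the Website."""
    answer = changed_pages(since)
    to_fetch = answer["modified"]
    deleted = [path for path in collection if path not in answer["all_paths"]]
    return to_fetch, deleted

to_fetch, deleted = incremental_crawl(crawler_collection, since=1200)
print("re-collect:", to_fetch)
print("purge:", deleted)
```

In this toy run, only `/news.html` was modified after the last crawl, so it alone is re-fetched, and `/gone.html` is flagged for removal; unchanged pages generate no traffic at all, which is the bandwidth saving the abstract claims over traditional full recrawls.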