A pattern-based selective recrawling approach for object-level vertical search

Authors:
Yaqian Zhou;Qi Zhang;Xuanjing Huang;Lide Wu
Affiliations:
Fudan University, Shanghai, China;Fudan University, Shanghai, China;Fudan University, Shanghai, China;Fudan University, Shanghai, China
Venue:
Proceedings of the 22nd ACM International Conference on Information & Knowledge Management
Year:
2013

Abstract

Traditional recrawling methods learn navigation patterns in order to crawl related web pages. However, they cannot remove the redundancy found on the web, especially at the object level. To address this problem, we propose a new hypertext resource discovery method, called "selective recrawling", for object-level vertical search applications. The goal of selective recrawling is to automatically generate URL patterns and then select the pages that have the widest coverage and the least irrelevance and redundancy relative to a predefined vertical domain. The method requires only a few seed objects and selects the set of URL patterns that covers the greatest number of objects. The selected set can be reused for some time to recrawl web pages and can be renewed periodically, which leads to significant savings in hardware and network resources. In this paper we present a detailed framework of selective recrawling for object-level vertical search. The method first automatically extends the set of candidate websites from the initial seed objects. Based on the objects extracted from these websites, it then learns a set of URL patterns that covers the greatest number of target objects with little redundancy. Finally, the navigation patterns generated from the selected URL pattern set are used to guide future crawling. Experiments on local event data show that our method can greatly reduce the number of web pages downloaded while maintaining comparable object coverage.
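
The pattern-selection step described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not specify; it is a generic greedy coverage heuristic under stated assumptions. The function names (`url_to_pattern`, `group_by_pattern`, `select_patterns`), the rule of replacing digit runs with `*`, and the "new objects per downloaded page" score are all hypothetical choices made for illustration.

```python
import re
from collections import defaultdict


def url_to_pattern(url):
    """Generalize a URL by replacing each run of digits with '*'.

    Assumption: this crude rule stands in for a learned URL pattern.
    """
    return re.sub(r"\d+", "*", url)


def group_by_pattern(pages):
    """Map each URL pattern to (page_count, set of object ids found on its pages).

    `pages` is an iterable of (url, object_ids) pairs.
    """
    stats = defaultdict(lambda: [0, set()])
    for url, objects in pages:
        entry = stats[url_to_pattern(url)]
        entry[0] += 1
        entry[1] |= set(objects)
    return {pattern: (count, objs) for pattern, (count, objs) in stats.items()}


def select_patterns(stats, target_coverage=1.0):
    """Greedily pick patterns that add the most new objects per page downloaded.

    Stops once `target_coverage` (a fraction of all known objects) is reached
    or no remaining pattern contributes a new object.
    """
    all_objects = set().union(*(objs for _, objs in stats.values()))
    needed = target_coverage * len(all_objects)
    covered, chosen = set(), []
    remaining = dict(stats)
    while len(covered) < needed and remaining:
        best = max(
            remaining,
            key=lambda p: len(remaining[p][1] - covered) / remaining[p][0],
        )
        _, objs = remaining.pop(best)
        if not objs - covered:
            break  # every remaining pattern is fully redundant
        chosen.append(best)
        covered |= objs
    return chosen, covered


if __name__ == "__main__":
    # Hypothetical crawl sample: three event objects seen on several page types.
    sample = [
        ("http://a.example/event/1", {"e1"}),
        ("http://a.example/event/2", {"e2"}),
        ("http://a.example/event/3", {"e3"}),
        ("http://a.example/list?page=1", {"e1", "e2"}),
        ("http://a.example/list?page=2", {"e3"}),
        ("http://a.example/tag/7", {"e1"}),
        ("http://a.example/tag/8", {"e1"}),
    ]
    chosen, covered = select_patterns(group_by_pattern(sample))
    print(chosen, sorted(covered))
```

In this toy sample the listing pattern covers all three objects with two pages, so the detail and tag patterns are treated as redundant and never selected. A real system would also need a relevance filter for the vertical domain, which this sketch omits.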