Query Selection Techniques for Efficient Crawling of Structured Web Sources

Authors:
Ping Wu;Ji-Rong Wen;Huan Liu;Wei-Ying Ma
Affiliations:
University of California, Santa Barbara;Microsoft Research, Asia;Arizona State University;Microsoft Research, Asia
Venue:
ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Year:
2006



Abstract

High-quality structured data from structured Web sources is invaluable for many applications. Hidden Web databases cannot be crawled directly by Web search engines; they are accessible only through Web query forms or Web service interfaces. Recent research has focused on understanding these query forms. A critical but still largely unresolved question is how to efficiently acquire the structured information inside Web databases by iteratively issuing meaningful queries. In this paper we focus on the central issue of enabling efficient Web database crawling through query selection, i.e., how to select good queries that rapidly harvest data records from Web databases. We model each structured Web database as a distinct attribute-value graph. Under this theoretical framework, the database crawling problem is transformed into a graph traversal problem that follows "relational" links. We show that finding an optimal query selection plan is equivalent to finding a Minimum Weighted Dominating Set of the corresponding database graph, a well-known NP-Complete problem. We propose a suite of query selection techniques that aim to optimize the query harvest rate. Extensive experimental evaluations over real Web sources and simulations over controlled database servers validate the effectiveness of our techniques and provide insights for future efforts in this
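
The graph model and the dominating-set formulation described in the abstract can be illustrated with a small sketch. It is not the authors' algorithm: it uses the standard greedy heuristic for weighted dominating set, and the function names, the toy records, and the unit query cost are all assumptions made for illustration. Each distinct (attribute, value) pair is treated as a vertex, and two vertices are linked when they co-occur in a record, so querying one value reveals its neighbours.

```python
from collections import defaultdict


def build_attribute_value_graph(records):
    """Build an undirected graph whose vertices are (attribute, value) pairs.

    Two vertices are adjacent when they appear together in at least one record.
    """
    graph = defaultdict(set)
    for record in records:
        nodes = list(record.items())
        for node in nodes:
            graph[node].update(n for n in nodes if n != node)
    return dict(graph)


def greedy_weighted_dominating_set(graph, weight):
    """Greedy heuristic (an approximation, not an optimal solver).

    Repeatedly picks the vertex that newly dominates the most vertices
    per unit of weight. `weight` must return a positive number.
    """
    undominated = set(graph)
    chosen = []
    while undominated:
        def gain(v):
            return len(({v} | graph[v]) & undominated) / weight(v)

        # Iterate in sorted order so ties are broken deterministically.
        best = max(sorted(graph), key=gain)
        chosen.append(best)
        undominated -= {best} | graph[best]
    return chosen


# Toy database (hypothetical data): three records over two attributes.
records = [
    {"title": "A", "author": "X"},
    {"title": "B", "author": "X"},
    {"title": "C", "author": "Y"},
]
graph = build_attribute_value_graph(records)

# Assumption: every query costs the same; a real crawler would likely
# weight a query by the cost of retrieving its result pages.
plan = greedy_weighted_dominating_set(graph, weight=lambda v: 1.0)
print(plan)
```

In this toy case the two author values together reach every title, so the selected queries cover the whole graph. With non-uniform weights the same loop would favour queries that reveal many new values at low cost.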