E-FFC: an enhanced form-focused crawler for domain-specific deep web databases

Authors:
Yanni Li;Yuping Wang;Jintao Du
Affiliations:
School of Computer Science and Technology, Xidian University, Xi'an, People's Republic of China 710071;School of Computer Science and Technology, Xidian University, Xi'an, People's Republic of China 710071;School of Software, Xidian University, Xi'an, People's Republic of China 710071
Venue:
Journal of Intelligent Information Systems
Year:
2013

Abstract

A key problem in retrieving, integrating, and mining rich, high-quality information from the massive number of online Deep Web Databases (WDBs) is how to automatically and effectively discover and recognize the entry points of domain-specific WDBs, i.e., forms, on the Web. The task is challenging because the forms of domain-specific WDBs are dynamic and heterogeneous, and they are very sparsely distributed over several trillion Web pages. Although significant efforts have been made to address the problem and its special cases, more effective solutions are still needed that achieve a satisfactory harvest rate and a satisfactory coverage rate of domain-specific WDB forms simultaneously. This paper proposes an Enhanced Form-Focused Crawler for domain-specific WDBs (E-FFC), a novel framework that addresses the limitations of existing solutions. Based on a divide-and-conquer strategy, the E-FFC employs a series of novel and effective strategies and algorithms, including a two-step page classifier, a link scoring strategy, classifiers for advanced searchable and domain-specific forms, and crawling stopping criteria, among others, in order to optimize the harvest rate and coverage rate of domain-specific WDB forms simultaneously. Experiments with the E-FFC on a number of real Web pages in a set of representative domains show that it outperforms existing domain-specific Deep Web form-focused crawlers in terms of harvest rate, coverage rate, and crawling robustness.
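
The abstract names harvest rate and coverage rate as the two quantities to be optimized but does not define them. The sketch below assumes the definitions commonly used in the focused-crawling literature: harvest rate as relevant forms found per page crawled, and coverage rate as relevant forms found out of all relevant forms known to exist in the test set. These definitions, and the function and parameter names, are assumptions for illustration and are not taken from the paper.

```python
def harvest_rate(relevant_forms_found: int, pages_crawled: int) -> float:
    """Assumed definition: relevant forms found per page crawled."""
    if pages_crawled <= 0:
        return 0.0  # nothing crawled yet, so no yield to report
    return relevant_forms_found / pages_crawled


def coverage_rate(relevant_forms_found: int, total_relevant_forms: int) -> float:
    """Assumed definition: fraction of all known relevant forms that were found."""
    if total_relevant_forms <= 0:
        return 0.0  # no ground-truth forms to cover
    return relevant_forms_found / total_relevant_forms


if __name__ == "__main__":
    # Hypothetical numbers, not results from the paper.
    found, crawled, known = 25, 1000, 100
    print(f"harvest rate:  {harvest_rate(found, crawled):.3f}")
    print(f"coverage rate: {coverage_rate(found, known):.3f}")
```

Under these assumed definitions the two metrics pull in different directions: crawling more pages can raise coverage while lowering harvest rate, which is one way to read why the abstract stresses achieving both simultaneously.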