A key problem in retrieving, integrating, and mining rich, high-quality information from the massive number of online Deep Web databases (WDBs) is how to automatically and effectively discover and recognize the entry points of domain-specific WDBs, i.e., forms, on the Web. The task is challenging because domain-specific WDB forms are dynamic and heterogeneous and are very sparsely distributed over several trillion Web pages. Although significant efforts have been made to address the problem and its special cases, more effective solutions are still needed to achieve a satisfactory harvest rate and coverage rate of domain-specific WDB forms simultaneously. This paper proposes an Enhanced Form-Focused Crawler for domain-specific WDBs (E-FFC), a novel framework that addresses the limitations of existing solutions. Based on a divide-and-conquer strategy, the E-FFC employs a series of novel strategies and algorithms, including a two-step page classifier, a link scoring strategy, classifiers for advanced searchable and domain-specific forms, and crawling stopping criteria, in order to optimize the harvest rate and coverage rate of domain-specific WDB forms simultaneously. Experiments with the E-FFC on a number of real Web pages in a set of representative domains show that it outperforms existing domain-specific Deep Web form-focused crawlers in terms of harvest rate, coverage rate, and crawling robustness.
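The abstract names harvest rate and coverage rate as the two quantities to be optimized but does not define them. The following is a minimal sketch, assuming the definitions commonly used in the focused-crawling literature: harvest rate as relevant forms found per page crawled, and coverage rate as the fraction of all known relevant forms that were found. The function and parameter names are hypothetical and do not come from the paper, and the paper's own definitions may differ.

```python
def harvest_rate(relevant_forms_found: int, pages_crawled: int) -> float:
    """Assumed definition: relevant forms found per page crawled."""
    if pages_crawled <= 0:
        return 0.0  # nothing crawled yet, so no meaningful rate
    return relevant_forms_found / pages_crawled


def coverage_rate(relevant_forms_found: int, total_relevant_forms: int) -> float:
    """Assumed definition: fraction of all known relevant forms that were found."""
    if total_relevant_forms <= 0:
        return 0.0  # no ground-truth forms to cover
    return relevant_forms_found / total_relevant_forms


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    print(harvest_rate(50, 1000))
    print(coverage_rate(50, 200))
```

Under these assumed definitions the two metrics pull in different directions: crawling more pages tends to raise coverage while lowering the harvest rate, which is why achieving both simultaneously is the stated goal.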