Crawling the deep web often requires selecting a set of queries that covers most of the documents in the data source at low cost. This can be modeled as a set covering problem, which has been studied extensively. Conventional set covering algorithms, however, do not work well for deep web crawling because of special features of this application domain. Most set covering algorithms assume that the elements being covered are uniformly distributed, whereas in deep web crawling neither the document sizes nor the document frequencies of the queries are uniformly distributed. Instead, they follow a power-law distribution. We have therefore developed a new set covering algorithm that targets deep web crawling. Compared with our previous deep web crawling method, which uses a straightforward greedy set covering algorithm, the new method introduces weights into the greedy strategy. Experiments on a variety of corpora show that the new method consistently outperforms its unweighted version.
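The abstract does not say how the weights are defined, so the following is only a minimal sketch of a weighted greedy set cover under stated assumptions, not the authors' algorithm. It assumes that each document is weighted by the inverse of the number of candidate queries that match it, and that the cost of a query is the number of documents it returns. The function name `greedy_cover` and the toy query-to-document mapping are hypothetical.

```python
from typing import Dict, List, Set


def greedy_cover(queries: Dict[str, Set[int]], weighted: bool = True) -> List[str]:
    """Greedily choose queries until every reachable document is covered."""
    # All documents that at least one candidate query can retrieve.
    universe: Set[int] = set()
    for docs in queries.values():
        universe |= docs

    # Assumed weighting (not taken from the source): a document matched by
    # few queries is harder to reach, so it counts for more.
    degree = {d: sum(d in docs for docs in queries.values()) for d in universe}
    weight = {d: (1.0 / degree[d] if weighted else 1.0) for d in universe}

    covered: Set[int] = set()
    chosen: List[str] = []

    def score(q: str) -> float:
        """Weight of newly covered documents per document returned."""
        docs = queries[q]
        if not docs:
            return 0.0
        gain = sum(weight[d] for d in docs - covered)
        return gain / len(docs)  # assumed cost: every returned document

    while covered != universe:
        # Sorting makes tie-breaking deterministic (first name wins).
        best = max(sorted(queries), key=score)
        chosen.append(best)
        covered |= queries[best]
    return chosen


if __name__ == "__main__":
    sample = {
        "web": {1, 2, 3, 4},
        "deep": {3, 4, 5},
        "crawl": {5, 6},
        "rare": {6},
    }
    print(greedy_cover(sample))
    print(greedy_cover(sample, weighted=False))
```

Passing `weighted=False` gives every document weight 1, which reduces the sketch to a plain cost-aware greedy cover and makes it easy to compare the two strategies on the same input.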