Deep-web crawling is concerned with the problem of surfacing hidden content behind search interfaces on the Web. While many deep-web sites maintain document-oriented textual content (e.g., Wikipedia, PubMed, Twitter), which has traditionally been the focus of the deep-web literature, we observe that a significant portion of deep-web sites, including almost all online shopping sites, curate structured entities as opposed to text documents. Although crawling such entity-oriented content is clearly useful for a variety of purposes, existing crawling techniques optimized for document-oriented content are not well suited to entity-oriented sites. In this work, we describe a prototype system we have built that specializes in crawling entity-oriented deep-web sites. We propose techniques tailored to important subproblems, including query generation, empty-page filtering, and URL deduplication, in the specific context of entity-oriented deep-web sites. These techniques are experimentally evaluated and shown to be effective.