Attribute domain discovery for hidden web databases

Authors:
Xin Jin; Nan Zhang; Gautam Das
Affiliations:
George Washington University, Washington, DC, USA; George Washington University, Washington, DC, USA; University of Texas at Arlington, Arlington, TX, USA
Venue:
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Year:
2011

Abstract

Many web databases are hidden behind restrictive form-like interfaces which may or may not provide domain information for an attribute. When attribute domains are not available, domain discovery becomes a critical challenge facing the application of a broad range of existing techniques on third-party analytical and mash-up applications over hidden databases. In this paper, we consider the problem of domain discovery over a hidden database through its web interface. We prove that for any database schema, an achievability guarantee on domain discovery can be made based solely upon the interface design. We also develop novel techniques which provide effective guarantees on the comprehensiveness of domain discovery. We present theoretical analysis and extensive experiments to illustrate the effectiveness of our approach.