HDSampler: revealing data behind web form interfaces

Authors:
Anirban Maiti;Arjun Dasgupta;Nan Zhang;Gautam Das
Affiliations:
University of Texas at Arlington, Arlington, TX, USA;University of Texas at Arlington, Arlington, TX, USA;George Washington University, Washington, DC, USA;University of Texas at Arlington, Arlington, TX, USA
Venue:
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Year:
2009

Abstract

A large number of online databases are hidden behind web form interfaces. Users of these systems can form queries through web forms to retrieve a small sample of the database. Sampling such hidden databases is widely desired for understanding the nature and quality of the data stored in them. We have developed HDSampler, which to the best of our knowledge is the first practical system for sampling structured hidden web databases. It enables efficient sampling of these databases and accurate answering of aggregate queries, providing analysts with valuable information for data analytics and helping to power third-party applications such as web mashups and meta-search engines. For the purpose of this demo, we present an instance of HDSampler on Google Base, a content-rich hidden web database maintained by Google. Using HDSampler, the demo reveals a snapshot of the marginal distributions of various attributes of Google Base in a matter of minutes.
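
To make the idea of sampling through a form interface concrete, the following is a minimal sketch and not HDSampler's actual implementation. It assumes a simulated interface that returns at most k matching rows plus an overflow flag, and it uses a simple random drill-down: attribute=value predicates are added at random until the query no longer overflows, and one returned row is then picked. All names (`make_interface`, `drill_down`, `estimate_marginal`, `demo`) and the toy car data are hypothetical. The sketch omits any correction for sampling bias, so the resulting marginal is only a rough estimate in general.

```python
import itertools
import random
from collections import Counter


def make_interface(rows, k):
    """Simulate a web form: return at most k matches and an overflow flag."""
    def query(predicates):
        matches = [r for r in rows
                   if all(r[a] == v for a, v in predicates.items())]
        return matches[:k], len(matches) > k
    return query


def drill_down(query, domains, rng):
    """One random drill-down; returns a row, or None if the walk fails."""
    predicates = {}
    results, overflow = query(predicates)
    for attr, values in domains.items():
        if not overflow:
            break
        # Narrow the query with a randomly chosen value for the next attribute.
        predicates[attr] = rng.choice(values)
        results, overflow = query(predicates)
    if overflow or not results:
        return None  # still too many matches, or no match at all
    return rng.choice(results)


def estimate_marginal(samples, attr):
    """Estimate the marginal distribution of one attribute from samples."""
    counts = Counter(s[attr] for s in samples)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def demo(seed=0, n_samples=200):
    """Build a toy hidden database (hypothetical data) and sample it."""
    rng = random.Random(seed)
    domains = {
        "make": ["ford", "honda", "toyota"],
        "color": ["red", "blue", "black"],
        "body": ["sedan", "suv", "truck"],
        "fuel": ["gas", "diesel", "hybrid"],
    }
    all_rows = [dict(zip(domains, combo))
                for combo in itertools.product(*domains.values())]
    rows = rng.sample(all_rows, 60)      # the "hidden" database
    query = make_interface(rows, k=5)    # only access path to the data
    samples = []
    while len(samples) < n_samples:
        row = drill_down(query, domains, rng)
        if row is not None:
            samples.append(row)
    return rows, samples, estimate_marginal(samples, "make")


if __name__ == "__main__":
    _, _, marginal = demo()
    for value, share in sorted(marginal.items()):
        print(f"{value}: {share:.2f}")
```

The sampler touches the data only through `query`, mirroring the constraint that a hidden database is reachable only via its form. A practical sampler would also need to correct for the skew that random drill-downs introduce, which this sketch leaves out.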