PruSM: a prudent schema matching approach for web forms

Authors:
Thanh Hoang Nguyen;Hoa Nguyen;Juliana Freire
Affiliations:
Unversity of Utah, Salt Lake City, UT, USA;Unversity of Utah, Salt Lake City, UT, USA;Unversity of Utah, Salt Lake City, UT, USA
Venue:
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Year:
2010

Citing 14
Cited 2

GlOSS: text-source discovery over the Internet

ACM Transactions on Database Systems (TODS)
Reconciling schemas of disparate data sources: a machine-learning approach

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Modern Information Retrieval

Modern Information Retrieval
A survey of approaches to automatic schema matching

The VLDB Journal — The International Journal on Very Large Data Bases
An interactive clustering-based approach to integrating source query interfaces on the deep Web

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Automatic complex schema matching across Web query interfaces: A correlation mining approach

ACM Transactions on Database Systems (TODS)
An adaptive crawler for locating hidden-Web entry points

Proceedings of the 16th international conference on World Wide Web
Wise-integrator: an automatic integrator of web search interfaces for E-commerce

VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
Instance-based schema matching for web databases by domain-specific query probing

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Bootstrapping pay-as-you-go data integration systems

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Learning to extract form labels

Proceedings of the VLDB Endowment
Google's Deep Web crawl

Proceedings of the VLDB Endowment
A survey of schema-based matching approaches

Journal on Data Semantics IV
Managing uncertainty in schema matching with top-k schema mappings

Journal on Data Semantics VI

Multilingual schema matching for Wikipedia infoboxes

Proceedings of the VLDB Endowment
The ontological key: automatically understanding and integrating forms to access the deep Web

The VLDB Journal — The International Journal on Very Large Data Bases

Quantified Score

Hi-index	0.00

Visualization

Abstract

There has been a substantial increase in the number of Web data sources whose contents are hidden and can only be accessed through form interfaces. To leverage this data, several applications have emerged that aim to automate and simplify the access to these data sources, from hidden-Web crawlers and meta-searchers to Web information integration systems. A requirement shared by these applications is the ability to understand these forms, so that they can automatically fill them out. In this paper, we address a key problem in form understanding: how to match elements across distinct forms. Although this problem has been studied in the literature, existing approaches have important limitations. Notably, they only handle small form collections and assume that form elements are clean and normalized, often through manual pre-processing. When a large number of forms is automatically gathered, matching form schemata presents new challenges: data heterogeneity is compounded with the Web-scale and noise introduced by automated processes. We propose PruSM, a prudent schema matching strategy the determines matches for form elements in a prudent fashion, with the goal of minimizing error propagation. A experimental evaluation of PruSM using widely available data sets shows that the approach effective and able to accurately match a large number of form schemata and without requiring any manual pre-processing.