Efficient filtering and ranking schemes for finding inclusion dependencies on the web

Authors:
Atsuyuki Morishima;Erika Yumiya;Masami Takahashi;Shigeo Sugimoto;Hiroyuki Kitagawa
Affiliations:
University of Tsukuba, Tsukuba, Japan;University of Tsukuba, Tsukuba, Japan;University of Tsukuba, Tsukuba, Japan;University of Tsukuba, Tsukuba, Japan;University of Tsukuba, Tsukuba, Japan
Venue:
Proceedings of the 22nd ACM International Conference on Information & Knowledge Management
Year:
2013

Abstract

Data integrity constraints are fundamental in various applications, such as data management, integration, cleaning, and schema extraction. In this paper, we address the problem of finding inclusion dependencies on the Web. The problem is important because (1) applications of inclusion dependencies, such as data quality management, are beneficial in the Web context, and (2) such dependencies are not explicitly given in general. In our approach, we enumerate pairs of HTML/XML elements that possibly represent inclusion dependencies and then rank the results for verification. First, we propose a bit-based signature scheme to efficiently select candidates (element pairs) in the enumeration process. The signature scheme is unique in that it supports Jaccard containment to deal with the incomplete nature of data on the Web, and preserves the semiorder inclusion relationship among sets of words. Second, we propose a ranking scheme to support a user in checking whether each enumerated pair actually suggests inclusion dependencies. The ranking scheme sorts the enumerated pairs so that we can examine a small number of pairs for simultaneously verifying many pairs.
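
To make the two ideas in the abstract concrete, here is a minimal sketch of a bit signature filter for word sets, followed by a Jaccard containment check. It is not the scheme proposed in the paper: it uses classic superimposed coding, where each word sets one bit, and it takes Jaccard containment to be |A ∩ B| / |A|. The names `SIG_BITS`, `word_bit`, `signature`, `may_be_subset`, and `jaccard_containment` are hypothetical and chosen only for illustration.

```python
import hashlib

SIG_BITS = 64  # hypothetical signature width, not taken from the paper


def word_bit(word: str) -> int:
    """Map a word to a bit position using a stable hash."""
    digest = hashlib.sha1(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % SIG_BITS


def signature(words) -> int:
    """OR together one bit per word (superimposed coding)."""
    sig = 0
    for w in words:
        sig |= 1 << word_bit(w)
    return sig


def may_be_subset(sig_a: int, sig_b: int) -> bool:
    """Necessary condition for A being a subset of B.

    If A is a subset of B, every bit of sig(A) is also set in sig(B).
    The converse can fail (false positives), so surviving pairs still
    need to be verified on the actual word sets.
    """
    return (sig_a & ~sig_b) == 0


def jaccard_containment(a: set, b: set) -> float:
    """Fraction of A's words that also occur in B (assumed definition)."""
    if not a:
        return 1.0
    return len(a & b) / len(a)


child = {"tokyo", "osaka"}
parent = {"tokyo", "osaka", "kyoto"}
noisy = {"tokyo", "nara"}

# Exact inclusion always passes the signature filter.
assert may_be_subset(signature(child), signature(parent))
print(jaccard_containment(child, parent))  # 1.0
print(jaccard_containment(noisy, parent))  # 0.5
```

The filter only preserves the inclusion order for exact subsets. Tolerating incomplete data, as the abstract describes, requires a signature test that admits pairs whose containment is below 1, which this sketch does not attempt.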