Sampling dirty data for matching attributes

Authors:
Henning Köhler;Xiaofang Zhou;Shazia Sadiq;Yanfeng Shu;Kerry Taylor
Affiliations:
The University of Queensland, Brisbane, Australia;The University of Queensland and NICTA, Brisbane, Australia;The University of Queensland, Brisbane, Australia;CSIRO - Tasmanian ICT Centre, Hobart, Australia;CSIRO - ICT Centre, Canberra, Australia
Venue:
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Year:
2010

Abstract

We investigate the problem of creating and analyzing samples of relational databases to find relationships between string-valued attributes. Our focus is on identifying attribute pairs whose value sets overlap, a pre-condition for typical joins over such attributes. However, real-world data sets are often 'dirty', especially when integrating data from different sources. To deal with this issue, we propose new similarity measures between sets of strings, which consider not only set-based similarity but also similarity between individual string instances. To make the measures effective, we develop efficient algorithms for distributed sample creation and similarity computation. Test results show that for dirty data our measures are more accurate for measuring value overlap than existing sample-based methods, but we also observe that there is a clear tradeoff between accuracy and speed. This motivates a two-stage filtering approach, with both measures operating on the same samples.
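
To illustrate why exact set overlap can fail on dirty data, the following is a minimal sketch that compares an exact overlap measure with a tolerant one. It is not the measure defined in the paper: the function names, the use of padded 3-grams with Jaccard similarity between strings, and the 0.5 threshold are all assumptions chosen for illustration.

```python
def qgrams(s, q=3):
    """Return the set of padded, lowercased q-grams of s (illustrative choice)."""
    padded = "#" * (q - 1) + s.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def string_sim(a, b, q=3):
    """Jaccard similarity between the q-gram sets of two strings."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb)


def exact_overlap(xs, ys):
    """Fraction of the smaller value set that appears verbatim in the other."""
    xs, ys = set(xs), set(ys)
    if not xs or not ys:
        return 0.0
    return len(xs & ys) / min(len(xs), len(ys))


def fuzzy_overlap(xs, ys, threshold=0.5):
    """Fraction of the smaller value set that has at least one
    sufficiently similar string in the other set.
    The threshold is a hypothetical tuning parameter."""
    xs, ys = set(xs), set(ys)
    if not xs or not ys:
        return 0.0
    small, large = (xs, ys) if len(xs) <= len(ys) else (ys, xs)
    matched = sum(
        1 for x in small
        if any(string_sim(x, y) >= threshold for y in large)
    )
    return matched / len(small)


# Two hypothetical attribute samples; the second contains typos and case changes.
clean = ["Brisbane", "Hobart", "Canberra"]
dirty = ["brisbane", "Hobartt", "Canbera", "Sydney"]

print(exact_overlap(clean, dirty))
print(fuzzy_overlap(clean, dirty))
```

In this sketch the exact measure finds no shared values between the two samples, while the tolerant measure still matches every value of the smaller sample. The tolerant measure compares every pair of strings, so it is noticeably more expensive, which is one simple way to see the accuracy versus speed tradeoff the abstract mentions.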