Randomized algorithms for data reconciliation in wide area aggregate query processing

Authors:
Fei Xu;Christopher Jermaine
Affiliations:
University of Florida, Gainesville, FL;University of Florida, Gainesville, FL
Venue:
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Year:
2007

Citing 24
Cited 0

Online aggregation

SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
Consistent query answers in inconsistent databases

PODS '99 Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
The Clio project: managing heterogeneity

ACM SIGMOD Record
Processing complex aggregate queries over data streams

Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Garlic: a new flavor of federated query processing for DB2

Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Learning to Match the Schemas of Data Sources: A Multistrategy Approach

Machine Learning
Distributed Transitive Closure Computations: The Disconnection Set Approach

VLDB '90 Proceedings of the 16th International Conference on Very Large Data Bases
Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports

Proceedings of the 27th International Conference on Very Large Data Bases
A survey of approaches to automatic schema matching

The VLDB Journal — The International Journal on Very Large Data Bases
Interactive deduplication using active learning

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Statistical schema matching across web query interfaces

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Robust and efficient fuzzy match for online data cleaning

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
TAILOR: A Record Linkage Tool Box

ICDE '02 Proceedings of the 18th International Conference on Data Engineering
The Piazza Peer Data Management System

IEEE Transactions on Knowledge and Data Engineering
Efficient set joins on similarity predicates

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Exploiting dictionaries in named entity extraction: combining semi-Markov extraction processes and data integration methods

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Industrial-strength schema matching

ACM SIGMOD Record
ConQuer: efficient management of inconsistent databases

Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Data sharing in the Hyperion peer database system

VLDB '05 Proceedings of the 31st international conference on Very large data bases
Approximate joins: concepts and techniques

VLDB '05 Proceedings of the 31st international conference on Very large data bases
Efficient Batch Top-k Search for Dictionary-based Entity Recognition

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
What's Different: Distributed, Continuous Monitoring of Duplicate-Resilient Aggregates on Data Streams

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Record linkage: similarity measures and algorithms

Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Data integration: the teenage years

VLDB '06 Proceedings of the 32nd international conference on Very large data bases

Quantified Score

Hi-index	0.01

Visualization

Abstract

Many aspects of the data integration problem have been considered in the literature: how to match schemas across different data sources, how to decide when different records refer to the same entity, how to efficiently perform the required entity resolution in a batch fashion, and so on. However, what has largely been ignored is a way to efficiently deploy these existing methods in a realistic, distributed enterprise integration environment. The straightforward use of existing methods often requires that all data be shipped to a coordinator for cleaning, which is often unacceptable. We develop a set of randomized algorithms that allow efficient application of existing entity resolution methods to the answering of aggregate queries over data that have been distributed across multiple sites. Using our methods, it is possible to efficiently generate aggregate query results that account for duplicate and inconsistent values scattered across a federated system.