Mining database structure; or, how to build a data quality browser

Authors:
Tamraparni Dasu;Theodore Johnson;S. Muthukrishnan;Vladislav Shkapenyuk
Affiliations:
AT&T Labs-Research;AT&T Labs-Research;AT&T Labs-Research;AT&T Labs-Research
Venue:
Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Year:
2002

Abstract

Data mining research typically assumes that the data to be analyzed has been identified, gathered, cleaned, and processed into a convenient form. While data mining tools greatly enhance the ability of the analyst to make data-driven discoveries, most of the time spent performing an analysis goes to identifying, gathering, cleaning, and processing the data. Similarly, schema mapping tools have been developed to help automate the task of using legacy or federated data sources for a new purpose, but they assume that the structure of the data sources is well understood. However, the data sets to be federated may come from dozens of databases containing thousands of tables and tens of thousands of fields, with little reliable documentation about primary keys or foreign keys.

We are developing a system, Bellman, which performs data mining on the structure of the database. In this paper, we present techniques for quickly identifying which fields have similar values, identifying join paths, estimating join directions and sizes, and identifying structures in the database. The results of the database structure mining allow the analyst to make sense of the database content. This information can be used, for example, to prepare data for data mining, find foreign key joins for schema mapping, or identify steps to be taken to prevent the database from collapsing under the weight of its complexity.
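
The abstract does not say how fields with similar values are identified. As a rough illustration only, the sketch below uses min-hash signatures, a standard way to estimate the resemblance (Jaccard similarity) of two value sets from small fixed-size summaries. The function names, the choice of hash, and the signature length of 128 are assumptions made for this example, not details taken from the paper.

```python
import hashlib


def _hash(value: str, seed: int) -> int:
    """Deterministic 64-bit hash of a value under a given seed (illustrative choice)."""
    data = f"{seed}:{value}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(values, num_hashes=128):
    """Summarize the distinct values of one field as num_hashes minimum hash values."""
    distinct = set(values)
    if not distinct:
        raise ValueError("field has no values")
    return [min(_hash(v, seed) for v in distinct) for seed in range(num_hashes)]


def estimate_resemblance(sig_a, sig_b):
    """Estimate |A ∩ B| / |A ∪ B| as the fraction of signature positions that agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_resemblance(values_a, values_b):
    """Exact resemblance, for checking the estimate on small inputs."""
    a, b = set(values_a), set(values_b)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Two hypothetical fields whose value sets overlap in 500 of 1500 distinct values.
    customer_id = [str(i) for i in range(0, 1000)]
    order_customer = [str(i) for i in range(500, 1500)]

    sig_a = minhash_signature(customer_id)
    sig_b = minhash_signature(order_customer)
    print("estimated:", estimate_resemblance(sig_a, sig_b))
    print("exact:    ", exact_resemblance(customer_id, order_customer))
```

Because each signature has a fixed size regardless of how many rows the field holds, signatures could be computed once per field and then compared pairwise without rescanning the tables; a pair of fields with high estimated resemblance would be a candidate join path to inspect further.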