Data mining research typically assumes that the data to be analyzed has been identified, gathered, cleaned, and processed into a convenient form. While data mining tools greatly enhance the analyst's ability to make data-driven discoveries, most of the time in an analysis is spent identifying, gathering, cleaning, and processing the data. Similarly, schema mapping tools have been developed to help automate the task of using legacy or federated data sources for a new purpose, but they assume that the structure of the data sources is well understood. However, the data sets to be federated may come from dozens of databases containing thousands of tables and tens of thousands of fields, with little reliable documentation about primary keys or foreign keys.

We are developing a system, Bellman, which performs data mining on the structure of the database. In this paper, we present techniques for quickly identifying which fields have similar values, identifying join paths, estimating join directions and sizes, and identifying structures in the database. The results of the database structure mining allow the analyst to make sense of the database content. This information can be used, for example, to prepare data for data mining, find foreign key joins for schema mapping, or identify steps to be taken to prevent the database from collapsing under the weight of its complexity.
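
The abstract does not say how fields with similar values are found quickly. One widely used way to approximate set resemblance without comparing full columns is a min-hash signature, and the sketch below illustrates that idea only; it is not taken from the paper. The function names (`minhash_signature`, `estimated_resemblance`, `exact_resemblance`), the choice of BLAKE2b as the hash, and the signature length of 256 are all assumptions made for illustration.

```python
import hashlib


def _hash(value, seed):
    """Deterministic 64-bit hash of a field value under a given seed (illustrative choice)."""
    data = f"{seed}:{value}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(values, num_hashes=256):
    """Summarize the distinct values of one field as num_hashes minimum hash values."""
    distinct = set(values)
    if not distinct:
        raise ValueError("field has no values")
    return [min(_hash(v, seed) for v in distinct) for seed in range(num_hashes)]


def estimated_resemblance(sig_a, sig_b):
    """Estimate |A n B| / |A u B| as the fraction of signature positions that agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def exact_resemblance(values_a, values_b):
    """Exact Jaccard resemblance of the two value sets, for comparison."""
    set_a, set_b = set(values_a), set(values_b)
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical fields: two ID columns that share half of their values.
orders_customer_id = range(0, 600)
billing_customer_id = range(300, 900)

sig_orders = minhash_signature(orders_customer_id)
sig_billing = minhash_signature(billing_customer_id)

print("exact:    ", exact_resemblance(orders_customer_id, billing_customer_id))
print("estimated:", estimated_resemblance(sig_orders, sig_billing))
```

Each signature is computed once per field and has a fixed size, so a pair of fields can be compared without rescanning either table. A high estimated resemblance flags the pair as a candidate join path that an analyst could then verify against the data.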