Discovering topical structures of databases

Authors:
Wensheng Wu; Berthold Reinwald; Yannis Sismanis; Rajesh Manjrekar
Affiliations:
IBM Almaden Research Center, San Jose, CA, USA; IBM Almaden Research Center, San Jose, CA, USA; IBM Almaden Research Center, San Jose, CA, USA; IBM Almaden Research Center, San Jose, CA, USA
Venue:
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Year:
2008

Abstract

The increasing complexity of enterprise databases and the prevalent lack of documentation incur significant costs in both understanding and integrating the databases. Existing solutions address the mining of keys and foreign keys but pay little attention to higher-level database structures. In this paper, we consider the problem of discovering topical structures of databases to support semantic browsing and large-scale data integration. We describe iDisc, a novel discovery system based on a multi-strategy learning framework. iDisc exploits varied evidence in the database schema and instance values to construct multiple kinds of database representations. It employs a set of base clusterers to discover preliminary topical clusters of tables from these representations, and then aggregates them into final clusters via meta-clustering. To further improve accuracy, we extend iDisc with novel multiple-level aggregation and clusterer boosting techniques. We introduce a new measure of table importance and propose an approach to discovering cluster representatives to facilitate semantic browsing. An important feature of our framework is that it is highly extensible: additional database representations and base clusterers can be easily incorporated. We have extensively evaluated iDisc on large real-world databases, and the results show that it discovers topical structures with a high degree of accuracy.
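
To make the idea of aggregating base clusterings concrete, the following is a minimal sketch and not the iDisc algorithm itself. It assumes each base clusterer returns a mapping from table name to cluster label, and it uses a simple majority-vote rule (two tables are merged when more than half of the base clusterings place them together). The table names, the three example clusterings, the function name `aggregate_clusterings`, and the `threshold` parameter are all hypothetical; the abstract does not specify how meta-clustering is carried out.

```python
from itertools import combinations


def aggregate_clusterings(clusterings, threshold=0.5):
    """Consensus-cluster tables from several base clusterings.

    clusterings: list of dicts mapping table name -> cluster label.
    Two tables are merged when more than `threshold` of the base
    clusterings agree that they belong together (an assumed rule,
    chosen only for illustration).
    """
    tables = sorted(clusterings[0])
    parent = {t: t for t in tables}  # union-find forest over tables

    def find(t):
        # Follow parent links to the root, shortening the path as we go.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in combinations(tables, 2):
        # Count the base clusterings that put a and b in the same cluster.
        votes = sum(1 for c in clusterings if c[a] == c[b])
        if votes / len(clusterings) > threshold:
            parent[find(a)] = find(b)

    # Collect tables by the root of their merged group.
    groups = {}
    for t in tables:
        groups.setdefault(find(t), []).append(t)
    return sorted(groups.values())


# Hypothetical base clusterings, one per database representation.
by_name = {"customer": 0, "cust_addr": 0, "orders": 1, "order_item": 1}
by_values = {"customer": 0, "cust_addr": 0, "orders": 0, "order_item": 1}
by_links = {"customer": 0, "cust_addr": 1, "orders": 2, "order_item": 2}

final_clusters = aggregate_clusterings([by_name, by_values, by_links])
print(final_clusters)
```

With these made-up inputs, `customer` and `cust_addr` agree in two of three clusterings, as do `orders` and `order_item`, so the sketch yields two final clusters. Raising `threshold` to 0.9 would require unanimous agreement and leave every table in its own cluster.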