The increasing complexity of enterprise databases and the prevalent lack of documentation incur significant costs in both understanding and integrating the databases. Existing solutions have addressed mining for keys and foreign keys, but have paid little attention to higher-level structures of databases. In this paper, we consider the problem of discovering topical structures of databases to support semantic browsing and large-scale data integration. We describe iDisc, a novel discovery system based on a multi-strategy learning framework. iDisc exploits varied evidence in database schema and instance values to construct multiple kinds of database representations. It employs a set of base clusterers to discover preliminary topical clusters of tables from these representations, and then aggregates them into final clusters via meta-clustering. To further improve accuracy, we extend iDisc with novel multiple-level aggregation and clusterer boosting techniques. We introduce a new measure of table importance and propose an approach to discovering cluster representatives to facilitate semantic browsing. An important feature of our framework is that it is highly extensible: additional database representations and base clusterers can be easily incorporated. We have extensively evaluated iDisc using large real-world databases, and the results show that it discovers topical structures with a high degree of accuracy.
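The aggregation step can be illustrated with a minimal sketch. This is not iDisc's actual meta-clustering algorithm, which the abstract does not specify; it swaps in a simple co-association (pairwise voting) consensus, where two tables end up in the same final cluster if more than a threshold fraction of base clusterers grouped them together. The table names, the three base clusterings, the function names, and the 0.5 threshold are all hypothetical.

```python
from itertools import combinations


def co_association(tables, base_clusterings):
    """Fraction of base clusterings that place each pair of tables together."""
    votes = {pair: 0 for pair in combinations(sorted(tables), 2)}
    for clustering in base_clusterings:
        for cluster in clustering:
            # Sorting keeps pair order consistent with the keys of `votes`.
            for pair in combinations(sorted(cluster), 2):
                votes[pair] += 1
    n = len(base_clusterings)
    return {pair: count / n for pair, count in votes.items()}


def meta_cluster(tables, base_clusterings, threshold=0.5):
    """Merge tables whose pairwise agreement exceeds `threshold` (assumed value)."""
    parent = {t: t for t in tables}

    def find(t):
        # Union-find lookup with path halving.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for (a, b), score in co_association(tables, base_clusterings).items():
        if score > threshold:
            parent[find(a)] = find(b)

    groups = {}
    for t in tables:
        groups.setdefault(find(t), set()).add(t)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical tables and the output of three hypothetical base clusterers,
# each working from a different database representation.
tables = ["customer", "address", "order", "order_item", "product"]
by_name = [{"customer", "address"}, {"order", "order_item"}, {"product"}]
by_values = [{"customer", "address"}, {"order", "order_item", "product"}]
by_fk = [{"customer", "address", "order"}, {"order_item", "product"}]

final_clusters = meta_cluster(tables, [by_name, by_values, by_fk])
print(final_clusters)
```

In this toy input, no single base clusterer produces the final grouping, yet the majority vote over pairs recovers a customer-related cluster and an order-related cluster. Adding a new representation amounts to appending one more clustering to the list, which mirrors the extensibility the abstract claims.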