TupleRank: ranking discovered content in virtual databases

Authors:
Jacob Berlin;Amihai Motro
Affiliations:
Information and Software Engineering Department, George Mason University, Fairfax, VA;Information and Software Engineering Department, George Mason University, Fairfax, VA
Venue:
NGITS'06 Proceedings of the 6th international conference on Next Generation Information Technologies and Systems
Year:
2006

Abstract

Recently, the problem of data integration has been newly addressed by methods based on machine learning and discovery. Such methods are intended to automate, at least in part, the laborious process of information integration, by which existing data sources are incorporated in a virtual database. Essentially, these methods scan new data sources, attempting to discover possible mappings to the virtual database. Like all discovery processes, this process is intrinsically probabilistic; that is, each discovery is associated with a specific value that denotes assurance of its appropriateness. Consequently, the rows in a discovered virtual table have mixed assurance levels, with some rows being more credible than others. We argue that rows in discovered virtual databases should be ranked, and we describe a ranking method, called TupleRank, for calculating such a ranking order. Roughly speaking, TupleRank calibrates the probabilities calculated during a discovery process with historical information about the performance of the system. The work is done in the framework of the Autoplex system for discovering content for virtual databases, and initial experimentation is reported and discussed.
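
As a rough illustration of the calibration idea described in the abstract, the following minimal sketch adjusts raw discovery probabilities using a history of past discoveries whose correctness is known, then sorts rows by the adjusted value. It is not the TupleRank algorithm from the paper: the binning scheme and the names `fit_calibration`, `calibrate`, and `rank_rows` are assumptions made for this example only, as is the shape of the history data.

```python
# Hypothetical sketch: calibrate raw discovery scores against historical
# outcomes, then rank rows by the calibrated value. The equal-width binning
# approach is an assumption for illustration, not the paper's method.

def bin_index(score, n_bins):
    """Map a score in [0, 1] to one of n_bins equal-width bins."""
    return min(int(score * n_bins), n_bins - 1)

def fit_calibration(history, n_bins=5):
    """history: list of (raw_score, was_correct) pairs from past discoveries.
    Returns, per bin, the fraction of past discoveries that were correct,
    or None for bins with no historical data."""
    hits = [0] * n_bins
    totals = [0] * n_bins
    for score, correct in history:
        b = bin_index(score, n_bins)
        totals[b] += 1
        if correct:
            hits[b] += 1
    return [hits[b] / totals[b] if totals[b] else None for b in range(n_bins)]

def calibrate(score, table):
    """Replace a raw score with the historical precision of its bin.
    Falls back to the raw score when the bin has no history."""
    value = table[bin_index(score, len(table))]
    return score if value is None else value

def rank_rows(rows, table):
    """rows: list of (row_id, raw_score) pairs.
    Returns (calibrated_score, row_id) pairs, most credible first."""
    scored = [(calibrate(score, table), row_id) for row_id, score in rows]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

if __name__ == "__main__":
    # Made-up history: raw scores near 0.9 were right only 2 times out of 3.
    history = [(0.95, True), (0.9, True), (0.92, False),
               (0.55, True), (0.5, False), (0.45, False), (0.1, False)]
    table = fit_calibration(history)
    for calibrated, row_id in rank_rows([("r1", 0.5), ("r2", 0.97), ("r3", 0.3)], table):
        print(row_id, round(calibrated, 3))
```

In this sketch a row's final rank depends on how reliable past discoveries with similar raw scores turned out to be, so an overconfident discovery process is discounted accordingly. The history values are invented for the example.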