Information theory for data management

Authors:
Divesh Srivastava;Suresh Venkatasubramanian
Affiliations:
AT&T Labs--Research;University of Utah
Venue:
Proceedings of the VLDB Endowment
Year:
2009

Abstract

We are awash in data. The explosion in computing power and computing infrastructure allows us to generate multitudes of data, in differing formats, at different scales, and in inter-related areas. Data management is fundamentally about the harnessing of this data to extract information, discovering good representations of the information, and analyzing information sources to glean structure. Data management generally presents us with cost-benefit tradeoffs. If we store more information, we get better answers to queries, but we pay the price in terms of increased storage. Conversely, reducing the amount of information we store improves performance at the cost of decreased accuracy for query results. The ability to quantify information gain or loss can only improve our ability to design good representations, storage mechanisms, and analysis tools for data.