Type-based categorization of relational attributes

Authors:
Babak Ahmadi;Marios Hadjieleftheriou;Thomas Seidl;Divesh Srivastava;Suresh Venkatasubramanian
Affiliations:
Fraunhofer IAIS;AT&T Labs Research;RWTH Aachen University;AT&T Labs Research;University of Utah
Venue:
Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
Year:
2009

Abstract

In this work we study the categorization of relational attributes based on their data type. Assuming that attribute types and characteristics are unknown or unidentifiable, we analyze and compare a variety of type-based signatures for classifying attributes by the semantic type of the data they contain (e.g., router identifiers, social security numbers, email addresses). The signatures can subsequently be used for other applications as well, such as clustering and index optimization or compression. Such categorization is useful when very large data collections, generated in a distributed, ungoverned fashion, end up with unknown, incomplete, inconsistent, or very complex schemata and schema-level metadata. We focus on heuristically generating type-based attribute signatures using both local and global computation approaches. We show experimentally that decomposing data into q-grams and building signatures from q-gram distributions achieves very good classification accuracy, provided that a large sample of the data is available for building the signatures. We then turn to cases where only a very small sample of the data is available, so that accurately capturing the q-gram distribution of a given data type is almost impossible. For these cases we propose techniques based on dimensionality reduction and soft clustering that exploit correlations between attributes to improve classification accuracy.
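
To make the idea of a q-gram distribution signature concrete, the following is a minimal sketch in Python. It is an illustration under stated assumptions, not the signatures or classifier evaluated in the paper: the function names (`qgrams`, `signature`, `cosine`, `classify`), the choice of q = 3, and the use of cosine similarity with a nearest-signature rule are all hypothetical choices made here for brevity.

```python
from collections import Counter
import math


def qgrams(value, q=3):
    """Return the overlapping q-grams of a string.

    Strings shorter than q are kept whole; empty strings yield nothing.
    """
    if len(value) < q:
        return [value] if value else []
    return [value[i:i + q] for i in range(len(value) - q + 1)]


def signature(values, q=3):
    """Normalized q-gram frequency distribution over all values of a column."""
    counts = Counter(g for v in values for g in qgrams(v, q))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def cosine(a, b):
    """Cosine similarity between two sparse distributions (dicts)."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(column, labeled_signatures, q=3):
    """Assign the label whose signature is most similar to the column's.

    Assumption: nearest-signature matching stands in for whatever
    classifier is actually used; it is only meant to show the data flow.
    """
    sig = signature(column, q)
    return max(labeled_signatures,
               key=lambda label: cosine(sig, labeled_signatures[label]))
```

As a usage example, one could build `labeled_signatures` from a few columns of known type, for instance `{"email": signature(["alice@example.com", "bob@example.org"]), "ipv4": signature(["192.168.0.1", "10.0.0.254"])}`, and then call `classify` on an unlabeled column. With a sample this small the distributions are very sparse, which is the regime where the abstract notes that plain q-gram signatures become unreliable.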