In this work, we concentrate on categorizing relational attributes by their data type. Assuming that attribute types and characteristics are unknown or unidentifiable, we analyze and compare a variety of type-based signatures for classifying attributes according to the semantic type of the data they contain (e.g., router identifiers, social security numbers, email addresses). The signatures can subsequently be used for other applications as well, such as clustering and index optimization or compression. This is useful when very large data collections, generated in a distributed, ungoverned fashion, end up with unknown, incomplete, inconsistent, or very complex schemata and schema-level metadata. We concentrate on heuristically generating type-based attribute signatures using both local and global computation approaches. We show experimentally that decomposing data into q-grams and building signatures from q-gram distributions yields very good classification accuracy, provided that a large sample of the data is available for building the signatures. We then turn to cases where only a very small sample of the data is available, so that accurately capturing the q-gram distribution of a given data type is almost impossible. For these cases, we propose techniques based on dimensionality reduction and soft clustering that exploit correlations between attributes to improve classification accuracy.
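To make the idea of a q-gram distribution signature concrete, the following is a minimal sketch and not the authors' implementation. It builds a normalized q-gram frequency distribution for a column and assigns an unlabeled column to the closest labeled signature. The function names (`qgram_signature`, `js_divergence`, `classify`), the choice of q = 3, and the use of Jensen-Shannon divergence as the distance are all assumptions made for illustration; the abstract does not say which distance or classifier is used.

```python
import math
from collections import Counter


def qgram_signature(values, q=3):
    """Return the normalized q-gram frequency distribution of a column.

    Hypothetical helper: each value is converted to a string and split into
    overlapping substrings of length q; counts are pooled over all values.
    """
    counts = Counter()
    for value in values:
        text = str(value)
        for i in range(len(text) - q + 1):
            counts[text[i:i + q]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def js_divergence(p, r):
    """Jensen-Shannon divergence (base 2) between two sparse distributions.

    Assumed distance measure, chosen because it stays finite when the two
    signatures have different supports.
    """
    divergence = 0.0
    for gram in set(p) | set(r):
        p_g = p.get(gram, 0.0)
        r_g = r.get(gram, 0.0)
        mean = (p_g + r_g) / 2
        if p_g > 0:
            divergence += 0.5 * p_g * math.log2(p_g / mean)
        if r_g > 0:
            divergence += 0.5 * r_g * math.log2(r_g / mean)
    return divergence


def classify(column, labeled_signatures, q=3):
    """Return the label whose signature is closest to the column's signature."""
    signature = qgram_signature(column, q)
    return min(
        labeled_signatures,
        key=lambda label: js_divergence(signature, labeled_signatures[label]),
    )


# Toy labeled columns (invented values, for illustration only).
labeled = {
    "email": qgram_signature(["alice@example.com", "bob@example.org"]),
    "ssn": qgram_signature(["123-45-6789", "987-65-4321"]),
}
predicted = classify(["carol@example.net"], labeled)
```

With samples this small, two columns of the same type can easily share no q-grams at all, which is the small-sample difficulty the abstract describes and the motivation for the dimensionality-reduction and soft-clustering techniques it proposes.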