Horizontal aggregations for building tabular data sets

Authors:
Carlos Ordonez
Affiliations:
Teradata, NCR, San Diego, CA
Venue:
Proceedings of the 9th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery
Year:
2004

Abstract

In a data mining project, a significant portion of time is devoted to building a data set suitable for analysis. In a relational database environment, building such a data set usually requires joining tables and aggregating columns with SQL queries. Existing SQL aggregations are limited because they return a single number per aggregated group, producing one row for each computed number. These aggregations help, but significant effort is still required to build data sets suitable for data mining, where a tabular format is generally required. This work proposes simple yet powerful extensions to SQL aggregate functions that produce aggregations in tabular form, returning a set of numbers per row instead of a single number. We call this new class of functions horizontal aggregations. Horizontal aggregations help build answer sets in tabular form (e.g., point-dimension, observation-variable, instance-feature), which is the standard form needed by most data mining algorithms. Two common data preparation tasks are explained: transposition/aggregation and transforming categorical attributes into binary dimensions. We propose two strategies to evaluate horizontal aggregations using standard SQL. The first strategy is based only on relational operators, and the second uses the "case" construct. Experiments with large data sets study the proposed query optimization strategies.
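
The two evaluation strategies named in the abstract can be illustrated with a small sketch. It assumes a hypothetical table `sales(store, dept, amount)` and uses Python's standard `sqlite3` module only to run the SQL; the table, column names, and helper functions are illustrative assumptions, and the generated queries are one plausible reading of each strategy, not necessarily the formulation used in the paper. Both functions produce one row per `store` with one `SUM(amount)` column per distinct `dept` value.

```python
import sqlite3


def make_sales_db():
    """Build a tiny in-memory table (hypothetical schema, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (store INTEGER, dept TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [(1, "food", 10), (1, "toys", 5), (1, "food", 2), (2, "toys", 7)],
    )
    return conn


def pivot_values(conn):
    """Distinct values of the column being transposed, in a fixed order."""
    rows = conn.execute("SELECT DISTINCT dept FROM sales ORDER BY dept")
    return [row[0] for row in rows]


def horizontal_sum_case(conn, values):
    """'case' strategy: a single scan with one conditional aggregate per value."""
    cols = ", ".join("SUM(CASE WHEN dept = ? THEN amount END)" for _ in values)
    sql = f"SELECT store, {cols} FROM sales GROUP BY store ORDER BY store"
    return conn.execute(sql, values).fetchall()


def horizontal_sum_joins(conn, values):
    """Relational-operator strategy: one filtered vertical aggregation per value,
    left-outer-joined to the set of distinct grouping keys."""
    select = ["f0.store"]
    joins = []
    for i in range(1, len(values) + 1):
        select.append(f"f{i}.total")
        joins.append(
            "LEFT OUTER JOIN (SELECT store, SUM(amount) AS total FROM sales "
            f"WHERE dept = ? GROUP BY store) AS f{i} ON f{i}.store = f0.store"
        )
    sql = (
        f"SELECT {', '.join(select)} "
        "FROM (SELECT DISTINCT store FROM sales) AS f0 "
        + " ".join(joins)
        + " ORDER BY f0.store"
    )
    return conn.execute(sql, values).fetchall()


if __name__ == "__main__":
    conn = make_sales_db()
    values = pivot_values(conn)
    print(values)
    print(horizontal_sum_case(conn, values))
    print(horizontal_sum_joins(conn, values))
```

In this sketch, a store with no rows for a given department gets NULL in that column under both strategies. Replacing the aggregate with something like `MAX(CASE WHEN dept = ? THEN 1 ELSE 0 END)` would give a 0/1 column per category, which is one way to approach the binary-dimension coding task the abstract mentions.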