Fast UDFs to compute sufficient statistics on large data sets exploiting caching and sampling

Authors:
Carlos Ordonez;Sasi K. Pitchaimalai
Affiliations:
University of Houston, Houston, TX 77204, USA;University of Houston, Houston, TX 77204, USA
Venue:
Data & Knowledge Engineering
Year:
2010

Abstract

User-Defined Functions (UDFs) are an extensibility mechanism provided by most DBMSs, and they execute in main memory. UDFs also leverage the DBMS multi-threaded capabilities and exploit the speed and flexibility of the C language for mathematical computations. In this article, we study how to accelerate the computation of sufficient statistics on large data sets with UDFs by exploiting caching and sampling techniques. We present an aggregate UDF that computes multidimensional sufficient statistics benefiting a broad array of statistical models: the linear sum of points and the quadratic sum of cross-products of point dimensions. Caching can be applied when the data set fits in main memory. Otherwise, sampling is required to accelerate processing of very large data sets. Sampling can also be applied to data sets that can be cached, to further accelerate processing. Experiments carefully analyze performance and accuracy with real and synthetic data sets. We compare UDFs working inside the DBMS against C++ reading flat files, running on the same hardware. We show UDFs can have similar performance to C++, even when both exploit caching and multi-threading. As expected, C++ is much faster than UDFs when the data set is scanned from disk. We carefully analyze the case where sampling is required with larger data sets. We show geometric and bootstrapping sampling techniques can be faster than performing full table scans, while providing high-accuracy estimates of mean, variance and correlation. Furthermore, sampling on cached data sets can provide accurate answers in a few seconds. Detailed experiments illustrate UDF optimizations including diagonal matrix computation, join avoidance and acceleration with a multi-core CPU, when available. A profile of UDF run-time execution shows the UDF is slowed down by I/O when reading from disk.
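
To make the sufficient statistics named in the abstract concrete, the following is a minimal Python sketch, not the authors' C UDF. It accumulates the point count n, the linear sum L, and the quadratic sum of cross-products Q in one pass, then derives mean, variance and correlation from them. The class name `SufficientStats`, its methods, and the helper `stats_from_sample` are hypothetical names chosen for illustration. The accumulate/merge split only loosely mirrors how an aggregate UDF processes rows and combines partial results. The sampling helper uses plain Bernoulli row sampling as a stand-in; it is not the geometric or bootstrapping technique studied in the article.

```python
import math
import random


class SufficientStats:
    """Holds n, L (linear sum) and Q (sum of cross-products) for d-dimensional points."""

    def __init__(self, d):
        self.d = d
        self.n = 0
        self.L = [0.0] * d
        self.Q = [[0.0] * d for _ in range(d)]

    def accumulate(self, x):
        """Row step: fold one point into the running sums."""
        self.n += 1
        for a in range(self.d):
            self.L[a] += x[a]
            for b in range(a + 1):  # lower triangle only, since Q is symmetric
                self.Q[a][b] += x[a] * x[b]

    def merge(self, other):
        """Combine partial results, e.g. from two threads or partitions."""
        self.n += other.n
        for a in range(self.d):
            self.L[a] += other.L[a]
            for b in range(a + 1):
                self.Q[a][b] += other.Q[a][b]

    def mean(self, a):
        return self.L[a] / self.n

    def variance(self, a):
        # Population variance: Q_aa / n - (L_a / n)^2
        return self.Q[a][a] / self.n - self.mean(a) ** 2

    def correlation(self, a, b):
        if a < b:
            a, b = b, a  # read from the stored lower triangle
        cov = self.Q[a][b] / self.n - self.mean(a) * self.mean(b)
        return cov / math.sqrt(self.variance(a) * self.variance(b))


def stats_from_sample(rows, fraction, seed=0):
    """Assumption: simple Bernoulli sampling, keeping each row with probability `fraction`.

    The result is undefined for an empty sample (mean would divide by zero).
    """
    rng = random.Random(seed)
    s = SufficientStats(len(rows[0]))
    for x in rows:
        if rng.random() < fraction:
            s.accumulate(x)
    return s


if __name__ == "__main__":
    rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
    full = SufficientStats(2)
    for r in rows:
        full.accumulate(r)
    print(full.mean(0), full.variance(0), full.correlation(0, 1))
```

Because n, L and Q are plain sums, partial results computed on disjoint partitions can be merged by addition, and statistics computed on a sample have the same form as those from a full scan. Storing only the diagonal of Q would be enough when just variances are needed, which is one reading of the diagonal matrix optimization the abstract mentions.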