User-Defined Functions (UDFs) are an extensibility mechanism provided by most DBMSs, and they execute in main memory. UDFs also leverage the multi-threaded capabilities of the DBMS and exploit the speed and flexibility of the C language for mathematical computations. In this article, we study how to accelerate the computation of sufficient statistics on large data sets with UDFs by exploiting caching and sampling techniques. We present an aggregate UDF that computes multidimensional sufficient statistics benefiting a broad array of statistical models: the linear sum of points and the quadratic sum of cross-products of point dimensions. Caching can be applied when the data set fits in main memory; otherwise, sampling is required to accelerate processing of very large data sets. Sampling can also be applied to data sets that can be cached, to further accelerate processing. Experiments analyze performance and accuracy with real and synthetic data sets. We compare UDFs working inside the DBMS with C++ reading flat files, running on the same hardware. We show UDFs can have similar performance to C++, even if both exploit caching and multi-threading. As expected, C++ is much faster than UDFs when the data set is scanned from disk. We carefully analyze the case where sampling is required with larger data sets. We show geometric and bootstrapping sampling techniques can be faster than performing full table scans, while providing high-accuracy estimates of mean, variance and correlation. Furthermore, sampling on cached data sets can provide accurate answers in a few seconds. Detailed experiments illustrate UDF optimizations including diagonal matrix computation, join avoidance and acceleration with a multi-core CPU, when available. A profile of UDF run-time execution shows the UDF is slowed down by I/O when reading from disk.
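The sufficient statistics named above can be illustrated with a minimal sketch. It assumes the usual definitions: n is the number of points, L is the linear sum of points, and Q is the quadratic sum of cross-products of point dimensions. The sketch is in plain Python rather than a C UDF. All function names here (`sufficient_stats`, `variance`, `correlation`, `sampled_stats`) are hypothetical and are not taken from the article. The `diagonal` flag is one possible reading of the diagonal matrix optimization. The sampling helper uses plain uniform sampling without replacement as a stand-in; it is not the geometric or bootstrapping technique evaluated in the article.

```python
import math
import random


def sufficient_stats(points, d, diagonal=False):
    """Single pass over d-dimensional points; returns (n, L, Q).

    L[a] is the sum of x[a]; Q[a][b] is the sum of x[a] * x[b].
    With diagonal=True only Q[a][a] is accumulated (assumed reading of
    the "diagonal matrix" optimization), which is enough for variances.
    """
    n = 0
    L = [0.0] * d
    Q = [[0.0] * d for _ in range(d)]
    for x in points:
        n += 1
        for a in range(d):
            L[a] += x[a]
            if diagonal:
                Q[a][a] += x[a] * x[a]
            else:
                # Q is symmetric, so accumulate the lower triangle only.
                for b in range(a + 1):
                    Q[a][b] += x[a] * x[b]
    if not diagonal:
        # Mirror the lower triangle into the upper triangle.
        for a in range(d):
            for b in range(a):
                Q[b][a] = Q[a][b]
    return n, L, Q


def mean(n, L):
    return [v / n for v in L]


def variance(n, L, Q):
    """Population variance per dimension: Q[a][a]/n - (L[a]/n)^2."""
    return [Q[a][a] / n - (L[a] / n) ** 2 for a in range(len(L))]


def correlation(n, L, Q, a, b):
    """Pearson correlation of dimensions a and b from (n, L, Q)."""
    num = n * Q[a][b] - L[a] * L[b]
    den = (math.sqrt(n * Q[a][a] - L[a] ** 2)
           * math.sqrt(n * Q[b][b] - L[b] ** 2))
    return num / den


def sampled_stats(points, d, k, seed=0):
    """Statistics on a uniform sample of k points (illustrative stand-in,
    not the article's geometric or bootstrapping sampling)."""
    rng = random.Random(seed)
    return sufficient_stats(rng.sample(points, k), d)


# Usage: three points on the line y = 2x.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
n, L, Q = sufficient_stats(points, 2)
print(n, L, Q)
print(mean(n, L), variance(n, L, Q), correlation(n, L, Q, 0, 1))
```

Because mean, variance and correlation are derived from n, L and Q alone, the data set needs to be read only once, and the same derivation applies unchanged to statistics gathered from a sample.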