Can we analyze big data inside a DBMS?

Authors:
Carlos Ordonez
Affiliations:
University of Houston, Houston, TX, USA
Venue:
Proceedings of the sixteenth international workshop on Data warehousing and OLAP
Year:
2013

Citing 38
Cited 1

Mining association rules between sets of items in large databases

SIGMOD '93 Proceedings of the 1993 ACM SIGMOD international conference on Management of data
Integrating association rule mining with relational database systems: alternatives and implications

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
NonStop SQL/MX primitives for knowledge discovery

KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
SQLEM: fast clustering in SQL using the EM algorithm

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
SQL database primitives for decision tree classifiers

Proceedings of the tenth international conference on Information and knowledge management
Mining database structure; or, how to build a data quality browser

Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Discovery-Driven Exploration of OLAP Data Cubes

EDBT '98 Proceedings of the 6th International Conference on Extending Database Technology: Advances in Database Technology
Index Selection for OLAP

ICDE '97 Proceedings of the Thirteenth International Conference on Data Engineering
Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Total

ICDE '96 Proceedings of the Twelfth International Conference on Data Engineering
Integrating Data Mining with SQL Databases: OLE DB for Data Mining

Proceedings of the 17th International Conference on Data Engineering
Spreadsheets in RDBMS for OLAP

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Effective use of block-level sampling in statistics estimation

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Vertical and horizontal percentage aggregations

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
SVM in oracle database 10g: removing the barriers to widespread adoption of support vector machines

VLDB '05 Proceedings of the 31st international conference on Very large data bases
Integrating K-Means Clustering with a Relational DBMS Using SQL

IEEE Transactions on Knowledge and Data Engineering
Incremental approximate matrix factorization for speeding up support vector machines

Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
PIVOT and UNPIVOT: optimization and execution strategies in an RDBMS

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Referential integrity quality metrics

Decision Support Systems
The end of an architectural era: (it's time for a complete rewrite)

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
MapReduce: simplified data processing on large clusters

Communications of the ACM - 50th anniversary issue: 1958 - 2008
CRD: fast co-clustering on large datasets utilizing sampling-based matrix decomposition

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
MapReduce and parallel DBMSs: friends or foes?

Communications of the ACM - Amir Pnueli: Ahead of His Time
SQL/MapReduce: a practical approach to self-describing, polymorphic, and parallelizable user-defined functions

Proceedings of the VLDB Endowment
PLANET: massively parallel learning of tree ensembles with MapReduce

Proceedings of the VLDB Endowment
MAD skills: new analysis practices for big data

Proceedings of the VLDB Endowment
Overview of sciDB: large scale array storage, processing and analysis

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Ricardo: integrating R and Hadoop

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
HadoopDB in action: building real world applications

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Statistical Model Computation with UDFs

IEEE Transactions on Knowledge and Data Engineering
On the Computation of Stochastic Search Variable Selection in Linear Regression with UDFs

ICDM '10 Proceedings of the 2010 IEEE International Conference on Data Mining
ASTERIX: towards a scalable, semistructured data platform for evolving-world models

Distributed and Parallel Databases
ArrayStore: a storage manager for complex parallel array processing

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Dynamic optimization of generalized SQL queries with horizontal aggregations

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
The MADlib analytics library: or MAD skills, the SQL

Proceedings of the VLDB Endowment
Fast PCA computation in a DBMS with aggregate UDFs and LAPACK

Proceedings of the 21st ACM international conference on Information and knowledge management
BigBench: towards an industry standard benchmark for big data analytics

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Shark: SQL and rich analytics at scale

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
ODYS: an approach to building a massively-parallel search engine using a DB-IR tightly-integrated parallel DBMS for higher-level functionality

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

DOLAP 2013 workshop summary

Proceedings of the 22nd ACM international conference on Conference on information & knowledge management

Quantified Score

Hi-index	0.00

Visualization

Abstract

Relational DBMSs remain the main data management technology, despite the big data analytics and no-SQL waves. On the other hand, for data analytics in a broad sense, there are plenty of non-DBMS tools including statistical languages, matrix packages, generic data mining programs and large-scale parallel systems, being the main technology for big data analytics. Such large-scale systems are mostly based on the Hadoop distributed file system and MapReduce. Thus it would seem a DBMS is not a good technology to analyze big data, going beyond SQL queries, acting just as a reliable and fast data repository. In this survey, we argue that is not the case, explaining important research that has enabled analytics on large databases inside a DBMS. However, we also argue DBMSs cannot compete with parallel systems like MapReduce to analyze web-scale text data. Therefore, each technology will keep influencing each other. We conclude with a proposal of long-term research issues, considering the "big data analytics" trend.