Splash: ad-hoc querying of data and statistical models

Authors:
Lujun Fang;Kristen LeFevre
Affiliations:
University of Michigan, Ann Arbor, MI;University of Michigan, Ann Arbor, MI
Venue:
Proceedings of the 13th International Conference on Extending Database Technology
Year:
2010

Abstract

Data mining is increasingly performed by people who are not computer scientists or professional programmers. It is often an iterative process involving multiple ad-hoc tasks, as well as data pre- and post-processing, all of which must be executed over large databases. To make data mining more accessible, it is critical to provide a simple, easy-to-use language that lets the user specify ad-hoc data processing, model construction, and model manipulation. At the same time, the underlying system must scale to large datasets. Unfortunately, while existing systems can satisfy each of these requirements individually, no system fully satisfies all of them. In this paper, we present a system called Splash to fill this void. Splash supports an extended relational data model and SQL query language, which allows for the natural integration of statistical modeling and ad-hoc data processing. It also supports a novel representatives operator that helps explain models using a limited number of examples. We have developed a prototype implementation of Splash. Our experimental study indicates that it scales well to large input datasets. Further, to demonstrate the simplicity of the language, we conducted a case study using Splash to perform a series of exploratory analyses on network log data. Our study indicates that the query-based interface is simpler than a common data mining software package and often requires less programming effort to use.