Tests and variables selection on regression analysis for massive datasets
Data & Knowledge Engineering
In recent decades, we have witnessed a revolution in information technology. Routine collection of systematically generated data is now commonplace, and databases with hundreds of fields (variables) and billions of records (observations) are not unusual. This presents a difficulty for classical data analysis methods, mainly due to limitations of computer memory and computational cost (in time, for example). In this paper, we propose a regression analysis methodology suitable for modeling massive datasets. The basic idea is to split the entire dataset into several blocks, apply classical regression techniques to the data in each block, and then combine the block-wise regression results via weighted averages. A theoretical justification of the proposed method is given, and its empirical performance is assessed through an extensive simulation study.
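The split-and-combine idea in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: the function name `blockwise_ols` is hypothetical, and the weighting scheme shown (matrix weights proportional to each block's X'X, a form of precision weighting) is one plausible choice of weighted average, since the abstract does not specify the weights.

```python
import numpy as np

def blockwise_ols(X, y, n_blocks):
    """Fit OLS separately on each block, then combine the block
    estimates via a weighted average.

    Weighting assumption (not from the paper): each block estimate
    beta_b is weighted by its block's X_b' X_b matrix, i.e.
        beta_combined = (sum_b X_b'X_b)^{-1} sum_b (X_b'X_b) beta_b.
    """
    n, _ = X.shape
    blocks = np.array_split(np.arange(n), n_blocks)
    total_prec = 0.0       # running sum of X_b' X_b
    weighted_sum = 0.0     # running sum of (X_b' X_b) beta_b
    for block in blocks:
        Xb, yb = X[block], y[block]
        XtX = Xb.T @ Xb
        beta_b = np.linalg.solve(XtX, Xb.T @ yb)  # per-block OLS fit
        total_prec = total_prec + XtX
        weighted_sum = weighted_sum + XtX @ beta_b
    return np.linalg.solve(total_prec, weighted_sum)
```

Only one block of data needs to be in memory at a time, which is the point for massive datasets. A side note on this particular weighting: since (X_b'X_b) beta_b = X_b' y_b, the combined estimate algebraically equals the full-sample OLS estimator, so nothing is lost by splitting under these weights.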