Tests and variables selection on regression analysis for massive datasets

Authors: Tsai-Hung Fan; Kuang-Fu Cheng
Affiliations: Graduate Institute of Statistics, National Central University, Chungli, Taiwan, ROC (both authors)
Venue: Data & Knowledge Engineering
Year: 2007

Abstract

According to Lindley's paradox, most point null hypotheses will be rejected when the sample size is very large. This paper proposes a two-stage block testing procedure for regression analysis of massive data. New variable selection criteria, incorporated into the classical stepwise procedure, are also developed to select significant explanatory variables. The approach is computationally simple for massive data, and a simulation study confirms that it is more accurate, in the sense of achieving the nominal significance level, for huge data sets. A real example with a moderate sample size verifies that the proposed procedure is accurate compared with the classical method, and the procedure is also applied to a huge real data set to select appropriate regressors.
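
The large-sample effect that motivates the paper can be illustrated with a small simulation. The sketch below is not the authors' two-stage block testing procedure, whose details the abstract does not give. It only fits a simple linear regression by ordinary least squares and computes the classical t-statistic for the slope. The function name `slope_t_statistic`, the true slope of 0.02, the unit noise variance, and the two sample sizes are all assumptions chosen for illustration. The cutoff 1.96 is the usual two-sided 5% critical value under the normal approximation.

```python
import math
import random


def slope_t_statistic(n, beta, seed=0):
    """Simulate y = beta * x + noise and return (slope estimate, t-statistic for H0: slope = 0)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [beta * x + rng.gauss(0.0, 1.0) for x in xs]

    # Ordinary least squares for a single regressor with an intercept.
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Residual variance and the standard error of the slope.
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    sigma2 = rss / (n - 2)
    se = math.sqrt(sigma2 / sxx)
    return slope, slope / se


# Hypothetical setting: a practically negligible slope, tested at two sample sizes.
for n in (200, 200_000):
    slope, t = slope_t_statistic(n, beta=0.02)
    print(f"n={n:>7}  slope={slope:+.4f}  t={t:+.2f}  reject at 5%: {abs(t) > 1.96}")
```

Under these assumptions the t-statistic grows roughly like 0.02 times the square root of n, so at the larger sample size the classical test flags the slope as significant even though the effect is tiny. This is the behavior the abstract describes for point null hypotheses on massive data.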