Stochastic simulation
A guide to simulation (2nd ed.)
KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Likelihood-Based Data Squashing: A Modeling Approach to Instance Construction
Data Mining and Knowledge Discovery
Rule-based statistical calculations on a database abstract
Bit Reduction Support Vector Machine
ICDM '05 Proceedings of the Fifth IEEE International Conference on Data Mining
Empirical likelihood confidence intervals for differences between two datasets with missing data
Pattern Recognition Letters
Estimating confidence intervals for structural differences between contrast groups with missing data
Expert Systems with Applications: An International Journal
Fast support vector machines for continuous data
IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics - Special issue on cybernetics and cognitive informatics
Combining kNN Imputation and Bootstrap Calibrated Empirical Likelihood for Incomplete Data Analysis
International Journal of Data Warehousing and Mining
Data squashing was introduced by W. DuMouchel, C. Volinsky, T. Johnson, C. Cortes, and D. Pregibon in the Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining (1999). The idea is to scale large data sets down to smaller representative samples instead of scaling algorithms up to very large data sets. They report success in learning model coefficients from squashed data. This paper presents a form of data squashing based on empirical likelihood: the method reweights a random sample of data so that certain sample expected values match those of the population. The required computation is a relatively easy convex optimization, and there is a theoretical basis for predicting when reweighting will and will not produce large gains. In a credit scoring example, empirical likelihood weighting accelerates the rate at which coefficients are learned. We also investigate the extent to which these benefits translate into improved accuracy, and consider reweighting in conjunction with boosted decision trees.
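As a rough illustration (a minimal sketch, not code from the paper), the reweighting step can be posed as a one-dimensional empirical likelihood problem: choose weights w_i on a sample x_1, ..., x_n to maximize the sum of log w_i subject to the weights summing to one and the weighted mean matching a known population mean mu. The solution has the form w_i = 1 / (n (1 + lambda (x_i - mu))), where lambda solves a one-dimensional root-finding problem; the function name `el_weights` and the Newton iteration below are my assumptions about one reasonable implementation, not the authors'.

```python
import numpy as np

def el_weights(x, mu, tol=1e-10, max_iter=100):
    """Empirical likelihood weights matching the weighted mean of x to mu.

    Solves sum_i g_i / (1 + lam * g_i) = 0 for lam by Newton's method,
    where g_i = x_i - mu, then sets w_i = 1 / (n * (1 + lam * g_i)).
    (Illustrative sketch only; assumes mu lies inside the range of x.)
    """
    g = np.asarray(x, dtype=float) - mu
    n = len(g)
    lam = 0.0
    for _ in range(max_iter):
        denom = 1.0 + lam * g
        f = np.sum(g / denom)              # constraint residual
        fp = -np.sum((g / denom) ** 2)     # derivative w.r.t. lam
        step = f / fp
        lam -= step
        if abs(step) < tol:
            break
    return 1.0 / (n * (1.0 + lam * g))

# Example: reweight a sample so its weighted mean is 2.5 instead of 3.0.
w = el_weights([1, 2, 3, 4, 5], mu=2.5)
```

The resulting weights are positive, sum to one, and tilt the sample toward its smaller values; a squashing procedure would then fit a model to the sample using these weights in place of uniform ones.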