CUDIA: Probabilistic cross-level imputation using individual auxiliary information

Authors:
Yubin Park; Joydeep Ghosh
Affiliations:
The University of Texas at Austin; The University of Texas at Austin
Venue:
ACM Transactions on Intelligent Systems and Technology (TIST) - Survey papers, special sections on the semantic adaptive social web, intelligent systems for health informatics, regular papers
Year:
2013

Abstract

In healthcare-related studies, individual patient or hospital data are often not publicly available due to privacy restrictions, legal issues, or reporting norms. However, such measures may be provided at a higher, more aggregated level, such as state-level or county-level summaries, or averages over health zones such as hospital referral regions (HRRs) or hospital service areas (HSAs). Such levels constitute partitions over the underlying individual-level data, and these partitions may not match the groupings that would have been obtained by clustering the data on individual-level attributes. Moreover, treating aggregated values as representative of the individuals can result in the ecological fallacy. How can one run data mining procedures on such data, where different variables are available at different levels of aggregation or granularity? In this article, we seek better utilization of variably aggregated datasets, which may be assembled from different sources. We propose a novel cross-level imputation technique that models the generative process of such datasets using a Bayesian directed graphical model. The imputation is based on the underlying data distribution and is shown to be unbiased. The imputed values can then be used in subsequent predictive modeling, yielding improved accuracy. Experimental results on a simulated dataset and the Behavioral Risk Factor Surveillance System (BRFSS) dataset illustrate the generality and capabilities of the proposed framework.