A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data

Authors:
Yubin Park;Joydeep Ghosh
Affiliations:
The University of Texas at Austin, Austin, TX, USA;The University of Texas at Austin, Austin, TX, USA
Venue:
Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium
Year:
2012

Abstract

In healthcare-related studies, individual patient or hospital data are often not publicly available due to privacy restrictions, legal issues, or reporting norms. However, such measures may be provided at a higher, more aggregated level, such as state-level or county-level summaries, or averages over health zones such as Hospital Referral Regions (HRR) or Hospital Service Areas (HSA). Such levels constitute partitions of the underlying individual-level data, which may not match the data segments that would have been obtained by clustering the individual-level data. Treating these aggregated values as representative of the individuals can result in the ecological fallacy. How can one run data mining procedures on such data, where different variables are available at different levels of aggregation or granularity? We examine this problem in a clustering setting, given a mix of individual-level and (arbitrarily) aggregated-level data. For this setting, a generative process for such data is constructed using a Bayesian directed graphical model. This model is further developed to capture the properties of the aggregated-level data using the Central Limit Theorem. The model provides reasonable cluster centroids under certain conditions, and is extended to estimate the masked individual values behind the aggregated data. The model parameters are learned using an approximate Gibbs sampling method that makes efficient use of the Metropolis-Hastings algorithm. A deterministic approximation algorithm, derived from the model, scales up to massive datasets. Furthermore, the imputed features can help improve performance in subsequent predictive modeling tasks. Experimental results using data from the Dartmouth Health Atlas, the CDC, and the U.S. Census Bureau illustrate the generality and capabilities of the proposed framework.
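
The role of the Central Limit Theorem for aggregated data can be illustrated with a small simulation. The sketch below is not the paper's model: it assumes a hypothetical two-cluster Gaussian mixture for a single variable, groups individuals into equally sized regions, and keeps only each region's average. All names and parameter values (`simulate_region_means`, `mixture_moments`, the weights, means, and standard deviations) are assumptions chosen for illustration. Under these assumptions, the region averages should be roughly Gaussian around the mixture mean, with variance close to the mixture variance divided by the region size, and they need not resemble either cluster centroid, which is the situation in which the ecological fallacy can arise.

```python
import random

def simulate_region_means(num_regions, region_size, weights, means, sds, seed=0):
    """Draw individuals from a Gaussian mixture; return one average per region."""
    rng = random.Random(seed)
    cluster_ids = range(len(weights))
    region_means = []
    for _ in range(num_regions):
        values = []
        for _ in range(region_size):
            # Hidden cluster membership of one individual (assumed mixture).
            k = rng.choices(cluster_ids, weights=weights)[0]
            values.append(rng.gauss(means[k], sds[k]))
        # Only this average is "published"; individual values stay masked.
        region_means.append(sum(values) / len(values))
    return region_means

def mixture_moments(weights, means, sds):
    """Mean and variance of a single individual drawn from the mixture."""
    mean = sum(w * m for w, m in zip(weights, means))
    second = sum(w * (s * s + m * m) for w, m, s in zip(weights, means, sds))
    return mean, second - mean * mean

def population_variance(xs):
    """Plain population variance, written out to avoid extra dependencies."""
    center = sum(xs) / len(xs)
    return sum((x - center) ** 2 for x in xs) / len(xs)

# Hypothetical settings: two clusters with centroids 0 and 4.
weights, means, sds = [0.3, 0.7], [0.0, 4.0], [1.0, 1.0]
region_size = 200

region_means = simulate_region_means(500, region_size, weights, means, sds)
mix_mean, mix_var = mixture_moments(weights, means, sds)

# CLT approximation: region average ~ Normal(mix_mean, mix_var / region_size).
clt_var = mix_var / region_size
empirical_mean = sum(region_means) / len(region_means)
empirical_var = population_variance(region_means)

print(f"mixture mean {mix_mean:.3f}, mean of region averages {empirical_mean:.3f}")
print(f"CLT variance {clt_var:.5f}, variance of region averages {empirical_var:.5f}")
```

In this toy setting the aggregated values concentrate between the two centroids, so using them as stand-ins for individuals would misrepresent both clusters. Recovering cluster structure from such averages requires modeling the aggregation step explicitly, which is the kind of problem the abstract describes.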