A generative framework for predictive modeling using variably aggregated, multi-source healthcare data

Authors:
Yubin Park; Joydeep Ghosh
Affiliations:
University of Texas at Austin, Austin, TX, USA; University of Texas at Austin, Austin, TX, USA
Venue:
Proceedings of the 2011 workshop on Data mining for medicine and healthcare
Year:
2011



Abstract

Many measures of healthcare delivery or quality are not publicly available at the individual patient or hospital level, largely because of privacy restrictions, legal issues, or reporting norms. Instead, such measures are released at a more aggregated level, such as state-level or county-level summaries, or averages over health zones such as Hospital Referral Regions (HRRs) and Hospital Service Areas (HSAs). These levels partition the underlying individual-level data into segments that may not match the clusters one would obtain by analyzing the individual-level data directly. Moreover, different data sources may base their summaries on different underlying partitions. How can one run data mining procedures such as clustering or regression when different variables are available at different levels of aggregation or granularity? We first examine this problem in a clustering setting, given a mix of individual-level and (arbitrarily) aggregated data. For this setting, we present an extension of the Latent Dirichlet Allocation model that can use such aggregated information. The model provides reasonable cluster centroids under certain conditions, and it is extended to impute masked features at the individual level. The imputed feature values are based on an underlying mixture distribution and help improve performance in subsequent predictive modeling tasks. The model parameters are learned using an approximate Gibbs sampling method that makes efficient use of the Metropolis-Hastings algorithm. Experimental results using data from the Dartmouth Health Atlas, the CDC, and the U.S. Census Bureau illustrate the generality and capabilities of the proposed framework.
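
The cross-level setting described in the abstract can be illustrated with a small synthetic sketch. This is not the authors' model: it replaces the Latent Dirichlet Allocation extension and its Gibbs/Metropolis-Hastings inference with a much simpler stand-in (hard clustering on an individual-level feature, followed by ordinary least squares on the region-level averages). All function names, variable names, and data below are hypothetical. The sketch only shows why cluster centroids of a masked feature can be recovered from aggregates when regions mix the latent clusters in different proportions, which is one plausible reading of "under certain conditions".

```python
import numpy as np

def kmeans_1d(x, k, n_iter=25):
    """Plain Lloyd's algorithm on one feature (a stand-in for the paper's clustering)."""
    # Start the centers at evenly spaced quantiles (min and max when k == 2).
    centers = np.quantile(x, np.linspace(0.0, 1.0, k))
    for _ in range(n_iter):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean()
    return labels

def impute_masked_feature(x, region, y_region_mean, k=2):
    """Estimate cluster centroids of a masked feature from region-level averages."""
    labels = kmeans_1d(x, k)
    n_regions = y_region_mean.shape[0]
    # P[r, j] = fraction of individuals in region r assigned to cluster j.
    counts = np.zeros((n_regions, k))
    np.add.at(counts, (region, labels), 1.0)
    P = counts / counts.sum(axis=1, keepdims=True)
    # Each region average is a mixture of the centroids: P @ mu ~= y_region_mean.
    mu, *_ = np.linalg.lstsq(P, y_region_mean, rcond=None)
    # Impute each individual's masked value with the centroid of its cluster.
    return mu[labels], mu, P

# Synthetic data (assumed for illustration only): 6 regions, 2 latent clusters.
rng = np.random.default_rng(0)
n_regions, n_per_region = 6, 400
region = np.repeat(np.arange(n_regions), n_per_region)
# Regions contain the two clusters in different proportions.
p_cluster1 = np.linspace(0.1, 0.9, n_regions)[region]
z = (rng.random(region.size) < p_cluster1).astype(int)
x = rng.normal(np.where(z == 1, 5.0, 0.0), 1.0)    # observed per individual
y = rng.normal(np.where(z == 1, 30.0, 10.0), 2.0)  # masked per individual
# Only the region-level averages of y are "published".
y_region_mean = np.bincount(region, weights=y) / np.bincount(region)

y_hat, mu, P = impute_masked_feature(x, region, y_region_mean)

rmse_cluster = np.sqrt(np.mean((y_hat - y) ** 2))
rmse_region = np.sqrt(np.mean((y_region_mean[region] - y) ** 2))
print("estimated centroids of the masked feature:", mu)
print("RMSE, cluster-based imputation:", rmse_cluster)
print("RMSE, region-average imputation:", rmse_region)
```

If every region mixed the clusters in the same proportions, the rows of `P` would be identical and the least-squares system would not identify the centroids. The probabilistic model in the paper additionally accounts for uncertainty in the cluster assignments, which this sketch ignores.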