A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data

  • Authors: Yubin Park; Joydeep Ghosh

  • Affiliations: The University of Texas at Austin, Austin, TX, USA (both authors)

  • Venue: Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium
  • Year: 2012

Abstract

In healthcare-related studies, individual patient or hospital data are often not publicly available due to privacy restrictions, legal issues, or reporting norms. However, such measures may be provided at a higher, more aggregated level, such as state-level or county-level summaries, or averages over health zones (HRR or HSA). Such levels constitute partitions of the underlying individual-level data, which may not match the data segments that would have been obtained if one clustered the individual-level data. Treating these aggregated values as representatives for the individuals can result in the ecological fallacy. How can one run data mining procedures on such data, where different variables are available at different levels of aggregation or granularity? We examine this problem in a clustering setting, given a mix of individual-level and (arbitrarily) aggregated-level data. For this setting, a generative process for such data is constructed using a Bayesian directed graphical model. This model is further developed to capture the properties of the aggregated-level data using the Central Limit Theorem. The model provides reasonable cluster centroids under certain conditions, and is extended to estimate the masked individual values for the aggregated data. The model parameters are learned using an approximated Gibbs sampling method, which makes efficient use of the Metropolis-Hastings algorithm. A deterministic approximation algorithm, which scales up to massive datasets, is derived from the model. Furthermore, the imputed features can help to improve performance in subsequent predictive modeling tasks. Experimental results using data from the Dartmouth Health Atlas, the CDC, and the U.S. Census Bureau are provided to illustrate the generality and capabilities of the proposed framework.
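
The role of the Central Limit Theorem mentioned in the abstract can be illustrated with a small simulation. This is a minimal sketch and not the authors' model: it assumes a one-dimensional, two-cluster Gaussian mixture for the individual-level values, and all names and parameter values (`mixture_moments`, `simulate_aggregates`, the weights, means, and group size) are hypothetical choices made for illustration. Under these assumptions, the average over a group of n individuals is approximately normal, with mean equal to the mixture mean and variance equal to the mixture variance divided by n.

```python
import random
import statistics


def mixture_moments(weights, means, sds):
    """Return the mean and variance of a 1-D Gaussian mixture."""
    mean = sum(w * m for w, m in zip(weights, means))
    second_moment = sum(w * (s * s + m * m) for w, m, s in zip(weights, means, sds))
    return mean, second_moment - mean * mean


def sample_individual(rng, weights, means, sds):
    """Draw one individual-level value: pick a cluster, then a Gaussian draw."""
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return rng.gauss(means[k], sds[k])


def simulate_aggregates(rng, weights, means, sds, group_size, n_groups):
    """Return one group average per group (e.g. one value per county)."""
    return [
        statistics.fmean(
            sample_individual(rng, weights, means, sds) for _ in range(group_size)
        )
        for _ in range(n_groups)
    ]


# Hypothetical setup: two clusters with centroids at 0 and 10.
rng = random.Random(0)
weights, means, sds = [0.7, 0.3], [0.0, 10.0], [1.0, 1.0]
group_size, n_groups = 200, 1000

mix_mean, mix_var = mixture_moments(weights, means, sds)
clt_sd = (mix_var / group_size) ** 0.5  # CLT prediction for the sd of a group average

aggregates = simulate_aggregates(rng, weights, means, sds, group_size, n_groups)
emp_mean = statistics.fmean(aggregates)
emp_sd = statistics.stdev(aggregates)

print(f"mixture mean {mix_mean:.3f}, CLT sd {clt_sd:.3f}")
print(f"empirical mean {emp_mean:.3f}, empirical sd {emp_sd:.3f}")
```

In this setup the group averages concentrate around the mixture mean, which lies between the two cluster centroids and is typical of neither cluster. Assigning that average to each individual in the group is the kind of substitution the abstract warns can lead to the ecological fallacy.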