CUDIA: Probabilistic cross-level imputation using individual auxiliary information
ACM Transactions on Intelligent Systems and Technology (TIST) - Survey papers, special sections on the semantic adaptive social web, intelligent systems for health informatics, regular papers
In healthcare-related studies, individual patient or hospital data are often not publicly available due to privacy restrictions, legal issues or reporting norms. However, such measures may be provided at a higher, more aggregated level, such as state-level or county-level summaries, or averages over health zones such as hospital referral regions (HRRs) or hospital service areas (HSAs). Such levels constitute partitions of the underlying individual-level data, which may not match the data segments that would have been obtained if one clustered the individual-level data. Treating these aggregated values as representatives for the individuals can result in the ecological fallacy. How can one run data mining procedures on such data, where different variables are available at different levels of aggregation or granularity? We examine this problem in a clustering setting given a mix of individual-level and (arbitrarily) aggregated-level data. For this setting, a generative process for such data is constructed using a Bayesian directed graphical model. This model is further developed to capture the properties of the aggregated-level data using the Central Limit Theorem. The model provides reasonable cluster centroids under certain conditions, and is extended to estimate the masked individual values for the aggregated data. The model parameters are learned using an approximate Gibbs sampling method that makes efficient use of the Metropolis-Hastings algorithm. A deterministic approximation algorithm, which scales up to massive datasets, is derived from the model. Furthermore, the imputed features can help improve performance in subsequent predictive modeling tasks. Experimental results using data from the Dartmouth Health Atlas, CDC, and the U.S. Census Bureau are provided to illustrate the generality and capabilities of the proposed framework.
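The Central Limit Theorem property that the abstract relies on can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' model: the exponential feature, the sample size of 20,000, the group size of 50, and the helper `aggregate_means` are all hypothetical choices made for illustration. It shows that averages over groups of individuals are far less dispersed than the individuals themselves, with variance shrinking roughly in proportion to the group size, which is one reason aggregated values are poor stand-ins for individual values.

```python
import random
import statistics


def aggregate_means(values, group_size):
    """Average consecutive, non-overlapping groups of `group_size` values."""
    return [
        statistics.fmean(values[i:i + group_size])
        for i in range(0, len(values) - group_size + 1, group_size)
    ]


# Hypothetical individual-level feature: skewed (exponential) draws,
# so the individual values are clearly non-Gaussian.
rng = random.Random(0)
individuals = [rng.expovariate(1.0) for _ in range(20_000)]

# Hypothetical aggregation: an arbitrary partition into groups of 50,
# standing in for county- or zone-level averages.
group_size = 50
group_means = aggregate_means(individuals, group_size)

individual_var = statistics.pvariance(individuals)
aggregate_var = statistics.pvariance(group_means)

# The CLT suggests this ratio should be close to group_size.
ratio = individual_var / aggregate_var
print(f"individual variance: {individual_var:.3f}")
print(f"variance of group means: {aggregate_var:.4f}")
print(f"ratio: {ratio:.1f} (group size = {group_size})")
```

In this toy setting the group means carry the cluster-level mean but almost none of the individual-level spread, which is the gap the imputation step described above is meant to address.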