Hypertrophic cardiomyopathy (HCM), an inherited heart disease, is the most common cause of sudden cardiac death in young athletes. Diagnosing mild HCM is a major medical challenge, especially in athletes whose exercise-induced hypertrophy overlaps with HCM. The difficulty stems from a wide spectrum of non-specific clinical parameters and their complex dependencies. Recently, medical researchers have proposed multidisciplinary strategies that define differential diagnostic scoring algorithms, with the goal of identifying which parameters correlate with HCM and so achieving faster and more accurate diagnosis. These algorithms require extensive testing against large medical datasets to identify potential correlations and to assess overall algorithmic quality and diagnostic accuracy. We present a prototype data-parallel algorithm for improving the diagnosis of mild HCM by refining the set of parameters that contribute to the main diagnostic function. To this end, we employ a rule-based machine-learning approach and develop an iterative MapReduce application that applies the diagnostic function to large datasets. The core component of the algorithm, including the diagnostic function, has been implemented in Java, Pig and Hive in order to identify the potential productivity gains of using a high-level MapReduce language specifically for medical applications. Finally, we assess the algorithmic performance on up to 64 cores of our Beowulf cluster running Hadoop 0.20.1, achieving near-linear speedups while reducing the overall runtime from over 9 hours to a couple of minutes for a realistic dataset of 10,000 medical records.
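As a rough illustration of the iterative approach described above, the following minimal Python sketch runs a rule-based scoring function as a map step over records, sums the results in a reduce step, and repeats the pass while pruning rules. It is not the authors' implementation: the rule set, the parameter names (`param_a`, `param_b`, `param_c`), the thresholds, the `CUTOFF` value, and the pruning criterion are all hypothetical, and the map and reduce steps run in-process rather than on Hadoop.

```python
from collections import defaultdict

# Hypothetical rule set: (parameter, threshold, points). Illustrative only,
# not the clinical parameters or weights used in the paper.
RULES = [("param_a", 13.0, 2), ("param_b", 0.5, 1), ("param_c", 100.0, 1)]
CUTOFF = 2  # hypothetical score at which a record is flagged as positive


def score(record, rules):
    """Diagnostic function: sum the points of every rule the record triggers."""
    return sum(points for name, threshold, points in rules
               if record.get(name, 0.0) >= threshold)


def map_record(record, rules):
    """Map step: emit (prediction matched the label?, 1) for one record."""
    predicted = score(record, rules) >= CUTOFF
    yield (predicted == record["label"], 1)


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def accuracy(records, rules):
    """One map/reduce pass: fraction of records classified correctly."""
    pairs = [kv for record in records for kv in map_record(record, rules)]
    return reduce_counts(pairs).get(True, 0) / len(records)


def refine(records, rules):
    """Assumed refinement loop: drop a rule whenever accuracy does not fall."""
    current = list(rules)
    improved = True
    while improved and len(current) > 1:
        improved = False
        base = accuracy(records, current)
        for rule in current:
            candidate = [r for r in current if r != rule]
            if accuracy(records, candidate) >= base:
                current, improved = candidate, True
                break  # restart the pass with the reduced rule set
    return current
```

Each call to `accuracy` corresponds to one full pass over the data, which is the part that a MapReduce framework would distribute across cores. The refinement loop around it stays sequential, since each iteration depends on the result of the previous pass.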