A case-study on learning from large-scale intracranial EEG data using multi-core machines and clusters

  • Authors: Haimonti Dutta; Huascar Fiorletta; Manoj Pooleery; Hatim Diab; Stanley German; David Waltz; Catherine A. Schevon

  • Affiliations: Columbia University, New York, NY (Dutta, Fiorletta, Pooleery, Diab, German, Waltz); The Columbia University Medical School (CUMC), Columbia University, New York, NY (Schevon)

  • Venue: Proceedings of the Third Workshop on Large Scale Data Mining: Theory and Applications
  • Year: 2011

Abstract

Epilepsy is a chronic neurological disorder characterized by recurrent, unprovoked seizures that manifest in a variety of ways, including emotional or behavioral disturbances, convulsive movements, and loss of awareness. Predicting epileptic seizures is hard, and most algorithms perform no better than a random predictor [20]. An important reason why studies so far have been less than successful is that the electroencephalogram (EEG) is not recorded at the granularity of the seizure generation process. Our collaborators at the Columbia University Medical School (CUMC) have been involved in a clinical trial that entails implanting a Micro-Electrode Array directly into the neocortex of epilepsy patients undergoing surgery to remove the portion of the brain where seizures originate. The 96-contact grid allows researchers to record at 30 kHz per channel, a very high resolution compared to known state-of-the-art techniques, and yields both local field and action potential data (0.5 TB per patient per day). This large volume of data poses challenges for knowledge discovery and mining. In this paper, we describe the steps required to process the EEG signal and extract features, and we present a parallel design for scaling up processing on multi-core machines and an in-house cluster. Initial benchmarking results indicate that approximately six cores of a machine (2.7 GHz processor, 32 GB RAM, moderate workload) are sufficient to process a 5-minute chunk of data from 96 channels in approximately 12 minutes. Encouraged by these results, we discuss the design of other machine learning algorithms for learning from the data.
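
The figures quoted in the abstract can be cross-checked with a short back-of-envelope calculation. The sketch below is illustrative only and is not taken from the paper. It assumes 16-bit (2-byte) samples, which the abstract does not state, and it assumes processing time scales linearly with the number of cores, which real workloads rarely achieve exactly. All function and constant names are hypothetical.

```python
import math

# Figures stated in the abstract.
CHANNELS = 96             # contacts on the micro-electrode array
SAMPLE_RATE_HZ = 30_000   # 30 kHz per channel
CHUNK_MINUTES = 5         # length of one chunk of recorded data
PROCESS_MINUTES = 12      # approximate time to process one chunk
CORES_USED = 6            # approximate cores used in the benchmark

# ASSUMPTION (not in the abstract): each sample is stored in 2 bytes.
BYTES_PER_SAMPLE = 2
SECONDS_PER_DAY = 86_400


def bytes_per_day(channels=CHANNELS, rate_hz=SAMPLE_RATE_HZ,
                  bytes_per_sample=BYTES_PER_SAMPLE):
    """Raw data volume for 24 hours of continuous recording."""
    return channels * rate_hz * bytes_per_sample * SECONDS_PER_DAY


def slowdown_factor(chunk_minutes=CHUNK_MINUTES,
                    process_minutes=PROCESS_MINUTES):
    """How many minutes of compute each minute of recording costs."""
    return process_minutes / chunk_minutes


def cores_for_real_time(cores_used=CORES_USED, chunk_minutes=CHUNK_MINUTES,
                        process_minutes=PROCESS_MINUTES):
    """Cores needed to keep pace with recording.

    ASSUMPTION: perfectly linear scaling with core count.
    """
    return math.ceil(cores_used * process_minutes / chunk_minutes)


if __name__ == "__main__":
    print(f"Data per day: {bytes_per_day() / 1e12:.3f} TB")
    print(f"Slowdown vs. real time: {slowdown_factor():.1f}x")
    print(f"Cores for real-time processing: {cores_for_real_time()}")
```

Under the 2-byte assumption, the daily volume comes to roughly 0.5 TB, consistent with the figure in the abstract. The benchmark also implies that six cores run about 2.4 times slower than the recording rate, so keeping pace with continuous acquisition would need more cores or a cluster, which fits the motivation for the parallel design.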