HIV-1 Drug Resistance Prediction and Therapy Optimization: A Case Study for the Application of Classification and Clustering Methods

  • Authors:
  • Michal Rosen-Zvi;Ehud Aharoni;Joachim Selbig

  • Affiliations:
  • IBM Research Laboratory in Haifa, Haifa University, Haifa, Israel 31905;IBM Research Laboratory in Haifa, Haifa University, Haifa, Israel 31905;Institute of Biochemistry and Biology, Max Planck Institute of Molecular Plant Physiology, University of Potsdam, Potsdam-Golm, Germany D-14476

  • Venue:
  • Similarity-Based Clustering
  • Year:
  • 2009

Abstract

This chapter reviews the challenges that machine-learning specialists face when assisting virologists by automatically predicting the outcome of an HIV therapy. Optimizing HIV therapies is crucial since the virus rapidly develops mutations to evade drug pressure. Modern anti-HIV regimens comprise multiple drugs in order to prevent, or at least delay, the development of resistance mutations. In recent years, large databases have been collected to allow the automatic analysis of relations between the virus genome, other clinical and demographic information, and the failure or success of a therapy. The EuResist integrated database (EID), collected from about 18,500 patients and 65,000 different therapies, is probably one of the largest clinical genomic databases. Only one third of the therapies in the EID contain therapy response information, and only 5% of the therapy records have response information as well as genotypic data. This leads to two specific challenges: (a) semi-supervised learning, a setting where many samples are available but only a small proportion of them are labeled, and (b) missing data. We review a solution for the first setting: a novel dimensionality reduction framework that binds information-theoretic considerations with geometrical constraints over the simplex. The framework is formulated to find an optimal low-dimensional geometric embedding of the simplex that preserves pairwise distances. This similarity-based clustering solution was tested on toy data and textual data. We show that this solution, although it outperforms other methods and provides good results on a small sample of the EuResist data, is impractical for the large EuResist dataset. In addition, we review a generative-discriminative prediction system that successfully overcomes the missing-value challenge.
Apart from a review of the EuResist project and related challenges, this chapter provides an overview of recent developments in the field of machine learning-based prediction methods for HIV drug resistance.