HIV-1 Drug Resistance Prediction and Therapy Optimization: A Case Study for the Application of Classification and Clustering Methods

Authors:
Michal Rosen-Zvi;Ehud Aharoni;Joachim Selbig
Affiliations:
IBM Research Laboratory in Haifa, Haifa University, Haifa, Israel 31905;IBM Research Laboratory in Haifa, Haifa University, Haifa, Israel 31905;Institute of Biochemistry and Biology, Max Planck Institute of Molecular Plant Physiology, University of Potsdam, Potsdam-Golm, Germany D-14476
Venue:
Similarity-Based Clustering
Year:
2009

Abstract

This chapter reviews the challenges machine-learning specialists face when trying to assist virologists by automatically predicting the outcome of an HIV therapy. Optimizing HIV therapies is crucial because the virus rapidly develops mutations to evade drug pressure. Modern anti-HIV regimens comprise multiple drugs in order to prevent, or at least delay, the development of resistance mutations. In recent years, large databases have been collected to allow the automatic analysis of relations between the virus genome, other clinical and demographic information, and the failure or success of a therapy. The EuResist integrated database (EID), collected from about 18,500 patients and 65,000 different therapies, is probably one of the largest clinical genomic databases. Only one-third of the therapies in the EID contain therapy response information, and only 5% of the therapy records have response information as well as genotypic data. This leads to two specific challenges: (a) semi-supervised learning, a setting where many samples are available but only a small proportion of them are labeled, and (b) missing data. We review a novel solution for the first setting: a dimensionality reduction framework that combines information-theoretic considerations with geometric constraints over the simplex. The framework is formulated to find an optimal low-dimensional geometric embedding of the simplex that preserves pairwise distances. This similarity-based clustering solution was tested on toy data and textual data. We show that this solution, although it outperforms other methods and provides good results on a small sample of the EuResist data, is impractical for the large EuResist dataset. In addition, we review a generative-discriminative prediction system that successfully overcomes the missing-value challenge.
Apart from a review of the EuResist project and related challenges, this chapter provides an overview of recent developments in the field of machine learning-based prediction methods for HIV drug resistance.
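
The idea of a low-dimensional embedding of simplex points that preserves pairwise distances can be illustrated with a minimal Python sketch. It is not the chapter's framework: the abstract names neither the distance nor the optimization, so the sketch assumes the square root of the Jensen-Shannon divergence as the distance and uses classical multidimensional scaling as a stand-in embedding method. The function names (`js_distance`, `pairwise_js`, `classical_mds`) and the synthetic Dirichlet data are hypothetical and chosen only for illustration.

```python
import numpy as np


def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base 2) between two
    probability vectors. Assumed distance; not named in the abstract."""
    m = 0.5 * (p + q)

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 wherever a > 0.
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return np.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding


def pairwise_js(points):
    """Symmetric matrix of pairwise distances between rows of `points`."""
    n = len(points)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = js_distance(points[i], points[j])
    return dist


def classical_mds(dist, dim=2):
    """Classical MDS (stand-in method): embed points in `dim` dimensions so
    that Euclidean distances approximate the entries of `dist`."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:dim]           # largest eigenvalues first
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale


# Synthetic example: 20 points drawn from the 4-dimensional simplex.
rng = np.random.default_rng(0)
points = rng.dirichlet(np.ones(5), size=20)
dist = pairwise_js(points)
embedding = classical_mds(dist, dim=2)
```

Each row of `embedding` is a two-dimensional coordinate for the corresponding probability vector. The dense distance matrix grows quadratically with the number of samples, which gives some intuition for why pairwise-distance methods can become impractical on datasets the size of the EID.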