Sequence-based prediction of HIV-1 coreceptor usage: utility of n-grams for representing gp120 V3 loops

  • Authors: Majid Masso

  • Affiliations: George Mason University, Manassas, Virginia

  • Venue: Proceedings of the 2nd ACM Conference on Bioinformatics, Computational Biology and Biomedicine
  • Year: 2011

Abstract

Human immunodeficiency virus type 1 (HIV-1) infects host cells that express both the CD4 surface membrane receptor, which binds the viral envelope glycoprotein gp120, and either the CCR5 (R5) or CXCR4 (X4) chemokine coreceptor, which interacts principally with the V3 loop region of gp120. Coreceptor selectivity, or tropism, depends on the sequence patterns encoding HIV-1 viral strains, and medications designed to bind and inhibit each coreceptor are currently on the market and in development. Because HIV-1 coreceptor usage must be determined before such a drug is administered, and because experimental assays for this purpose are costly and time-consuming, there is now considerable interest in applying machine learning algorithms directly to classify HIV-1 coreceptor usage from the V3 loop region of gp120. Here, for the first time, a number of n-gram approaches (an n-gram being a subsequence formed by a sliding window of size n) are described for representing as feature vectors two large datasets of V3 loop peptide sequences obtained from HIV-1 viruses with known coreceptor usage, and the random forest algorithm is implemented for classification. These datasets were previously retrieved and used to develop combined sequence-structure based classifiers and sequence-based string kernel classifiers, respectively. Comparing the accuracy reported for those complex classifiers with the performance achieved here using simpler and more computationally efficient n-grams reveals significant advantages while also highlighting limitations.
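
The n-gram representation mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the choice of raw counts as feature values, the 20-letter amino acid alphabet, and the example peptide are all assumptions made for illustration. The sketch slides a window of size n along a peptide and counts each possible n-gram in a fixed vocabulary order, giving a fixed-length feature vector.

```python
from collections import Counter
from itertools import product

# Assumption: the 20 standard amino acids form the alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ngrams(sequence, n):
    """Return all overlapping length-n subsequences (window step of 1)."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def ngram_feature_vector(sequence, n, alphabet=AMINO_ACIDS):
    """Count every possible n-gram over the alphabet, in a fixed order.

    The vector has len(alphabet) ** n entries; n-grams containing symbols
    outside the alphabet (e.g. gap characters) are simply not counted.
    """
    counts = Counter(ngrams(sequence, n))
    vocabulary = ["".join(p) for p in product(alphabet, repeat=n)]
    return [counts[gram] for gram in vocabulary]


# Hypothetical 35-residue V3-like peptide, used only as an illustration;
# it is not taken from the paper's datasets.
example = "CTRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAHC"
bigram_vector = ngram_feature_vector(example, 2)
```

Vectors built this way, one per sequence, could then be passed together with the known R5/X4 labels to any random forest implementation (for example, scikit-learn's `RandomForestClassifier`). Note that the vocabulary grows as 20^n, so larger n gives longer and sparser vectors.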