Is the Protein Model Assignment problem under linked branch lengths NP-hard?

  • Authors:
  • Kassian Kobert;Jörg Hauser;Alexandros Stamatakis

  • Affiliations:
  • Heidelberg Institute for Theoretical Studies, Germany;Heidelberg Institute for Theoretical Studies, Germany;Heidelberg Institute for Theoretical Studies, Germany and Karlsruhe Institute of Technology, Institute for Theoretical Informatics, Postfach 6980, 76128 Karlsruhe, Germany

  • Venue:
  • Theoretical Computer Science
  • Year:
  • 2014

Abstract

In phylogenetics, computing the likelihood that a given tree generated the observed sequence data requires calculating the probability of the available data for a given tree (topology and branch lengths) under a statistical model of sequence evolution. Here, we focus on selecting an appropriate model for the data, which is generally a non-trivial task. The data is represented as a so-called multiple sequence alignment. That is, the individual sequence of each species (taxon) is arranged (aligned) in such a way that the characters of all species at a given position (site) are assumed to share a common evolutionary history. It is well known that an inappropriate model, which does not fit the data, can generate misleading tree topologies [3,4,26]. More specifically, we consider the case of partitioned protein sequence alignments. This means that the sites of the alignment may be clustered together into different partitions. Each partition may have an individual model of evolution. Our objective is to maximize the likelihood of the per-partition protein model assignments (e.g., JTT, WAG, etc.) when branches are linked across partitions on a given, fixed tree topology. That is, branch lengths are not estimated individually for each partition. Linked branch lengths across partitions substantially reduce the number of free parameters. For p partitions and |M| possible substitution models, there are |M|^p possible model assignments. Since the number of combinations grows exponentially with p, an exhaustive search for the highest scoring assignment is computationally prohibitive for |M| > 1. We show that the problem of finding the optimal protein substitution model assignment under linked branch lengths on a given tree topology is NP-hard. Our results imply that one should employ heuristics to approximate the solution, instead of striving for the exact solution. Alternatively, the problem can be simplified by relaxing the assumptions.
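
To make the |M|^p growth concrete, the following is a minimal Python sketch of the exhaustive search that the abstract describes as prohibitive. It is an illustration only, not the authors' method: the function names and the `score` callback are hypothetical, and `score` stands in for a likelihood evaluation in which the shared branch lengths are optimized jointly for the whole assignment.

```python
from itertools import product


def count_assignments(num_models, num_partitions):
    """Number of ways to give each of p partitions one of |M| models."""
    return num_models ** num_partitions


def exhaustive_search(models, num_partitions, score):
    """Try every per-partition model assignment and keep the best one.

    `score` is a hypothetical placeholder: it takes a tuple with one model
    name per partition and returns a number (higher is better). With linked
    branch lengths, it has to evaluate the whole assignment at once, so the
    partitions cannot simply be scored independently.
    """
    best_assignment, best_score = None, float("-inf")
    for assignment in product(models, repeat=num_partitions):
        current = score(assignment)
        if current > best_score:
            best_assignment, best_score = assignment, current
    return best_assignment, best_score


if __name__ == "__main__":
    # Toy scoring function, used only to exercise the search loop.
    def toy_score(assignment):
        return assignment.count("WAG")

    print(exhaustive_search(["JTT", "WAG"], 3, toy_score))
    print(count_assignments(20, 10))
```

The loop calls `score` once per assignment, that is, |M|^p times. With illustrative values of 20 candidate models and 10 partitions this is already 20^10, roughly 10^13 evaluations, which is why the abstract recommends heuristics.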