Is the Protein Model Assignment problem under linked branch lengths NP-hard?

Authors:
Kassian Kobert;Jörg Hauser;Alexandros Stamatakis
Affiliations:
Heidelberg Institute for Theoretical Studies, Germany;Heidelberg Institute for Theoretical Studies, Germany;Heidelberg Institute for Theoretical Studies, Germany and Karlsruhe Institute of Technology, Institute for Theoretical Informatics, Postfach 6980, 76128 Karlsruhe, Germany
Venue:
Theoretical Computer Science
Year:
2014

References:
The complexity of satisfiability problems. STOC '78 Proceedings of the tenth annual ACM symposium on Theory of computing
The complexity of theorem-proving procedures. STOC '71 Proceedings of the third annual ACM symposium on Theory of computing
A Short Proof that Phylogenetic Tree Reconstruction by Maximum Likelihood Is Hard. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Finding a maximum likelihood tree is hard. Journal of the ACM (JACM)
RAxML-Light. Bioinformatics

Abstract

In phylogenetics, computing the likelihood that a given tree generated the observed sequence data requires calculating the probability of the data for that tree (topology and branch lengths) under a statistical model of sequence evolution. Here, we focus on selecting an appropriate model for the data, which is generally a non-trivial task. The data is represented as a so-called multiple sequence alignment. That is, the sequence of each species (taxon) is arranged (aligned) in such a way that the characters of all species at a given position (site) are assumed to share a common evolutionary history. It is well known that an inappropriate model, which does not fit the data, can generate misleading tree topologies [3,4,26]. More specifically, we consider the case of partitioned protein sequence alignments. This means that the sites of the alignment may be clustered into different partitions, and each partition may have an individual model of evolution. Our objective is to find the per-partition protein model assignment (e.g., JTT, WAG, etc.) that maximizes the likelihood when branches are linked across partitions on a given, fixed tree topology. That is, branch lengths are not estimated individually for each partition. Linked branch lengths across partitions substantially reduce the number of free parameters. For p partitions and |M| possible substitution models, there are |M|^p possible model assignments. Since the number of combinations grows exponentially with p, an exhaustive search for the highest scoring assignment is computationally prohibitive for |M| > 1. We show that the problem of finding the optimal protein substitution model assignment under linked branch lengths on a given tree topology is NP-hard. Our results imply that one should employ heuristics to approximate the solution, instead of striving for the exact solution. Alternatively, the problem can be simplified by relaxing the assumptions.
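To make the |M|^p growth concrete, the following minimal Python sketch counts the possible assignments and runs a brute-force search over them. It is an illustration only, not the method of the paper: the function names, the three-model list, and in particular `toy_score` are hypothetical. `toy_score` is an invented stand-in for the log-likelihood under linked branch lengths, with a coupling term added only to mimic the fact that partitions cannot be scored independently of one another.

```python
from itertools import product


def count_assignments(num_models: int, num_partitions: int) -> int:
    """Number of per-partition model assignments: |M|^p."""
    return num_models ** num_partitions


def exhaustive_search(models, num_partitions, score):
    """Try every assignment and return the highest-scoring one.

    `score` is a caller-supplied function (hypothetical here) that maps a
    tuple of model names, one per partition, to a number to maximize.
    """
    best_assignment, best_score = None, float("-inf")
    for assignment in product(models, repeat=num_partitions):
        current = score(assignment)
        if current > best_score:
            best_assignment, best_score = assignment, current
    return best_assignment, best_score


def toy_score(assignment):
    """Invented stand-in for a log-likelihood; NOT a real likelihood.

    The first term scores each partition on its own. The second term
    penalizes neighboring partitions that use different models, so the
    total does not decompose into independent per-partition choices.
    """
    per_model = {"JTT": -10.0, "WAG": -9.0, "LG": -9.5}  # made-up values
    independent = sum(per_model[m] for m in assignment)
    coupling = sum(1.0 for a, b in zip(assignment, assignment[1:]) if a != b)
    return independent - coupling


if __name__ == "__main__":
    models = ["JTT", "WAG", "LG"]
    print(count_assignments(len(models), 3))    # 3^3 assignments
    print(exhaustive_search(models, 3, toy_score))
    print(count_assignments(20, 10))            # 20 models, 10 partitions
```

With 3 models and 3 partitions the loop visits only 27 assignments, but with, say, 20 candidate models and 10 partitions the count already reaches 20^10, which is why the brute-force loop above is useful only for very small instances.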