A sum-over-paths extension of edit distances accounting for all sequence alignments

Authors:
Silvia García-Díez;François Fouss;Masashi Shimbo;Marco Saerens
Affiliations:
Université de Louvain, ISYS, LSM, Louvain-la-Neuve, Belgium;Facultés Universitaires Catholiques de Mons, Management Science Department, LSM, Belgium;Graduate School of Information Science, Nara Institute of Science and Technology, Japan;Université de Louvain, ISYS, LSM, Louvain-la-Neuve, Belgium
Venue:
Pattern Recognition
Year:
2011

Abstract

This paper introduces a simple Sum-over-Paths (SoP) formulation of string edit distances accounting for all possible alignments between two sequences, and extends related previous work from bioinformatics to the case of graphs with cycles. Each alignment $\wp$, with a total cost $C(\wp)$, is assigned a probability of occurrence $P(\wp) = \exp[-\theta C(\wp)]/Z$, where $Z$ is a normalization factor. Good alignments (having a low cost) are therefore favored over bad alignments (having a high cost). The expected cost $\sum_{\wp \in \mathcal{P}} C(\wp) \exp[-\theta C(\wp)]/Z$, computed over all possible alignments $\wp \in \mathcal{P}$, defines the SoP edit distance. When $\theta \to \infty$, only the best alignments matter and the measure reduces to the standard edit distance. The rationale behind this definition is the following: for some applications, two sequences sharing many good alignments should be considered more similar than two sequences having only a single good, optimal alignment in common. In other words, sub-optimal alignments can also be taken into account. Forward/backward recurrences that compute the expected cost efficiently are developed. Virtually any Viterbi-like sequence comparison algorithm computed on a lattice can be generalized in the same way; for instance, a SoP longest common subsequence is also developed. Pattern classification tasks performed on five data sets show that the new measures usually outperform the standard ones and, in any case, never perform significantly worse, at the expense of tuning the parameter $\theta$.
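
The expected-cost definition in the abstract can be illustrated with a minimal sketch. It is not the paper's forward/backward algorithm: it uses a single forward pass over the edit lattice that accumulates both the normalization factor and the cost-weighted sum. The function name `sop_edit_distance`, the unit insertion/deletion/substitution costs, and the absence of any reference weighting of alignments are assumptions made for illustration only.

```python
import math


def sop_edit_distance(s, t, theta, indel=1.0, sub=1.0):
    """Expected alignment cost under P(path) = exp(-theta * C(path)) / Z.

    Illustrative sketch (assumed cost model): insertions and deletions
    cost `indel`, substitutions cost `sub`, matches cost 0.
    Z[i][j]  = sum of exp(-theta * C) over all partial alignments of
               s[:i] with t[:j].
    Nm[i][j] = sum of C * exp(-theta * C) over the same alignments.
    """
    n, m = len(s), len(t)
    w_indel = math.exp(-theta * indel)
    Z = [[0.0] * (m + 1) for _ in range(n + 1)]
    Nm = [[0.0] * (m + 1) for _ in range(n + 1)]
    Z[0][0] = 1.0  # the empty alignment has cost 0 and weight 1

    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            z = 0.0
            num = 0.0
            if i > 0:  # deletion of s[i-1]
                z += Z[i - 1][j] * w_indel
                num += (Nm[i - 1][j] + indel * Z[i - 1][j]) * w_indel
            if j > 0:  # insertion of t[j-1]
                z += Z[i][j - 1] * w_indel
                num += (Nm[i][j - 1] + indel * Z[i][j - 1]) * w_indel
            if i > 0 and j > 0:  # match or substitution
                c = 0.0 if s[i - 1] == t[j - 1] else sub
                w = math.exp(-theta * c)
                z += Z[i - 1][j - 1] * w
                num += (Nm[i - 1][j - 1] + c * Z[i - 1][j - 1]) * w
            Z[i][j] = z
            Nm[i][j] = num

    # Expected cost = (sum of C * weight) / (sum of weight)
    return Nm[n][m] / Z[n][m]


if __name__ == "__main__":
    for theta in (0.5, 2.0, 30.0):
        print(theta, sop_edit_distance("kitten", "sitting", theta))
```

For large $\theta$ the value approaches the standard edit distance (3 for "kitten" and "sitting" under unit costs), while for small $\theta$ sub-optimal alignments contribute and the expected cost is larger. This direct implementation can underflow for long sequences or very large $\theta$; working in the log domain would be one way to address that.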