A Stochastic Approach to Median String Computation

Authors:
Cristian Olivares-Rodríguez;Jose Oncina
Affiliations:
Departamento de Lenguajes y Sistemas Informáticos, Universidad de Alicante; Departamento de Lenguajes y Sistemas Informáticos, Universidad de Alicante
Venue:
SSPR & SPR '08 Proceedings of the 2008 Joint IAPR International Workshop on Structural, Syntactic, and Statistical Pattern Recognition
Year:
2008

Abstract

Because it is robust to outliers, the median is used by many Pattern Recognition algorithms as the representative of a set of points. A special case arises in Syntactical Pattern Recognition, where the points (prototypes) are represented by strings. Under the edit distance, however, finding the median string is an NP-hard problem, so either the search is restricted to the strings in the data (the set median) or some heuristic approach is applied. In this work we use the (conditional) stochastic edit distance instead of the plain edit distance. It is not yet known whether the problem is also NP-hard in this case, so an approximation algorithm is described. The algorithm extends the string structure to multistrings (strings of stochastic vectors in which each element gives the probability of each symbol), which allows the Expectation Maximization technique to be used. We carry out experiments on a chromosome corpus to check the efficiency of the algorithm.
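The two notions the abstract relies on, the set median and the multistring, can be illustrated with a minimal sketch. It is not the authors' algorithm: it uses the plain unit-cost edit distance instead of the stochastic edit distance, it performs no Expectation Maximization step, and every function name (`edit_distance`, `set_median`, `to_multistring`, `most_probable_string`) is a hypothetical choice made for illustration.

```python
from typing import Dict, List, Sequence


def edit_distance(a: str, b: str) -> int:
    """Plain unit-cost edit (Levenshtein) distance, computed row by row."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string costs i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def set_median(strings: Sequence[str]) -> str:
    """Set median: the member of `strings` with the smallest summed distance
    to all members. The search is restricted to the data itself."""
    return min(strings, key=lambda s: sum(edit_distance(s, t) for t in strings))


def to_multistring(s: str, alphabet: str) -> List[Dict[str, float]]:
    """Assumed multistring encoding: one probability vector per position.
    An ordinary string maps to one-hot vectors."""
    return [{sym: (1.0 if sym == c else 0.0) for sym in alphabet} for c in s]


def most_probable_string(multistring: List[Dict[str, float]]) -> str:
    """Read an ordinary string back by taking the most probable symbol
    at each position."""
    return "".join(max(vec, key=vec.get) for vec in multistring)


if __name__ == "__main__":
    data = ["abc", "abcd", "ab", "xyz"]
    print(set_median(data))
    print(most_probable_string(to_multistring("ab", "abc")))
```

The set median costs a quadratic number of distance computations in the size of the set, and Python's `min` breaks ties by returning the first minimal element. The one-hot encoding only shows how an ordinary string sits inside the larger multistring space. An approach like the one described above would let the position-wise probabilities take fractional values.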