Learning String-Edit Distance

Authors:
Eric Sven Ristad; Peter N. Yianilos
Affiliations:
Mnemonic Technology, Inc., Princeton, NJ; NEC Research Institute, Princeton, NJ
Venue:
IEEE Transactions on Pattern Analysis and Machine Intelligence
Year:
1998

Abstract

In many applications, it is necessary to determine the similarity of two strings. A widely used notion of string similarity is the edit distance: the minimum number of insertions, deletions, and substitutions required to transform one string into the other. In this report, we provide a stochastic model for string-edit distance. Our stochastic model allows us to learn a string-edit distance function from a corpus of examples. We illustrate the utility of our approach by applying it to the difficult problem of learning the pronunciation of words in conversational speech. In this application, the learned string-edit distance achieves nearly one-fifth the error rate of the untrained Levenshtein distance. Our approach is applicable to any string classification problem that can be solved by comparing an input, under a similarity function, against a database of labeled prototypes.
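
For readers who want a concrete picture of the baseline the abstract mentions, the following is a minimal sketch of the untrained (unit-cost) Levenshtein distance, together with nearest-prototype classification. It is not the authors' stochastic model and learns no costs. The function names `levenshtein` and `classify`, and the toy prototype list, are assumptions made for illustration only.

```python
from typing import Sequence, Tuple


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance: minimum number of insertions,
    deletions, and substitutions that turn `a` into `b`."""
    # prev[j] holds the distance between the first i-1 symbols of `a`
    # and the first j symbols of `b`; row 0 is all insertions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # transforming a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def classify(query: str, prototypes: Sequence[Tuple[str, str]]) -> str:
    """Return the label of the prototype string closest to `query`.
    `prototypes` is a hypothetical list of (string, label) pairs."""
    _, label = min(prototypes, key=lambda p: levenshtein(query, p[0]))
    return label


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))  # 3
    toy = [("kitten", "cat"), ("sitting", "pose")]  # illustrative data only
    print(classify("mitten", toy))  # cat
```

A learned distance, as described in the abstract, would replace the fixed unit costs in the `min(...)` step with costs estimated from a corpus of example string pairs; the classification step would stay the same.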