A discriminative candidate generator for string transformations

Authors:
Naoaki Okazaki;Yoshimasa Tsuruoka;Sophia Ananiadou;Jun'ichi Tsujii
Affiliations:
University of Tokyo, Hongo, Bunkyo-ku, Tokyo, Japan;University of Manchester, Manchester Interdisciplinary Biocentre, Manchester, UK;University of Manchester, Manchester Interdisciplinary Biocentre, Manchester, UK;University of Tokyo, Hongo, Bunkyo-ku, Tokyo, Japan and University of Manchester, Manchester Interdisciplinary Biocentre, Manchester, UK
Venue:
EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Year:
2008

Abstract

String transformation, which maps a source string s into its desired form t*, underlies applications such as stemming, lemmatization, and spelling correction. An essential step in string transformation is generating the candidates into which a given string s is likely to be transformed. This paper presents a discriminative approach to generating candidate strings. We use substring substitution rules as features and score them with an L1-regularized logistic regression model. We also propose a procedure for generating negative instances that affect the decision boundary of the model. An advantage of this approach is that candidate strings can be enumerated efficiently, because the string transformation process is tractable in the model. We demonstrate the effectiveness of the proposed method in normalizing inflected words and spelling variations.
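
The idea of using substring substitution rules as features of an L1-regularized logistic regression model can be illustrated with a small sketch. It is not the authors' implementation: the rule extraction via `difflib`, the one-character context window, the `^`/`$` boundary markers, the proximal-gradient trainer, the hyperparameters, and the toy training pairs are all assumptions made for illustration, and the abstract's procedure for generating negative instances is replaced here by hand-written negatives.

```python
import math
from difflib import SequenceMatcher


def substitution_rules(s, t):
    """Extract (source_substring, target_substring) rules that rewrite s into t.

    Assumption: each differing span is emitted with 0 or 1 characters of
    context on either side; '^' and '$' mark the string boundaries.
    """
    s2, t2 = "^" + s + "$", "^" + t + "$"
    rules = set()
    matcher = SequenceMatcher(None, s2, t2, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        for left in (0, 1):
            for right in (0, 1):
                rules.add((s2[i1 - left:i2 + right], t2[j1 - left:j2 + right]))
    return rules


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_l1_logreg(instances, l1=0.05, lr=0.5, epochs=300):
    """Fit binary-feature logistic regression with an L1 penalty.

    Uses full-batch proximal gradient descent (soft-thresholding), a simple
    stand-in chosen for this sketch. `instances` is a list of (rule_set, label).
    """
    weights = {}
    n = len(instances)
    for _ in range(epochs):
        grad = {}
        for feats, y in instances:
            p = sigmoid(sum(weights.get(f, 0.0) for f in feats))
            for f in feats:
                grad[f] = grad.get(f, 0.0) + (p - y) / n
        for f, g in grad.items():
            w = weights.get(f, 0.0) - lr * g
            shrunk = max(abs(w) - lr * l1, 0.0)  # soft-threshold toward zero
            if shrunk > 0.0:
                weights[f] = math.copysign(shrunk, w)
            else:
                weights.pop(f, None)  # the L1 penalty removes this rule
    return weights


def score(weights, s, t):
    """Estimated probability that t is a correct transformation of s."""
    return sigmoid(sum(weights.get(r, 0.0) for r in substitution_rules(s, t)))


# Toy data (hypothetical): label 1 = correct lemma, label 0 = incorrect candidate.
pairs = [
    ("studies", "study", 1), ("flies", "fly", 1), ("cries", "cry", 1),
    ("studies", "studie", 0), ("flies", "flie", 0), ("cries", "crie", 0),
]
instances = [(substitution_rules(s, t), y) for s, t, y in pairs]
weights = train_l1_logreg(instances)

for candidate in ("try", "trie"):
    print(candidate, round(score(weights, "tries", candidate), 3))
```

In this sketch, rules shared by several training pairs (such as `ies$` to `y$`) carry the score to the unseen word "tries", while the L1 penalty shrinks rules that occur in only one pair toward zero. Enumerating candidates efficiently, which the abstract names as the main advantage of the approach, is not covered here.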