Efficient dictionary-based text rewriting using subsequential transducers

Authors:
S. Mihov; K. U. Schulz
Affiliations:
Institute for Parallel Processing, Bulgarian Academy of Sciences, Bulgaria, e-mail: stoyan@lml.bas.bg; CIS, University of Munich, Munich, Germany, e-mail: schulz@cis.uni-muenchen.de
Venue:
Natural Language Engineering
Year:
2007

Abstract

Problems in the area of text and document processing can often be described as text rewriting tasks: given an input text, produce a new text by applying some fixed set of rewriting rules. In its simplest form, a rewriting rule is given by a pair of strings, representing a source string (the “original”) and its substitute. By a rewriting dictionary, we mean a finite list of such pairs; dictionary-based text rewriting means replacing occurrences of originals in an input text by their substitutes. We present an efficient method for constructing, given a rewriting dictionary D, a subsequential transducer that accepts any text t as input and outputs the intended rewriting result t' under the so-called “leftmost-longest match” replacement with skips. The time needed to compute the transducer is linear in the size of the input dictionary. Given the transducer, any text t is rewritten deterministically in time O(|t|+|t'|), where |t| and |t'| denote the lengths of the input and output texts. Hence the resulting rewriting mechanism is very efficient. As a second advantage, the transducer can be composed directly with other transducers, using standard tools, to solve more complex rewriting tasks efficiently in a single processing step.
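
The “leftmost-longest match” replacement with skips described in the abstract can be illustrated with a small reference sketch. This is not the subsequential transducer construction of the paper: it is a plain trie scan that restarts at every position, so its worst-case running time is proportional to |t| times the length of the longest original rather than O(|t|+|t'|). The names `build_trie` and `rewrite` and the sample dictionary are hypothetical, chosen only for illustration.

```python
def build_trie(dictionary):
    """Build a character trie over the originals (illustrative helper).

    Each node is a dict with "next" (child nodes by character) and
    "out" (the substitute if an original ends here, else None).
    """
    root = {"next": {}, "out": None}
    for original, substitute in dictionary.items():
        if not original:
            continue  # assumption: empty originals are ignored
        node = root
        for ch in original:
            node = node["next"].setdefault(ch, {"next": {}, "out": None})
        node["out"] = substitute
    return root


def rewrite(text, trie):
    """Rewrite text under leftmost-longest match replacement with skips.

    At each position the longest original starting there is replaced by
    its substitute; if no original starts there, one character is copied
    unchanged (the "skip") and the scan moves on.
    """
    out = []
    i, n = 0, len(text)
    while i < n:
        node, j = trie, i
        best_end, best_sub = None, None
        # Walk the trie as far as the text allows, remembering the
        # longest original seen so far.
        while j < n and text[j] in node["next"]:
            node = node["next"][text[j]]
            j += 1
            if node["out"] is not None:
                best_end, best_sub = j, node["out"]
        if best_end is None:
            out.append(text[i])  # no match starts here: skip one character
            i += 1
        else:
            out.append(best_sub)  # replace the longest match
            i = best_end
    return "".join(out)


# Hypothetical rewriting dictionary D: original -> substitute.
D = {"a": "1", "ab": "2", "bc": "3"}
trie = build_trie(D)
print(rewrite("abcab", trie))  # prints "2c2"
```

In the sample run, "ab" wins over "a" at position 0 because it is the longer match, and the following "c" is copied unchanged because no original starts there. The "bc" occurrence is never considered, since its "b" was already consumed by the earlier match.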