Efficient dictionary-based text rewriting using subsequential transducers†

  • Authors:
  • S. Mihov;K. u. Schulz

  • Affiliations:
  • Institute for parallel processing, bulgarian academy of sciences, bulgaria e-mail: stoyan@lml.bas.bg;Cis, university of munich, munich, germany e-mail: schulz@cis.uni-muenchen.de

  • Venue:
  • Natural Language Engineering
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

Problems in the area of text and document processing can often be described as text rewriting tasks: given an input text, produce a new text by applying some fixed set of rewriting rules. In its simplest form, a rewriting rule is given by a pair of strings, representing a source string (the “original”) and its substitute. By a rewriting dictionary, we mean a finite list of such pairs; dictionary-based text rewriting means to replace in an input text occurrences of originals by their substitutes. We present an efficient method for constructing, given a rewriting dictionary D, a subsequential transducer that accepts any text t as input and outputs the intended rewriting result under the so-called “leftmost-longest match” replacement with skips, t'. The time needed to compute the transducer is linear in the size of the input dictionary. Given the transducer, any text t of length |t| is rewritten in a deterministic manner in time O(|t|+|t'|), where t' denotes the resulting output text. Hence the resulting rewriting mechanism is very efficient. As a second advantage, using standard tools, the transducer can be directly composed with other transducers to efficiently solve more complex rewriting tasks in a single processing step.