Critical tokenization and its properties

  • Authors: Jin Guo
  • Affiliation: National University of Singapore
  • Venue: Computational Linguistics
  • Year: 1997


Abstract

Tokenization is the process of mapping sentences from character strings into strings of words. This paper studies critical tokenization, a distinctive type of tokenization that follows the principle of maximum tokenization; the objective is to develop a mathematical description and understanding of it.

The main results are as follows: (1) critical points are all and only the unambiguous token boundaries of any character string over a complete dictionary; (2) any critically tokenized word string is a minimal element in the partially ordered set of all tokenized word strings under the word-string cover relation; (3) any tokenized word string can be reproduced from a critically tokenized word string, but not vice versa; (4) critical tokenization provides a sound mathematical foundation for categorizing tokenization ambiguity into critical and hidden types, giving a precise mathematical account of conventional concepts such as combinational and overlapping ambiguities; (5) many important maximum-tokenization variations, such as forward and backward maximum matching and shortest tokenization, are all true subclasses of critical tokenization.

It is believed that critical tokenization provides a precise mathematical description of the principle of maximum tokenization. Important implications and practical applications of critical tokenization in effective ambiguity resolution and in efficient tokenization implementation are also carefully examined.
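As a concrete illustration of one of the maximum-tokenization variants the abstract names, the following is a minimal sketch of forward maximum matching (FMM), not code from the paper itself. The dictionary and test string are hypothetical; at each position the tokenizer greedily takes the longest dictionary word that matches.

```python
def forward_maximum_matching(sentence, dictionary):
    """Greedily split `sentence` into the longest dictionary words,
    scanning left to right (forward maximum matching)."""
    max_len = max(len(w) for w in dictionary)
    tokens, i = [], 0
    while i < len(sentence):
        # Try candidate spans from longest to shortest.
        for j in range(min(len(sentence), i + max_len), i, -1):
            if sentence[i:j] in dictionary:
                tokens.append(sentence[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: fall back to one character.
            tokens.append(sentence[i])
            i += 1
    return tokens

# A classic overlapping-ambiguity case: "abc" with words "ab", "bc", "c", "a".
# FMM commits to "ab" first, so the alternative segmentation "a" + "bc"
# is never produced.
words = {"ab", "bc", "c", "a"}
print(forward_maximum_matching("abc", words))  # -> ['ab', 'c']
```

Backward maximum matching is the mirror image (scanning right to left), and the two can disagree exactly in overlapping-ambiguity cases like the one above; the paper's point is that such variants are all subclasses of critical tokenization.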