Tokenization as the initial phase in NLP

  • Authors:
  • Jonathan J. Webster; Chunyu Kit

  • Affiliations:
  • City Polytechnic of Hong Kong, Kowloon, Hong Kong; City Polytechnic of Hong Kong, Kowloon, Hong Kong

  • Venue:
  • COLING '92 Proceedings of the 14th conference on Computational linguistics - Volume 4
  • Year:
  • 1992

Abstract

In this paper, the authors address the significance and complexity of tokenization, the initial step of NLP. The notions of word and token are discussed and defined from the viewpoints of lexicography and pragmatic implementation, respectively. Automatic segmentation of Chinese words is presented as an illustration of tokenization. Practical approaches to the identification of compound tokens in English, such as idioms, phrasal verbs, and fixed expressions, are developed.
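
To make the Chinese segmentation illustration concrete, here is a minimal sketch of greedy forward maximum matching, a common dictionary-based baseline for this task. It is not necessarily the exact algorithm of the paper; the dictionary and example sentence are invented for illustration.

```python
# Greedy forward maximum matching: at each position, take the longest
# dictionary entry that matches, falling back to a single character.
# Dictionary and sentence are illustrative, not from the paper.

def max_match(text, dictionary, max_len=4):
    """Segment text by repeatedly taking the longest dictionary match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, down to a single character.
        for j in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + j]
            if j == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += j
                break
    return tokens

dictionary = {"研究", "生命", "研究生", "起源"}
print(max_match("研究生命起源", dictionary))
# → ['研究生', '命', '起源']
```

Note that the greedy strategy picks 研究生 ("graduate student") over the intended reading 研究/生命/起源 ("research / life / origin"), which illustrates the segmentation-ambiguity problem that motivates treating tokenization as a nontrivial NLP phase.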