Rethinking Chinese word segmentation: tokenization, character classification, or wordbreak identification

  • Authors:
  • Chu-Ren Huang;Petr Šimon;Shu-Kai Hsieh;Laurent Prévot

  • Affiliations:
  • Institute of Linguistics, Taiwan;Institute of Linguistics, Taiwan;DoFLAL, NIU, Taiwan;Université de Toulouse, France

  • Venue:
  • ACL '07 Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions
  • Year:
  • 2007

Abstract

This paper addresses two remaining challenges in Chinese word segmentation. The challenge in HLT is to find a robust segmentation method that requires no prior lexical knowledge and no extensive training to adapt to new types of data. The challenge in modelling human cognition and acquisition is to segment words efficiently without using knowledge of wordhood. We propose a radical method of word segmentation to meet both challenges. The most critical concept that we introduce is that Chinese word segmentation is the classification of a string of character-boundaries (CB's) into either word-boundaries (WB's) or non-word-boundaries. In Chinese, each CB falls between two characters. Hence we can use the distributional properties of CB's within the surrounding character strings to predict which CB's are WB's.
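The sketch below illustrates the CB-classification view described in the abstract: every gap between two characters is scored from raw-text distributional statistics and labelled WB or non-WB. The specific statistic used here (forward branching entropy of the preceding character) and the threshold value are illustrative assumptions on my part; the abstract does not commit to a particular feature or classifier.

```python
# Minimal sketch: segmentation as binary classification of character
# boundaries (CB's) into word boundaries (WB's) vs. non-WB's, using only
# distributional statistics from unsegmented text. The branching-entropy
# feature and the threshold are assumptions, not the authors' exact method.

import math
from collections import Counter, defaultdict

def train_successor_counts(corpus):
    """For each character, count the distribution of characters that follow it."""
    following = defaultdict(Counter)
    for sentence in corpus:
        for left, right in zip(sentence, sentence[1:]):
            following[left][right] += 1
    return following

def branching_entropy(following, ch):
    """Entropy of the successor distribution of `ch`. A high value means many
    different characters can follow, which suggests a likely word boundary."""
    counts = following[ch]
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def segment(sentence, following, threshold=1.0):
    """Classify each CB: insert a WB wherever the distributional score
    of the preceding character exceeds the threshold."""
    words, current = [], sentence[0]
    for ch in sentence[1:]:
        if branching_entropy(following, current[-1]) > threshold:
            words.append(current)   # CB classified as WB
            current = ch
        else:
            current += ch           # CB classified as non-WB
    words.append(current)
    return words
```

Because the classifier touches only character-level co-occurrence counts, it needs no lexicon and no annotated training data, which is the property the abstract highlights for both the HLT and the cognitive-modelling settings.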