Unsupervised overlapping feature selection for conditional random fields learning in Chinese word segmentation

  • Authors:
  • Ting-hao Yang;Tian-Jian Jiang;Chan-hung Kuo;Richard Tzong-han Tsai;Wen-lian Hsu

  • Affiliations:
  • Institute of Information Science, Academia Sinica;National Tsing-Hua University;Institute of Information Science, Academia Sinica;Yuan Ze University;Institute of Information Science, Academia Sinica

  • Venue:
  • ROCLING '11 Proceedings of the 23rd Conference on Computational Linguistics and Speech Processing
  • Year:
  • 2011

Abstract

This work presents several unsupervised feature selection methods based on frequent strings that help improve conditional random field (CRF) models for Chinese word segmentation (CWS). The features include character-based N-grams (CNG), Accessor Variety based strings (AVS), and Term Contributed Frequency (TCF) with a specific manner of boundary overlapping. In the experiments, the baseline is the 6-tag scheme, a state-of-the-art labeling scheme for CRF-based CWS, and the data set is taken from the SIGHAN CWS bakeoff 2005. The experimental results show that all of these features improve the system's F1 measure (F) and recall of out-of-vocabulary words (ROOV). In particular, feature collections that contain the AVS feature outperform other feature types in terms of F, whereas feature collections containing TCB/TCF information achieve better ROOV.
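
To make two of the ingredients named in the abstract more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the 6-tag scheme uses the commonly cited tag set B, B2, B3, M, E, S, and it computes accessor variety in its usual form, as the minimum of the numbers of distinct left and right neighbours of a string. The function names `six_tag` and `accessor_variety`, the `max_len` parameter, and the treatment of sentence boundaries as a single pseudo-character are all assumptions made for illustration.

```python
from collections import defaultdict


def six_tag(word):
    """Label each character of a segmented word with a 6-tag label.

    Assumed tag set: B, B2, B3 (first three characters), M (further
    middle characters), E (last character), S (single-character word).
    """
    n = len(word)
    if n == 1:
        return ["S"]
    # Characters between the first and the last: B2, B3, then M repeated.
    middle = ["B2", "B3"][: n - 2]
    middle += ["M"] * (n - 2 - len(middle))
    return ["B"] + middle + ["E"]


def accessor_variety(sentences, max_len=4):
    """Return {substring: AV} for all substrings up to max_len characters.

    AV(s) = min(#distinct left neighbours, #distinct right neighbours).
    Simplifying assumption: every sentence start (or end) counts as the
    same pseudo-character "<S>" (or "</S>").
    """
    left = defaultdict(set)
    right = defaultdict(set)
    for sent in sentences:
        for i in range(len(sent)):
            for j in range(i + 1, min(i + max_len, len(sent)) + 1):
                s = sent[i:j]
                left[s].add(sent[i - 1] if i > 0 else "<S>")
                right[s].add(sent[j] if j < len(sent) else "</S>")
    return {s: min(len(left[s]), len(right[s])) for s in left}


# Toy unlabeled corpus (hypothetical data, for illustration only).
corpus = ["我爱北京", "他爱北京", "爱北京人"]
av = accessor_variety(corpus)
print(six_tag("北京"))        # ['B', 'E']
print(av["爱北京"], av["北京"])  # 2 1
```

In a CRF-based segmenter, scores such as these would typically be discretized and attached to each character as extra feature columns alongside the 6-tag labels; how overlapping candidate strings are resolved at a given character is the "boundary overlapping" question the abstract refers to, and this sketch does not attempt to reproduce it.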