Unsupervised overlapping feature selection for conditional random fields learning in Chinese word segmentation

Authors:
Ting-hao Yang;Tian-Jian Jiang;Chan-hung Kuo;Richard Tzong-han Tsai;Wen-lian Hsu
Affiliations:
Institute of Information Science, Academia Sinica;National Tsing-Hua University;Institute of Information Science, Academia Sinica;Yuan Ze University;Institute of Information Science, Academia Sinica
Venue:
ROCLING '11 Proceedings of the 23rd Conference on Computational Linguistics and Speech Processing
Year:
2011

Citing 12
Cited 0

PAT-tree-based keyword extraction for Chinese information retrieval

Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Accessor variety criteria for Chinese word extraction

Computational Linguistics
Unsupervised training for overlapping ambiguity resolution in Chinese word segmentation

SIGHAN '03 Proceedings of the second SIGHAN workshop on Chinese language processing - Volume 17
The first international Chinese word segmentation Bakeoff

SIGHAN '03 Proceedings of the second SIGHAN workshop on Chinese language processing - Volume 17
Introduction to CKIP Chinese word segmentation system for the first international Chinese Word Segmentation Bakeoff

SIGHAN '03 Proceedings of the second SIGHAN workshop on Chinese language processing - Volume 17
Subword-based tagging for confidence-dependent Chinese word segmentation

COLING-ACL '06 Proceedings of the COLING/ACL on Main conference poster sessions
Statistical Properties of Overlapping Ambiguities in Chinese Word Segmentation and a Strategy for Their Disambiguation

TSD '08 Proceedings of the 11th international conference on Text, Speech and Dialogue
Compute the Term Contributed Frequency

ISDA '08 Proceedings of the 2008 Eighth International Conference on Intelligent Systems Design and Applications - Volume 02
Chinese text segmentation: A hybrid approach using transductive learning and statistical association measures

Expert Systems with Applications: An International Journal
A Unified Character-Based Tagging Framework for Chinese Word Segmentation

ACM Transactions on Asian Language Information Processing (TALIP)
Statistical substring reduction in linear time

IJCNLP'04 Proceedings of the First international joint conference on Natural Language Processing

Quantified Score

Hi-index	0.00

Visualization

Abstract

This work represents several unsupervised feature selections based on frequent strings that help improve conditional random fields (CRF) model for Chinese word segmentation (CWS). These features include character-based N-gram (CNG), Accessor Variety based string (AVS), and Term Contributed Frequency (TCF) with a specific manner of boundary overlapping. For the experiment, the baseline is the 6-tag, a state-of-the-art labeling scheme of CRF-based CWS; and the data set is acquired from SIGHAN CWS bakeoff 2005. The experiment results show that all of those features improve our system's F1 measure (F) and Recall of Out-of-Vocabulary (ROOV). In particular, the feature collections which contain AVS feature outperform other types of features in terms of F, whereas the feature collections containing TCB/TCF information has better ROOV.