In our study, sentences are represented as sequences of critical fragments, and a critical fragment with more than one distinct resolution in the training corpus is considered ambiguous. Unlike previous studies, we disambiguate these ambiguous critical fragments with an example-based system. The context on either side of an ambiguous critical fragment, i.e. the adjacent characters, words, and critical fragments, is used to measure the distance between training and test examples. Two measures are employed, the overlap metric and chi-squared feature weighting, and our system achieves a precision of 93.65% and a recall of 96.56% in the open test.
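The example-based approach described above can be sketched as a nearest-neighbour classifier over context features, where the overlap metric counts mismatching feature positions and chi-squared statistics weight each position by how strongly it predicts the resolution. The sketch below is illustrative only: the feature extraction, weighting details, and function names are assumptions, not the paper's actual implementation, and the toy features stand in for the adjacent characters, words, and critical fragments used in the study.

```python
from collections import Counter


def chi_squared_weights(examples, labels):
    """Assign each feature position a chi-squared weight (illustrative sketch).

    examples: list of equal-length feature tuples (stand-ins for the context
    features on either side of an ambiguous critical fragment).
    labels: the resolution observed for each training example.
    """
    total = len(examples)
    label_counts = Counter(labels)
    weights = []
    for i in range(len(examples[0])):
        value_counts = Counter(ex[i] for ex in examples)
        joint = Counter((ex[i], lab) for ex, lab in zip(examples, labels))
        chi2 = 0.0
        # Sum over the full (value, label) contingency table, including
        # cells with zero observed count.
        for v in value_counts:
            for lab in label_counts:
                expected = value_counts[v] * label_counts[lab] / total
                observed = joint.get((v, lab), 0)
                chi2 += (observed - expected) ** 2 / expected
        weights.append(chi2)
    return weights


def weighted_overlap_distance(x, y, weights):
    """Overlap metric: sum the weights of positions where the features differ."""
    return sum(w for xi, yi, w in zip(x, y, weights) if xi != yi)


def resolve(test, examples, labels, weights):
    """Pick the resolution of the nearest training example (1-NN)."""
    best = min(range(len(examples)),
               key=lambda i: weighted_overlap_distance(test, examples[i], weights))
    return labels[best]
```

With uniform weights this reduces to the plain overlap metric; the chi-squared weights let highly predictive context positions dominate the distance, which is the usual motivation for feature weighting in memory-based disambiguation.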