In our study, sentences are represented as sequences of critical fragments, and a critical fragment with more than one distinct resolution in the training corpus is considered ambiguous. Unlike previous studies, we disambiguate these ambiguous critical fragments with an example-based system. The context on either side of an ambiguous critical fragment, i.e. the adjacent characters, words, and critical fragments, is used to measure the distance between training and test examples. Two measures are employed, the overlap metric and chi-squared feature weighting, and our system achieves a precision of 93.65% and a recall of 96.56% in the open test.
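The example-based approach described above can be sketched as a nearest-neighbour classifier over context features, where the overlap metric counts mismatching feature positions and chi-squared statistics weight each position by how strongly it predicts the resolution. The sketch below is illustrative only: the feature extraction, weighting details, and function names are assumptions, not the paper's actual implementation, and the toy features stand in for the adjacent characters, words, and critical fragments used in the study.

```python
from collections import Counter


def chi_squared_weights(examples, labels):
    """Assign each feature position a chi-squared weight (illustrative sketch).

    examples: list of equal-length feature tuples (stand-ins for the context
    features on either side of an ambiguous critical fragment).
    labels: the resolution observed for each training example.
    """
    total = len(examples)
    label_counts = Counter(labels)
    weights = []
    for i in range(len(examples[0])):
        value_counts = Counter(ex[i] for ex in examples)
        joint = Counter((ex[i], lab) for ex, lab in zip(examples, labels))
        chi2 = 0.0
        # Sum over the full (value, label) contingency table, including
        # cells with zero observed count.
        for v in value_counts:
            for lab in label_counts:
                expected = value_counts[v] * label_counts[lab] / total
                observed = joint.get((v, lab), 0)
                chi2 += (observed - expected) ** 2 / expected
        weights.append(chi2)
    return weights


def weighted_overlap_distance(x, y, weights):
    """Overlap metric: sum the weights of positions where the features differ."""
    return sum(w for xi, yi, w in zip(x, y, weights) if xi != yi)


def resolve(test, examples, labels, weights):
    """Pick the resolution of the nearest training example (1-NN)."""
    best = min(range(len(examples)),
               key=lambda i: weighted_overlap_distance(test, examples[i], weights))
    return labels[best]
```

With uniform weights this reduces to the plain overlap metric; the chi-squared weights let highly predictive context positions dominate the distance, which is the usual motivation for feature weighting in memory-based disambiguation.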