The Left and Right Context of a Word: Overlapping Chinese Syllable Word Segmentation with Minimal Context

Authors:
Mike Tian-Jian Jiang;Tsung-Hsien Lee;Wen-Lian Hsu
Affiliations:
National Tsing Hua University and Academia Sinica;Academia Sinica and University of Texas at Austin;Academia Sinica and National Tsing Hua University
Venue:
ACM Transactions on Asian Language Information Processing (TALIP)
Year:
2013

Abstract

Because a Chinese syllable can correspond to many characters (homophones), syllable-to-character conversion is a challenging task for Chinese phonetic input methods (CPIMs). A CPIM usually works in two stages: (1) segmenting the syllable sequence into syllable words, and (2) selecting the most likely character word for each syllable word. A CPIM usually assumes that the input is a complete sentence, and its performance is evaluated on a well-formed corpus. In practice, however, most Pinyin users prefer progressive text entry in several short chunks, mainly of one or two words each (most Chinese words consist of two or more characters). Short chunks do not provide enough context to perform the best possible syllable-to-character conversion, especially when a chunk consists of overlapping syllable words. In such cases, a conversion system often selects the boundary of the word with the highest frequency. Short chunk input is even more popular on platforms with limited computing power, such as mobile phones. Based on the observation that the relative strength of a word can be quite different when calculated leftwards or rightwards, we propose a simple division of the word context into a left context and a right context. Furthermore, we design a double ranking strategy for each word to reduce the number of errors in the first stage (segmentation). Our strategy is modeled as the minimum feedback arc set problem on bipartite tournaments, with approximate solutions derived from a genetic algorithm. Experiments show that, compared to the frequency-based method (FBM), which needs little memory and is fast, and the conditional random fields (CRF) model, which needs more memory and is slower, our double ranking strategy has the benefits of lower memory use and low power requirements, with competitive performance. We believe a similar strategy could also be adopted to disambiguate conflicting linguistic patterns effectively.
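
The minimum feedback arc set formulation mentioned above can be illustrated with a small sketch. This is not the authors' implementation: the vertex names, the toy arcs, and the reading of an arc (u, v) as "u should be ranked above v" are assumptions made for illustration only, and exhaustive search over orderings stands in for the genetic algorithm, which is only practical because the example is tiny.

```python
from itertools import permutations

def count_back_arcs(order, arcs):
    """Count arcs (u, v) that the ordering violates, i.e. u is placed after v."""
    pos = {vertex: i for i, vertex in enumerate(order)}
    return sum(1 for u, v in arcs if pos[u] > pos[v])

def min_feedback_arc_order(vertices, arcs):
    """Exhaustively find an ordering with the fewest violated arcs.

    Brute force, exponential in the number of vertices; a stand-in for
    an approximate solver such as a genetic algorithm.
    """
    best = min(permutations(vertices), key=lambda o: count_back_arcs(o, arcs))
    return list(best), count_back_arcs(best, arcs)

# Hypothetical bipartite tournament: every (left, right) pair has exactly
# one arc, and there are no arcs within a side. "L*" and "R*" are made-up
# labels for words ranked by their left and right context respectively.
vertices = ["L1", "L2", "R1", "R2"]

# Consistent preferences: some ordering satisfies every arc.
acyclic_arcs = [("L1", "R1"), ("L1", "R2"), ("R1", "L2"), ("R2", "L2")]

# Conflicting preferences: L1 > R1 > L2 > R2 > L1 forms a cycle,
# so at least one arc must be violated by any ordering.
cyclic_arcs = [("L1", "R1"), ("R1", "L2"), ("L2", "R2"), ("R2", "L1")]

if __name__ == "__main__":
    for name, arcs in [("acyclic", acyclic_arcs), ("cyclic", cyclic_arcs)]:
        order, cost = min_feedback_arc_order(vertices, arcs)
        print(name, order, cost)
```

In this toy instance the consistent preferences admit an ordering with no violated arcs, while the cyclic preferences force exactly one violation. The size of the smallest set of violated arcs is the quantity that the minimum feedback arc set problem minimizes.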