Dasher—a data entry interface using continuous gestures and language models
UIST '00 Proceedings of the 13th annual ACM symposium on User interface software and technology
Toward a unified approach to statistical language modeling for Chinese
ACM Transactions on Asian Language Information Processing (TALIP)
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
A method to build a super small but practically accurate language model for handheld devices
Journal of Computer Science and Technology
Improving language model size reduction using better pruning criteria
ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics
A new statistical approach to Chinese Pinyin input
ACL '00 Proceedings of the 38th Annual Meeting on Association for Computational Linguistics
Unsupervised training for overlapping ambiguity resolution in Chinese word segmentation
SIGHAN '03 Proceedings of the second SIGHAN workshop on Chinese language processing - Volume 17
The first international Chinese word segmentation Bakeoff
SIGHAN '03 Proceedings of the second SIGHAN workshop on Chinese language processing - Volume 17
Chinese word segmentation as LMR tagging
SIGHAN '03 Proceedings of the second SIGHAN workshop on Chinese language processing - Volume 17
An empirical study on language model adaptation
ACM Transactions on Asian Language Information Processing (TALIP)
Scaling conditional random fields using error-correcting codes
ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
Chinese segmentation and new word detection using conditional random fields
COLING '04 Proceedings of the 20th international conference on Computational Linguistics
Feedback arc set in bipartite tournaments is NP-complete
Information Processing Letters
A syllable-synchronous network search algorithm for word decoding in Chinese speech recognition
ICASSP '99 Proceedings of the Acoustics, Speech, and Signal Processing, 1999. on 1999 IEEE International Conference - Volume 02
Feedback arc set problem in bipartite tournaments
Information Processing Letters
Chinese pinyin phrasal input on mobile phone: usability and developing trends
Mobility '07 Proceedings of the 4th international conference on mobile technology, applications, and systems and the 1st international symposium on Computer human interaction in mobile technology
CrossNet: a framework for crossover with network-based chromosomal representations
Proceedings of the 10th annual conference on Genetic and evolutionary computation
TSD '08 Proceedings of the 11th international conference on Text, Speech and Dialogue
A Unified Character-Based Tagging Framework for Chinese Word Segmentation
ACM Transactions on Asian Language Information Processing (TALIP)
Integrating unsupervised and supervised word segmentation: The role of goodness measures
Information Sciences: an International Journal
A stacked sub-word model for joint Chinese word segmentation and part-of-speech tagging
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Unsupervised segmentation of chinese corpus using accessor variety
IJCNLP'04 Proceedings of the First international joint conference on Natural Language Processing
Hi-index | 0.00 |
Since a Chinese syllable can correspond to many characters (homophones), the syllable-to-character conversion task is quite challenging for Chinese phonetic input methods (CPIM). There are usually two stages in a CPIM: 1. segment the syllable sequence into syllable words, and 2. select the most likely character words for each syllable word. A CPIM usually assumes that the input is a complete sentence, and evaluates the performance based on a well-formed corpus. However, in practice, most Pinyin users prefer progressive text entry in several short chunks, mainly in one or two words each (most Chinese words consist of two or more characters). Short chunks do not provide enough contexts to perform the best possible syllable-to-character conversion, especially when a chunk consists of overlapping syllable words. In such cases, a conversion system often selects the boundary of a word with the highest frequency. Short chunk input is even more popular on platforms with limited computing power, such as mobile phones. Based on the observation that the relative strength of a word can be quite different when calculated leftwards or rightwards, we propose a simple division of the word context into the left context and the right context. Furthermore, we design a double ranking strategy for each word to reduce the number of errors in Step 1. Our strategy is modeled as the minimum feedback arc set problem on bipartite tournament with approximate solutions derived from genetic algorithm. Experiments show that, compared to the frequency-based method (FBM) (low memory and fast) and the conditional random fields (CRF) model (larger memory and slower), our double ranking strategy has the benefits of less memory and low power requirement with competitive performance. We believe a similar strategy could also be adopted to disambiguate conflicting linguistic patterns effectively.