A Chinese word segmentation algorithm based on forward maximum matching and word binding force is proposed in this paper. The algorithm plays a key role in post-processing the output of a character or speech recognizer: it determines the proper word sequence corresponding to an input line of character images or a speech waveform. To support the algorithm, a text corpus of over 63 million characters is used to enrich an 80,000-word lexicon in terms of its word entries and word binding forces. In its current form, given an input line of text, the word segmentor processes on average 210,000 characters per second on an IBM RISC System/6000 3BT workstation, with a correct word identification rate of 99.74%.
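
As a rough illustration of the forward maximum matching step named above, the following Python sketch greedily takes the longest lexicon entry at each position. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `forward_maximum_match`, the toy lexicon, and the single-character fallback are all hypothetical, and word binding force is not modeled because the abstract does not define how it is computed or applied.

```python
def forward_maximum_match(text, lexicon, max_word_len=None):
    """Greedy left-to-right segmentation: at each position, take the
    longest lexicon entry that matches. Hypothetical helper for
    illustration; word binding force is NOT modeled here."""
    if max_word_len is None:
        # Longest lexicon entry bounds the candidate length.
        max_word_len = max((len(w) for w in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # Assumption: an unknown single character is emitted as its own word.
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Toy lexicon (an assumption, not taken from the paper's 80,000-word lexicon).
toy_lexicon = {"研究", "研究生", "生命", "起源"}
segmented = forward_maximum_match("研究生命起源", toy_lexicon)
print(segmented)  # ['研究生', '命', '起源']
```

On this toy input the greedy match picks "研究生" first and strands "命", whereas the intended reading is "研究 / 生命 / 起源". Ambiguities of this kind are the sort that pure maximum matching cannot resolve on its own, which is presumably where additional statistics such as the word binding forces come in.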