Chinese word segmentation plays an important role in many Chinese language processing tasks, such as information retrieval and text mining. Recent research in Chinese word segmentation focuses on tagging approaches that use either characters or words as tagging units. In this paper we present a morpheme-based chunking approach and implement it in a two-stage system. The system consists of two main components: a morpheme segmentation component, which segments an input sentence into a sequence of morphemes using morpheme-formation models and bigram language models, and a lexical chunking component, which labels each segmented morpheme's position in a word of a special type with the aid of lexicalized hidden Markov models. To facilitate these tasks, we also develop a statistical technique for automatically compiling a morpheme dictionary from a segmented or tagged corpus. To evaluate this approach, we conduct a closed test and an open test using the 2005 SIGHAN Bakeoff data. Our system demonstrates state-of-the-art performance on different test sets, showing the benefits of choosing morphemes as tagging units. Furthermore, the open test results indicate that lexicalization and part-of-speech features significantly improve performance.
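The two-stage idea described above can be illustrated with a minimal Python sketch. It is not the system evaluated in the paper. The toy dictionary `MORPHEMES`, the bigram scores `BIGRAM_LOGP`, the flat penalty `UNSEEN_LOGP`, and the B/E/S position tags are all assumptions made for illustration. The first function picks the dictionary segmentation with the best bigram score by dynamic programming. The second merges morphemes into words from position tags that are supplied by hand here, standing in for the output of a lexicalized HMM tagger.

```python
# Toy resources: hypothetical values chosen for illustration, not taken from the paper.
MORPHEMES = {"研究", "生", "命", "生命", "研究生", "起源", "的"}
BIGRAM_LOGP = {
    ("<s>", "研究"): -1.0,
    ("研究", "生命"): -1.0,
    ("生命", "的"): -1.0,
    ("的", "起源"): -1.0,
    ("<s>", "研究生"): -1.5,
    ("研究生", "命"): -4.0,
}
UNSEEN_LOGP = -8.0  # flat penalty for unseen bigrams (a crude stand-in for smoothing)


def segment_morphemes(sentence, max_len=4):
    """Stage 1 (sketch): best-scoring dictionary segmentation under a bigram model."""
    # best[(end, last_morpheme)] = (log-probability, backpointer to previous state)
    best = {(0, "<s>"): (0.0, None)}
    for i in range(len(sentence)):
        # Snapshot the states ending at position i before adding new ones.
        states = [(k, v) for k, v in best.items() if k[0] == i]
        for (pos, prev), (score, _) in states:
            for j in range(i + 1, min(len(sentence), i + max_len) + 1):
                cand = sentence[i:j]
                # Single characters are always allowed so any input can be segmented.
                if j - i > 1 and cand not in MORPHEMES:
                    continue
                new_score = score + BIGRAM_LOGP.get((prev, cand), UNSEEN_LOGP)
                key = (j, cand)
                if key not in best or new_score > best[key][0]:
                    best[key] = (new_score, (pos, prev))
    # Pick the best state that covers the whole sentence, then follow backpointers.
    finals = [(v[0], k) for k, v in best.items() if k[0] == len(sentence)]
    _, key = max(finals)
    out = []
    while best[key][1] is not None:
        out.append(key[1])
        key = best[key][1]
    return out[::-1]


def assemble_words(morphemes, tags):
    """Stage 2 (sketch): merge morphemes into words from position tags.

    Assumed tag set: B = begins a word, I = inside, E = ends a word, S = single-morpheme word.
    """
    words, buf = [], ""
    for morpheme, tag in zip(morphemes, tags):
        buf += morpheme
        if tag in ("E", "S"):
            words.append(buf)
            buf = ""
    if buf:  # flush an unfinished word if the tag sequence ends on B or I
        words.append(buf)
    return words


if __name__ == "__main__":
    print(segment_morphemes("研究生命的起源"))
    print(assemble_words(["研究", "生", "的", "论文"], ["B", "E", "S", "S"]))
```

With these toy scores, the first call prefers 研究 / 生命 / 的 / 起源 over the competing reading that starts with 研究生, and the second call merges 研究 and 生 into the single word 研究生. In a real system the tags would come from a trained sequence model rather than being written by hand.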