Chinese text segmentation for text retrieval: achievements and problems. Journal of the American Society for Information Science.
A stochastic finite-state word-segmentation algorithm for Chinese. ACL '94: Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics.
Aligning a parallel English-Chinese corpus statistically with lexical criteria. ACL '94: Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics.
A stochastic finite-state word-segmentation algorithm for Chinese. Computational Linguistics.
Building parallel corpora by automatic title alignment. ICADL '02: Proceedings of the 5th International Conference on Asian Digital Libraries: Digital Libraries: People, Knowledge, and Technology.
A comparison of Chinese document indexing strategies and retrieval models. ACM Transactions on Asian Language Information Processing (TALIP).
A compression-based algorithm for Chinese word segmentation. Computational Linguistics.
Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics.
A trainable rule-based algorithm for word segmentation. ACL '97: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics.
Machine translation with a stochastic grammatical channel. COLING '98: Proceedings of the 17th International Conference on Computational Linguistics, Volume 2.
An algorithm for simultaneously bracketing parallel texts by aligning words. ACL '95: Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics.
A polynomial-time algorithm for statistical machine translation. ACL '96: Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics.
A maximum-entropy Chinese parser augmented by transformation-based learning. ACM Transactions on Asian Language Information Processing (TALIP).
Multidimensional transformation-based learning. CoNLL '01: Proceedings of the 2001 Workshop on Computational Natural Language Learning, Volume 7.
A maximum entropy Chinese character-based parser. EMNLP '03: Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing.
Applications of corpus-based semantic similarity and word segmentation to database schema matching. The VLDB Journal.
CICLing '07: Proceedings of the 8th International Conference on Computational Linguistics and Intelligent Text Processing.
IJCAI '95: Proceedings of the 14th International Joint Conference on Artificial Intelligence, Volume 2.
Large-scale language modeling with random forests for Mandarin Chinese speech-to-text. IceTAL '10: Proceedings of the 7th International Conference on Advances in Natural Language Processing.
A new re-ranking method for generic Chinese text summarization and its evaluation. ICADL '05: Proceedings of the 8th International Conference on Asian Digital Libraries: Implementing Strategies and Sharing Experiences.
A new method to compose long unknown Chinese keywords. Journal of Information Science.
The first step in Chinese NLP is to tokenize, or segment, character sequences into words, since the text contains no word delimiters. Recent heavy activity in this area has shown the biggest stumbling block to be words that are absent from the lexicon, since successful tokenizers to date have been based on dictionary lookup (e.g., Chang & Chen 1993; Chiang et al. 1992; Lin et al. 1993; Wu & Tseng 1993; Sproat et al. 1994). We present empirical evidence for four points concerning tokenization of Chinese text: (1) more rigorous "blind" evaluation methodology is needed to avoid inflated accuracy measurements, and we introduce the nk-blind method; (2) the unknown-word problem is far more serious than generally thought when tokenizing unrestricted texts in realistic domains; (3) statistical lexical acquisition is a practical means to greatly improve tokenization accuracy on unknown words, reducing error rates by as much as 32.0%; (4) when augmenting the lexicon, linguistic constraints can provide simple, inexpensive filters that yield significantly better precision, reducing error rates by as much as 49.4%.
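To illustrate the dictionary-lookup approach the abstract refers to, the sketch below implements greedy forward maximum matching, a common baseline for lexicon-based segmentation. It is a minimal example under stated assumptions, not the specific algorithm of any paper cited above; the lexicon and sentence are toy, hypothetical examples.

# Minimal sketch of lexicon-based tokenization via greedy forward
# maximum matching (a common baseline, not the method of any paper
# cited above). The toy lexicon and sentence are hypothetical.

def max_match(text, lexicon, max_word_len=4):
    """Greedily take the longest lexicon entry at each position;
    characters covered by no entry fall through as single-character
    tokens, which is the unknown-word failure mode."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, down to a single character.
        for j in range(min(len(text), i + max_word_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

lexicon = {"中国", "人民", "银行"}          # toy dictionary
print(max_match("中国人民银行", lexicon))   # -> ['中国', '人民', '银行']

Any word missing from the lexicon surfaces as a run of single-character tokens; this is exactly the error mode that the statistical lexical acquisition and linguistic filtering described in points (3) and (4) aim to repair.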