Improving Chinese tokenization with linguistic filters on statistical lexical acquisition

  • Authors:
  • Dekai Wu; Pascale Fung

  • Affiliations:
  • Hong Kong University of Science & Technology (HKUST), Clear Water Bay, Hong Kong; Columbia University, New York, NY

  • Venue:
  • ANLC '94: Proceedings of the Fourth Conference on Applied Natural Language Processing
  • Year:
  • 1994


Abstract

The first step in Chinese NLP is to tokenize or segment character sequences into words, since the text contains no word delimiters. Recent heavy activity in this area has shown the biggest stumbling block to be words that are absent from the lexicon, since successful tokenizers to date have been based on dictionary lookup (e.g., Chang & Chen 1993; Chiang et al. 1992; Lin et al. 1993; Wu & Tseng 1993; Sproat et al. 1994). We present empirical evidence for four points concerning tokenization of Chinese text: (1) More rigorous "blind" evaluation methodology is needed to avoid inflated accuracy measurements; we introduce the nk-blind method. (2) The extent of the unknown-word problem is far more serious than generally thought when tokenizing unrestricted texts in realistic domains. (3) Statistical lexical acquisition is a practical means to greatly improve tokenization accuracy on unknown words, reducing error rates by as much as 32.0%. (4) When augmenting the lexicon, linguistic constraints can provide simple, inexpensive filters that yield significantly better precision, reducing error rates by as much as 49.4%.
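
To make the dictionary-lookup approach mentioned in the abstract concrete, here is a minimal sketch of greedy maximum-matching segmentation, a classic lexicon-based strategy of the kind the cited tokenizers build on. It is not the paper's own tokenizer or its nk-blind evaluation; the toy lexicon and maximum word length are hypothetical and purely illustrative.

```python
# Minimal sketch of dictionary-lookup segmentation via greedy maximum matching.
# TOY_LEXICON and MAX_WORD_LEN are hypothetical, for illustration only.

TOY_LEXICON = {"中国", "人民", "中", "国", "人", "民"}  # hypothetical entries
MAX_WORD_LEN = 4                                       # assumed longest word

def greedy_segment(text, lexicon=TOY_LEXICON, max_len=MAX_WORD_LEN):
    """Segment a Chinese character string by longest dictionary match.

    Characters not covered by any lexicon entry fall back to single-character
    tokens, which is exactly where the unknown-word problem arises.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a lexicon hit
        # (or a single-character fallback).
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon or length == 1:
                tokens.append(candidate)
                i += length
                break
    return tokens

print(greedy_segment("中国人民"))  # -> ['中国', '人民'] with the toy lexicon
```

Any word missing from the lexicon is broken into single characters by this fallback, which is the unknown-word failure mode that points (2)-(4) of the abstract address through statistical lexical acquisition and linguistic filtering.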