Using a large synchronous Chinese corpus, we show how word and character entropy variations exhibit interesting differences across time and space for different Chinese speech communities. We find that word entropy values are affected by the quality of the segmentation process. We also note that word entropies can be affected by proper nouns, which are the most volatile segment of the otherwise stable lexicon of the language. Our word and character entropy results provide an interesting comparison with earlier results. Moreover, the average joint character entropies (a.k.a. entropy rates) of Chinese that we compute up to order 20 indicate that the limits of the conditional character entropies of Chinese for the different speech communities should be about 1 bit (or less). This invites the question of whether early convergence of character entropies would also entail word entropy convergence.
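The order-n quantities discussed above can be illustrated with a simplified maximum-likelihood sketch: the conditional character entropy H(X_n | X_1..X_{n-1}) and the average joint (block) entropy H(X_1..X_n)/n, both estimated from raw n-gram counts. This is only a toy illustration with hypothetical function names; serious estimates at high orders (such as order 20) require smoothed models like PPM and much larger corpora, since plain counts severely underestimate entropy.

```python
import math
from collections import Counter

def conditional_char_entropy(text, order):
    """Maximum-likelihood estimate of H(X_n | X_1..X_{n-1}) in bits/char.

    Counts n-grams and their (n-1)-character contexts; only contexts
    that actually have a continuation are counted, so conditional
    probabilities sum to 1 for each context.
    """
    if order < 1:
        raise ValueError("order must be >= 1")
    positions = range(len(text) - order + 1)
    ngrams = Counter(text[i:i + order] for i in positions)
    contexts = Counter(text[i:i + order - 1] for i in positions)
    total = sum(ngrams.values())
    h = 0.0
    for gram, count in ngrams.items():
        p_joint = count / total                      # P(context, next char)
        p_cond = count / contexts[gram[:-1]]         # P(next char | context)
        h -= p_joint * math.log2(p_cond)
    return h

def average_joint_entropy(text, order):
    """H(X_1..X_n)/n in bits/char: block entropy averaged per character."""
    ngrams = Counter(text[i:i + order] for i in range(len(text) - order + 1))
    total = sum(ngrams.values())
    h = -sum((c / total) * math.log2(c / total) for c in ngrams.values())
    return h / order
```

On a perfectly periodic string such as "abab...", the order-1 conditional entropy is 1 bit while the order-2 conditional entropy drops to 0, since each character fully determines the next; this mirrors how conditional entropies fall toward a limit as the order grows.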