Using a large synchronous Chinese corpus, we show how word and character entropy variations exhibit interesting differences across time and space for different Chinese speech communities. We find that word entropy values are affected by the quality of the segmentation process. We also note that word entropies can be affected by proper nouns, which are the most volatile segment of the otherwise stable lexicon of the language. Our word and character entropy results provide an interesting comparison with earlier results. Moreover, the average joint character entropies (a.k.a. entropy rates) of Chinese that we compute up to order 20 indicate that the limits of the conditional character entropies of Chinese for the different speech communities should be about 1 bit (or less). This invites the question of whether early convergence of character entropies would also entail word entropy convergence.
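The order-n quantities discussed above can be illustrated with a simplified maximum-likelihood sketch: the conditional character entropy H(X_n | X_1..X_{n-1}) and the average joint (block) entropy H(X_1..X_n)/n, both estimated from raw n-gram counts. This is only a toy illustration with hypothetical function names; serious estimates at high orders (such as order 20) require smoothed models like PPM and much larger corpora, since plain counts severely underestimate entropy.

```python
import math
from collections import Counter

def conditional_char_entropy(text, order):
    """Maximum-likelihood estimate of H(X_n | X_1..X_{n-1}) in bits/char.

    Counts n-grams and their (n-1)-character contexts; only contexts
    that actually have a continuation are counted, so conditional
    probabilities sum to 1 for each context.
    """
    if order < 1:
        raise ValueError("order must be >= 1")
    positions = range(len(text) - order + 1)
    ngrams = Counter(text[i:i + order] for i in positions)
    contexts = Counter(text[i:i + order - 1] for i in positions)
    total = sum(ngrams.values())
    h = 0.0
    for gram, count in ngrams.items():
        p_joint = count / total                      # P(context, next char)
        p_cond = count / contexts[gram[:-1]]         # P(next char | context)
        h -= p_joint * math.log2(p_cond)
    return h

def average_joint_entropy(text, order):
    """H(X_1..X_n)/n in bits/char: block entropy averaged per character."""
    ngrams = Counter(text[i:i + order] for i in range(len(text) - order + 1))
    total = sum(ngrams.values())
    h = -sum((c / total) * math.log2(c / total) for c in ngrams.values())
    return h / order
```

On a perfectly periodic string such as "abab...", the order-1 conditional entropy is 1 bit while the order-2 conditional entropy drops to 0, since each character fully determines the next; this mirrors how conditional entropies fall toward a limit as the order grows.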