Building a large annotated corpus of English: the Penn Treebank
Computational Linguistics - Special issue on using large corpora: II
Entropy rate constancy in text
ACL '02 Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics
Broad coverage paragraph segmentation across languages and domains
ACM Transactions on Speech and Language Processing (TSLP)
Using linguistically motivated features for paragraph boundary identification
EMNLP '06 Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing
A noisy-channel model of rational human sentence comprehension under uncertain input
EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
NAACL '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics
A rational model of eye movement control in reading
ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
Close = relevant?: the role of context in efficient language production
CMCL '10 Proceedings of the 2010 Workshop on Cognitive Modeling and Computational Linguistics
A paragraph boundary detection system
CICLing '05 Proceedings of the 6th International Conference on Computational Linguistics and Intelligent Text Processing
In this paper we explore how the entropy of a sentence varies as a function of its position in the text. We demonstrate that while sentence entropy increases with sentence number, it drops at paragraph boundaries, in accordance with the Entropy Rate Constancy principle introduced in related work. We further demonstrate that the principle holds across genres and languages, and we examine the role of genre informativeness. Finally, we investigate potential causes of entropy variation by looking at tree depth, branching factor, constituent size, and the occurrence of gapping.
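The measurement the abstract describes can be sketched in a few lines: estimate the per-word entropy of each sentence under a language model, then average those estimates by within-paragraph sentence number. This is a minimal illustration only; the add-one-smoothed unigram model and the toy two-paragraph corpus below are stand-ins (assumptions), not the models or data used in the paper.

```python
import math
from collections import Counter

def sentence_entropy(sentence, counts, vocab_size, total):
    """Per-word cross-entropy (in bits) under an add-one-smoothed unigram model."""
    scores = [-math.log2((counts.get(w, 0) + 1) / (total + vocab_size))
              for w in sentence]
    return sum(scores) / len(scores)

# Toy corpus: a list of paragraphs, each a list of tokenized sentences.
corpus = [
    [["the", "cat", "sat"], ["it", "slept", "all", "day"]],
    [["a", "dog", "barked"], ["the", "cat", "ran", "away", "quickly"]],
]

counts = Counter(w for para in corpus for sent in para for w in sent)
total = sum(counts.values())
vocab_size = len(counts)

# Group per-sentence entropies by sentence number within the paragraph,
# then average: a rising curve within paragraphs (with a drop at each
# paragraph boundary) is the pattern the paper tests for.
by_position = {}
for para in corpus:
    for i, sent in enumerate(para, start=1):
        by_position.setdefault(i, []).append(
            sentence_entropy(sent, counts, vocab_size, total))

means = {i: sum(vals) / len(vals) for i, vals in by_position.items()}
```

With a real corpus one would substitute an n-gram or neural language model for the unigram estimate and plot `means` against sentence number per genre and language.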