IEEE Transactions on Pattern Analysis and Machine Intelligence
Factor Oracle: A New Structure for Pattern Matching
SOFSEM '99 Proceedings of the 26th Conference on Current Trends in Theory and Practice of Informatics on Theory and Practice of Informatics
DCC '99 Proceedings of the Conference on Data Compression
Probabilistic Finite-State Machines-Part I
Introduction to probabilistic automata (Computer science and applied mathematics)
Natural language is probably one of the most common types of input for text processing algorithms. It is therefore often desirable to have a large set of such input for training and testing, especially for algorithms tuned to natural language texts. In many cases, the lack of a large natural language corpus can be addressed by simply concatenating a set of collected texts, even if they come from heterogeneous contexts and different authors. In this note we present a preliminary study of a finite-state model for text generation that preserves statistical and structural characteristics of natural language texts, namely Zipf's law and the inverse-rank power law, and thus provides a very good approximation of natural language for testing purposes.
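To make the idea of finite-state text generation concrete, the following is a minimal sketch and not the model studied in this note. It assumes a first-order, word-level Markov chain as a stand-in: each distinct word is a state, and transitions are weighted by how often one word follows another in a sample text. The function names `train_bigram_model`, `generate`, and `rank_frequency` are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter, defaultdict


def train_bigram_model(words):
    """Map each word (a state) to a Counter of the words that follow it."""
    transitions = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        transitions[current][following] += 1
    return transitions


def generate(transitions, start, length, seed=0):
    """Emit `length` words by walking the chain; restart at `start` on a dead end."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    word, output = start, []
    for _ in range(length):
        output.append(word)
        followers = transitions.get(word)
        if not followers:
            # Assumption of this sketch: a state with no outgoing
            # transition sends the walk back to the start state.
            word = start
            continue
        choices, weights = zip(*followers.items())
        word = rng.choices(choices, weights=weights, k=1)[0]
    return output


def rank_frequency(words):
    """Return (rank, word, count) triples, most frequent word first."""
    ranked = Counter(words).most_common()
    return [(rank, w, c) for rank, (w, c) in enumerate(ranked, start=1)]
```

With a realistic training corpus, plotting count against rank from `rank_frequency` on a log-log scale gives a quick visual check of whether the generated text follows a Zipf-like rank-frequency curve, in which the count of the word at rank r falls off roughly as a power of 1/r. A toy input is far too small to show this behaviour.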