Finite State Models for the Generation of Large Corpora of Natural Language Texts

Authors:
Domenico Cantone;Salvatore Cristofaro;Simone Faro;Emanuele Giaquinta
Affiliations:
Università di Catania, Dipartimento di Matematica e Informatica, Viale Andrea Doria 6, I-95125 Catania, Italy, {cantone | cristofaro | faro | giaquinta}@dmi.unict.it
Venue:
Proceedings of the 2009 conference on Finite-State Methods and Natural Language Processing: Post-proceedings of the 7th International Workshop FSMNLP 2008
Year:
2009

Abstract

Natural language is probably one of the most common types of input for text processing algorithms. It is therefore often desirable to have a large training/testing set of such input, especially when dealing with algorithms tuned for natural language texts. In many cases, the lack of a large corpus of natural language texts can be addressed by simply concatenating a set of collected texts, even ones from heterogeneous contexts and by different authors. In this note we present a preliminary study of a finite state model for text generation that preserves statistical and structural characteristics of natural language texts, namely Zipf's law and the inverse-rank power law, thus providing a very good approximation for testing purposes.
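
As a rough illustration of the general idea only, the sketch below builds a word-level first-order Markov chain (a simple probabilistic finite-state model) from a seed text, generates a longer text from it, and computes the rank-frequency table one would inspect when checking for Zipf-like behaviour, i.e. word frequency falling off roughly as a power of rank. This is not the model studied in the paper, whose details the abstract does not give. The function names `train_bigram_model`, `generate`, and `rank_frequency`, the restart-on-dead-end rule, and the toy seed text are all assumptions made for this example.

```python
import random
from collections import Counter, defaultdict


def train_bigram_model(words):
    """Build a first-order Markov chain: each word is a state, and each
    observed successor is a weighted outgoing transition."""
    transitions = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        transitions[current][following] += 1
    return transitions


def generate(transitions, start, length, seed=0):
    """Random walk over the chain, emitting one word per step.

    Assumption for this sketch: when a state has no outgoing transition,
    the walk restarts from a uniformly chosen state that has one.
    """
    if not transitions:
        raise ValueError("model has no transitions; seed text is too short")
    rng = random.Random(seed)
    state = start
    output = [state]
    while len(output) < length:
        successors = transitions.get(state)
        if successors:
            choices, weights = zip(*sorted(successors.items()))
            state = rng.choices(choices, weights=weights, k=1)[0]
        else:
            state = rng.choice(sorted(transitions))
        output.append(state)
    return output


def rank_frequency(words):
    """Return (rank, frequency) pairs, with the most frequent word at rank 1."""
    counts = Counter(words)
    return [(rank, freq)
            for rank, (_, freq) in enumerate(counts.most_common(), start=1)]


if __name__ == "__main__":
    # Toy seed text, purely illustrative; a real experiment would use a corpus.
    seed_text = "the cat sat on the mat and the dog sat on the rug".split()
    model = train_bigram_model(seed_text)
    generated = generate(model, start="the", length=200, seed=1)
    for rank, freq in rank_frequency(generated)[:5]:
        print(rank, freq)
```

On a realistic corpus, plotting these pairs on log-log axes and comparing the slope with that of the source text would be one way to judge how well a generator preserves the rank-frequency distribution. A toy seed text of this size is far too small to show anything meaningful.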