Finite State Models for the Generation of Large Corpora of Natural Language Texts

  • Authors:
  • Domenico Cantone;Salvatore Cristofaro;Simone Faro;Emanuele Giaquinta

  • Affiliations:
  • Università di Catania, Dipartimento di Matematica e Informatica, Viale Andrea Doria 6, I-95125 Catania, Italy, {cantone | cristofaro | faro | giaquinta}@dmi.unict.it

  • Venue:
  • Proceedings of the 2009 conference on Finite-State Methods and Natural Language Processing: Post-proceedings of the 7th International Workshop FSMNLP 2008
  • Year:
  • 2009

Abstract

Natural languages are probably one of the most common types of input for text processing algorithms. Therefore, it is often desirable to have a large training/testing set of inputs of this kind, especially when dealing with algorithms tuned for natural language texts. In many cases, the problem posed by the lack of a large corpus of natural language texts can be solved by simply concatenating a set of collected texts, even if they come from heterogeneous contexts and different authors. In this note we present a preliminary study on a finite state model for text generation which preserves statistical and structural characteristics of natural language texts, namely Zipf's law and the inverse-rank power law, thus providing a very good approximation for testing purposes.
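
The abstract does not specify the finite state model itself, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a first-order word-level Markov chain (each distinct word is a state, transitions follow observed word pairs) and a toy seed text; the names `build_chain`, `generate`, `rank_frequency`, and `SEED_TEXT` are hypothetical and chosen for illustration. The rank-frequency table it produces is the quantity one would inspect when checking Zipf's law, under which a word's frequency is roughly inversely proportional to its rank.

```python
import random
from collections import Counter, defaultdict


def build_chain(words):
    """Map each word (a state) to the list of words observed right after it."""
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def generate(chain, start, length, seed=0):
    """Walk the chain for `length` words; jump back to `start` on a dead end."""
    rng = random.Random(seed)
    state = start
    output = [state]
    while len(output) < length:
        successors = chain.get(state)
        state = rng.choice(successors) if successors else start
        output.append(state)
    return output


def rank_frequency(words):
    """Return (rank, word, count) triples sorted by decreasing count."""
    ordered = Counter(words).most_common()
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]


# Toy seed text: an assumption for illustration; any tokenized text works.
SEED_TEXT = ("the cat sat on the mat and the dog sat on the rug "
             "and the cat saw the dog on the mat").split()

chain = build_chain(SEED_TEXT)
corpus = generate(chain, start="the", length=1000, seed=42)
table = rank_frequency(corpus)

for rank, word, count in table:
    print(rank, word, count)
```

With a seed text this small the table says little about Zipf's law; the sketch only shows the mechanics. On a realistic seed corpus one would compare `count * rank` across ranks, which stays roughly constant when the generated text follows Zipf's law.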