A new method of N-gram statistics for large number of n and automatic extraction of words and phrases from large text data of Japanese

  • Authors:
  • Makoto Nagao; Shinsuke Mori

  • Affiliations:
  • Kyoto University; Kyoto University

  • Venue:
  • COLING '94 Proceedings of the 15th conference on Computational linguistics - Volume 1
  • Year:
  • 1994

Abstract

In the process of establishing information theory, C. E. Shannon proposed the Markov process as a good model of natural language. The core of this idea is to calculate the frequencies of strings of n characters (n-grams), but such statistical analysis had never been carried out for large text data and large n because of computer memory limitations and a shortage of text data. Taking advantage of recent powerful computers, we developed a new algorithm for computing n-gram statistics of large text data for arbitrarily large n, and successfully calculated, within a relatively short time, n-grams of Japanese text corpora containing between two and thirty million characters. This experiment made it clear that words, compound words, and collocations can be extracted automatically by comparing n-gram statistics across different values of n.
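The counting the abstract describes can be sketched in miniature. The paper's actual method works on a sorted table of pointers into the text (in essence a suffix array), which groups all occurrences of every prefix together; the function below is a simplified, hypothetical illustration of that idea rather than the authors' implementation:

```python
from collections import Counter

def ngram_counts(text: str, n: int) -> Counter:
    """Count all character n-grams in text via a sorted suffix list.

    Sorting the suffixes groups identical n-character prefixes
    together, so each distinct n-gram's frequency can be read off in
    a single pass. A real suffix array would store only start offsets
    (one integer per position) instead of the suffix strings; strings
    are used here purely for clarity.
    """
    suffixes = sorted(text[i:] for i in range(len(text) - n + 1))
    counts = Counter()
    for s in suffixes:
        counts[s[:n]] += 1
    return counts
```

For example, `ngram_counts("abracadabra", 2)` yields a count of 2 for the bigrams "ab", "br", and "ra". Comparing such tables for successive n, as the abstract suggests, lets one spot strings whose frequency stays stable as n grows, which is a signal that the string is a word or collocation rather than a fragment.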