Discovering Characteristic Expressions from Literary Works: A New Text Analysis Method beyond N-Gram Statistics and KWIC

  • Authors:
  • Masayuki Takeda;Tetsuya Matsumoto;Tomoko Fukada;Ichiro Nanri

  • Affiliations:
  • -;-;-;-

  • Venue:
  • DS '00 Proceedings of the Third International Conference on Discovery Science
  • Year:
  • 2000

Quantified Score

Hi-index 0.00

Visualization

Abstract

We attempt to extract characteristic expressions from literary works. That is, our problem is, given literary works by a particular writer as positive examples and works by another writer as negative examples, to find expressions that appear frequently in the positive examples but do not so in the negative examples. It is considered as a special case of the optimal pattern discovery from textual data, in which only the substring patterns are considered. One reasonable approach is to create a list of substrings arranged in the descending order of their goodness, and to examine a first part of the list by a human expert. Since there is no word boundary in Japanese texts, a substring is often a fragment of a word or a phrase. How to assist the human expert is a key to success in discovery. In this paper, we propose (1) to restrict to the prime substrings in order to remove redundancy from the list, and (2) a way of browsing the neighbor of a focused string as well as its context. Using this method, we report successful results against two pairs of anthologies of classical Japanese poems. We expect that the extracted expressions will possibly lead to discovering overlooked aspects of individual poets.