Discovering Characteristic Expressions from Literary Works: A New Text Analysis Method beyond N-Gram Statistics and KWIC

Authors:
Masayuki Takeda; Tetsuya Matsumoto; Tomoko Fukada; Ichiro Nanri
Venue:
DS '00 Proceedings of the Third International Conference on Discovery Science
Year:
2000

Abstract

We address the problem of extracting characteristic expressions from literary works. Given works by a particular writer as positive examples and works by another writer as negative examples, the task is to find expressions that appear frequently in the positive examples but not in the negative examples. This can be viewed as a special case of optimal pattern discovery from textual data in which only substring patterns are considered. One reasonable approach is to list the substrings in descending order of goodness and have a human expert examine the top portion of the list. Because Japanese text has no explicit word boundaries, a substring is often a fragment of a word or phrase, so assisting the human expert is key to successful discovery. In this paper, we propose (1) restricting the list to prime substrings in order to remove redundancy, and (2) a way of browsing the neighborhood of a focused string as well as its context. We report successful results of applying this method to two pairs of anthologies of classical Japanese poems. We expect that the extracted expressions may lead to the discovery of overlooked aspects of individual poets.
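The ranking step described in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions, because the abstract names neither the goodness measure nor the definition of a prime substring. Here goodness is assumed to be the fraction of positive texts containing a substring minus the fraction of negative texts containing it. Redundancy removal is a crude stand-in for the prime-substring restriction: a substring is dropped when a longer candidate contains it and occurs in exactly the same texts. The names `substrings`, `rank_substrings`, and `drop_redundant`, the `max_len` cutoff, and the toy strings are all hypothetical.

```python
from collections import defaultdict


def substrings(text, max_len):
    """Return the set of all substrings of `text` with length 1..max_len."""
    return {text[i:i + n]
            for n in range(1, max_len + 1)
            for i in range(len(text) - n + 1)}


def rank_substrings(positive, negative, max_len=3):
    """Score each substring of the positive texts.

    Assumed goodness: share of positive texts containing the substring
    minus share of negative texts containing it (not the paper's measure).
    Returns a list of (score, substring, positive_hits, negative_hits).
    """
    pos_hits = defaultdict(set)
    neg_hits = defaultdict(set)
    for idx, text in enumerate(positive):
        for s in substrings(text, max_len):
            pos_hits[s].add(idx)
    for idx, text in enumerate(negative):
        for s in substrings(text, max_len):
            neg_hits[s].add(idx)

    scored = []
    for s, hits in pos_hits.items():
        neg = neg_hits.get(s, set())
        score = len(hits) / len(positive) - len(neg) / len(negative)
        scored.append((score, s, hits, neg))
    return scored


def drop_redundant(scored):
    """Crude stand-in for the prime-substring restriction (an assumption).

    Candidates that occur in exactly the same texts are grouped; within a
    group, a string is dropped if a longer member contains it.
    """
    groups = defaultdict(list)
    for score, s, pos, neg in scored:
        groups[(frozenset(pos), frozenset(neg))].append((score, s))

    kept = []
    for members in groups.values():
        for score, s in members:
            if not any(s != t and s in t for _, t in members):
                kept.append((score, s))
    # Best score first; longer strings first among ties.
    return sorted(kept, key=lambda p: (-p[0], -len(p[1]), p[1]))


if __name__ == "__main__":
    # Toy data standing in for two writers' works.
    positive = ["abcx", "abcy", "zabc"]
    negative = ["xyz", "bca"]
    for score, s in drop_redundant(rank_substrings(positive, negative))[:5]:
        print(f"{score:+.2f}  {s}")
```

On this toy input, `abc` ranks first, and its fragment `ab` is removed because it occurs in exactly the same texts. A human expert would then inspect the top of such a list, which is where the browsing support proposed in the paper comes in.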