Discovering characteristic expressions in literary works

Authors:
Masayuki Takeda; Tetsuya Matsumoto; Tomoko Fukuda; Ichiro Nanri
Affiliations:
Department of Informatics, Kyushu University 33, Fukuoka 812-8581, Japan and PRESTO, Japan Science and Technology Corporation (JST), Kawaguchi 332-0012, Japan; NTT DoCoMo, Inc. and Department of Informatics, Kyushu University 33, Fukuoka 812-8581, Japan; Junshin Women's Junior College, Fukuoka 815-0036, Japan; Junshin Women's Junior College, Fukuoka 815-0036, Japan
Venue:
Theoretical Computer Science
Year:
2003

Citing 5

Complete inverted files for efficient text retrieval and analysis
Journal of the ACM (JACM)

Text algorithms
Text algorithms

From data mining to knowledge discovery: an overview
Advances in knowledge discovery and data mining

Efficient Text Mining with Optimized Pattern Discovery
CPM '02 Proceedings of the 13th Annual Symposium on Combinatorial Pattern Matching

Discovering instances of poetic allusion from anthologies of classical Japanese poems
Theoretical Computer Science

Cited 7

Mining from Literary Texts: Pattern Discovery and Similarity Computation
Progress in Discovery Science, Final Report of the Japanese Discovery Science Project

Discovering Repetitive Expressions and Affinities from Anthologies of Classical Japanese Poems
DS '01 Proceedings of the 4th International Conference on Discovery Science

On-line construction of compact directed acyclic word graphs
Discrete Applied Mathematics - 12th annual symposium on combinatorial pattern matching (CPM)

On-line construction of compact directed acyclic word graphs
Discrete Applied Mathematics

Unsupervised spam detection based on string alienness measures
DS'07 Proceedings of the 10th international conference on Discovery science

Special factors and the combinatorics of suffix and factor automata
Theoretical Computer Science

Efficient computation of substring equivalence classes with suffix arrays
CPM'07 Proceedings of the 18th annual conference on Combinatorial Pattern Matching

Abstract

We attempt to extract characteristic expressions from literary works. Given two collections of literary works, one written by a particular author (positive examples) and the other by a different author (negative examples), the problem is to find expressions that appear frequently in the positive examples but seldom in the negative examples. This can be regarded as a special case of optimal pattern discovery from textual data in which only substring patterns are considered. One approach is to create a list of text substrings sorted by goodness and to have a human expert scrutinize the top of the list. Since Japanese text has no word boundaries, however, a substring is often a mere fragment of a word or phrase, so assisting the domain experts who perform this task is a key problem. In this paper, we propose partitioning the text substrings into equivalence classes under an equivalence relation on strings originally defined by Blumer et al. (J. ACM 34(3) (1987) 578). The relation has the desirable property that all members of an equivalence class necessarily share the same goodness value, which effectively reduces the effort of evaluating mined patterns. We also present a method for browsing the possible superstrings of a focused string together with its context. We report successful results on two pairs of anthologies of classical Japanese poems. We expect that the extracted expressions may lead to the discovery of overlooked aspects of individual poets.
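
The idea of ranking equivalence classes rather than individual substrings can be illustrated with a small brute-force sketch. It rests on several assumptions that are not taken from the paper: the goodness measure (a difference of document-frequency fractions) is purely illustrative, the names `closure`, `goodness`, `ranked_classes` and `max_len` are hypothetical, and the equivalence is computed naively by extending a substring left and right for as long as every one of its occurrences is surrounded by the same character, which is one reading of the relation of Blumer et al. An efficient implementation would rely on an index structure instead of repeated scanning.

```python
from collections import defaultdict


def occurrences(texts, x):
    """Return every (text index, start offset) at which x occurs."""
    hits = []
    for i, t in enumerate(texts):
        start = t.find(x)
        while start != -1:
            hits.append((i, start))
            start = t.find(x, start + 1)
    return hits


def closure(texts, x):
    """Longest string in which every occurrence of x is embedded.

    Assumes x occurs at least once in texts. Two substrings are treated
    as equivalent when their closures coincide (an assumption of this sketch).
    """
    occ = occurrences(texts, x)
    length = len(x)
    # Extend to the left while all occurrences share the preceding character.
    while True:
        chars = {texts[i][s - 1] if s > 0 else None for i, s in occ}
        if len(chars) != 1 or None in chars:
            break
        occ = [(i, s - 1) for i, s in occ]
        length += 1
    # Extend to the right while all occurrences share the following character.
    while True:
        chars = {
            texts[i][s + length] if s + length < len(texts[i]) else None
            for i, s in occ
        }
        if len(chars) != 1 or None in chars:
            break
        length += 1
    i, s = occ[0]
    return texts[i][s:s + length]


def goodness(pos, neg, x):
    """Illustrative score only: fraction of positive minus negative texts containing x."""
    pos_hits = sum(1 for t in pos if x in t)
    neg_hits = sum(1 for t in neg if x in t)
    return pos_hits / len(pos) - neg_hits / len(neg)


def ranked_classes(pos, neg, max_len=6):
    """Group substrings of the positive texts by closure and rank the classes."""
    texts = pos + neg
    classes = defaultdict(set)
    for t in pos:
        for i in range(len(t)):
            for j in range(i + 1, min(len(t), i + max_len) + 1):
                x = t[i:j]
                classes[closure(texts, x)].add(x)
    ranked = [
        (goodness(pos, neg, rep), rep, sorted(members))
        for rep, members in classes.items()
    ]
    ranked.sort(key=lambda r: (-r[0], r[1]))
    return ranked


if __name__ == "__main__":
    pos = ["haru no yo", "haru no hana"]   # toy romanized stand-ins, not real data
    neg = ["aki no yo"]
    for score, rep, members in ranked_classes(pos, neg):
        print(f"{score:+.2f}  {rep!r}  ({len(members)} substrings)")
```

On the toy input, fragments such as "har" and "aru" fall into the same class as "haru no ", so a reader inspects one representative instead of many overlapping fragments. Because members of a class occur at exactly the same places, any score that depends only on where a pattern occurs is constant within the class.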