Complete inverted files for efficient text retrieval and analysis
Journal of the ACM (JACM)
Text algorithms
From data mining to knowledge discovery: an overview
Advances in knowledge discovery and data mining
Efficient Text Mining with Optimized Pattern Discovery
CPM '02 Proceedings of the 13th Annual Symposium on Combinatorial Pattern Matching
Discovering instances of poetic allusion from anthologies of classical Japanese poems
Theoretical Computer Science
Mining from Literary Texts: Pattern Discovery and Similarity Computation
Progress in Discovery Science, Final Report of the Japanese Discovery Science Project
Discovering Repetitive Expressions and Affinities from Anthologies of Classical Japanese Poems
DS '01 Proceedings of the 4th International Conference on Discovery Science
On-line construction of compact directed acyclic word graphs
Discrete Applied Mathematics - 12th annual symposium on combinatorial pattern matching (CPM)
On-line construction of compact directed acyclic word graphs
Discrete Applied Mathematics
Unsupervised spam detection based on string alienness measures
DS'07 Proceedings of the 10th international conference on Discovery science
Special factors and the combinatorics of suffix and factor automata
Theoretical Computer Science
Efficient computation of substring equivalence classes with suffix arrays
CPM'07 Proceedings of the 18th annual conference on Combinatorial Pattern Matching
We attempt to extract characteristic expressions from literary works. Given two collections of literary works, one written by a particular author (positive examples) and the other by a different author (negative examples), the problem is to find expressions that appear frequently in the positive examples but seldom in the negative examples. This can be viewed as a special case of optimal pattern discovery from textual data in which only substring patterns are considered. One approach is to list the text substrings sorted by goodness and have a human expert scrutinize the top of the list. Since Japanese text has no word boundaries, however, a substring is often a fragment of a word or phrase, so assisting the domain experts who carry out this task is a key problem. In this paper, we propose partitioning the text substrings into equivalence classes under an equivalence relation on strings originally defined by Blumer et al. (J. ACM 34(3) (1987) 578). The relation has the desirable property that all members of an equivalence class necessarily share the same goodness value, so an expert can evaluate one class at a time rather than each substring separately, which reduces the effort of evaluating mined patterns. We also present a method for browsing the possible superstrings of a focused string together with its context. We report successful results with two pairs of anthologies of classical Japanese poems. We expect that the extracted expressions may lead to the discovery of overlooked aspects of individual poets.
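The partitioning idea can be illustrated with a small brute-force sketch in Python. It assumes one common reading of the equivalence relation: two substrings are equivalent when extending each of them left and right, for as long as every occurrence has the same neighboring character, yields the same maximal string. The function names (`occurrences`, `canonical_form`, `equivalence_classes`, `doc_counts`), the toy documents, and the use of document counts as the input to a goodness score are all assumptions made for illustration. The sketch is not meant to be efficient.

```python
from collections import defaultdict

def occurrences(docs, x):
    """All (doc index, start offset) pairs where x occurs; overlaps allowed."""
    return [(d, i) for d, doc in enumerate(docs)
            for i in range(len(doc) - len(x) + 1) if doc.startswith(x, i)]

def canonical_form(docs, x):
    """Longest string that contains x at the same offset in every occurrence of x."""
    occ = occurrences(docs, x)
    if not occ:
        raise ValueError("x must occur in docs")
    left, right = 0, len(x)
    # Extend left while every occurrence is preceded by the same character.
    while all(i - left > 0 for _, i in occ) and \
            len({docs[d][i - left - 1] for d, i in occ}) == 1:
        left += 1
    # Extend right while every occurrence is followed by the same character.
    while all(i + right < len(docs[d]) for d, i in occ) and \
            len({docs[d][i + right] for d, i in occ}) == 1:
        right += 1
    d, i = occ[0]
    return docs[d][i - left:i + right]

def equivalence_classes(docs):
    """Group every distinct non-empty substring by its canonical form."""
    substrings = {doc[i:j] for doc in docs
                  for i in range(len(doc)) for j in range(i + 1, len(doc) + 1)}
    classes = defaultdict(set)
    for x in substrings:
        classes[canonical_form(docs, x)].add(x)
    return dict(classes)

def doc_counts(x, positive, negative):
    """(positive docs containing x, negative docs containing x)."""
    return (sum(x in doc for doc in positive),
            sum(x in doc for doc in negative))

# Toy data (hypothetical): "c" is always preceded by "b", so it joins the class of "bc".
positive = ["abcd", "xbcy"]
negative = ["bz"]
classes = equivalence_classes(positive + negative)
print(sorted(classes["bc"]))  # ['bc', 'c']
```

Because every member of a class occurs in exactly the same documents, any goodness score computed from the positive and negative document counts takes a single value per class in this sketch.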