Mining Peculiar Compositions of Frequent Substrings from Sparse Text Data Using Background Texts

Authors:
Daisuke Ikeda;Einoshin Suzuki
Affiliations:
Department of Informatics, Kyushu University, Fukuoka, Japan 819-0395;Department of Informatics, Kyushu University, Fukuoka, Japan 819-0395
Venue:
ECML PKDD '09 Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases: Part I
Year:
2009

Abstract

We consider mining unusual patterns from a text T. Unlike existing methods, which assume probabilistic models and use simple estimation methods, we employ a set B of background texts in addition to T, and use compositions w = xy of strings x and y as patterns. A string w is peculiar if there exist x and y such that w = xy, each of x and y is more frequent in B than in T, and conversely w = xy is more frequent in T than in B. The frequency of xy in T is very small since x and y are infrequent in T, but xy is relatively abundant in T compared to xy in B. Despite these complex conditions for peculiar compositions, we develop a fast algorithm that finds them using the suffix tree. Experiments using DNA sequences show the scalability of our algorithm, which is due to our pruning techniques, and the superiority of the concept of the peculiar composition.
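
To make the definition of a peculiar composition concrete, the following is a minimal brute-force sketch in Python. It is not the suffix-tree algorithm of the paper and includes none of its pruning. The abstract does not say how frequency is measured, so the sketch assumes relative frequency (overlapping occurrences divided by the number of candidate start positions); the function names `count_occurrences`, `relative_frequency`, and `peculiar_splits` and the toy strings are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def count_occurrences(text: str, pattern: str) -> int:
    """Count possibly overlapping occurrences of pattern in text."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def relative_frequency(texts: List[str], pattern: str) -> float:
    """Occurrences divided by the number of possible start positions.

    ASSUMPTION: the abstract only says "more frequent"; normalising by
    the number of start positions is one plausible reading, not
    necessarily the measure used by the authors.
    """
    positions = sum(max(len(t) - len(pattern) + 1, 0) for t in texts)
    if positions == 0:
        return 0.0
    total = sum(count_occurrences(t, pattern) for t in texts)
    return total / positions


def peculiar_splits(w: str, target: str,
                    background: List[str]) -> List[Tuple[str, str]]:
    """Return every split (x, y) with w = xy that makes w peculiar:
    x and y are each more frequent in B than in T, while w itself is
    more frequent in T than in B."""
    if relative_frequency([target], w) <= relative_frequency(background, w):
        return []
    splits = []
    for i in range(1, len(w)):
        x, y = w[:i], w[i:]
        x_ok = relative_frequency(background, x) > relative_frequency([target], x)
        y_ok = relative_frequency(background, y) > relative_frequency([target], y)
        if x_ok and y_ok:
            splits.append((x, y))
    return splits


# Toy data (hypothetical): "a" and "b" are rarer in T than in B,
# yet they occur adjacently only in T.
T = "cdcdabcdcd"
B = ["acacbdbd"]
print(peculiar_splits("ab", T, B))
print(peculiar_splits("cd", T, B))
```

On this toy input, "ab" qualifies through the split ("a", "b"), while "cd" does not, because "c" is more frequent in T than in B. The sketch rescans the texts for each candidate, so it is only suitable for small inputs; the abstract attributes the scalability of the actual method to the suffix tree and pruning.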