Most significant substring mining based on chi-square measure

Authors:
Sourav Dutta;Arnab Bhattacharya
Affiliations:
Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur, India;Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur, India
Venue:
PAKDD'10 Proceedings of the 14th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining - Volume Part I
Year:
2010


Abstract

Given the vast reservoirs of sequence data stored worldwide, efficient mining of string databases such as intrusion detection systems, player statistics, texts, and proteins has emerged as a great challenge. Searching for an unusual pattern within long strings of data is a requirement for diverse applications. Given a string, the problem is to identify the substrings that differ the most from the expected or normal behavior, i.e., the substrings that are statistically significant (less likely to occur due to chance alone). To this end, we use the chi-square measure and propose two heuristics for retrieving the top-k substrings with the largest chi-square measure. We show that the proposed algorithms outperform competing algorithms in runtime while maintaining a high approximation ratio of more than 0.96.
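To make the problem statement concrete, the sketch below scores substrings with the Pearson chi-square statistic and finds the top-k by exhaustive search. It is not one of the two heuristics proposed in the paper, which the abstract does not describe; it is only a naive baseline of the kind such heuristics would be compared against. The function names, the `probs` parameter (assumed per-character probabilities under the null model), and the assumption that characters are drawn independently are all illustrative choices, not taken from the source.

```python
import heapq
from collections import Counter


def chi_square_from_counts(counts, length, probs):
    """Pearson chi-square: sum over characters of (observed - expected)^2 / expected."""
    return sum(
        (counts.get(ch, 0) - length * p) ** 2 / (length * p)
        for ch, p in probs.items()
    )


def chi_square(substring, probs):
    """Chi-square value of one substring under the assumed character probabilities."""
    return chi_square_from_counts(Counter(substring), len(substring), probs)


def top_k_substrings(text, probs, k):
    """Exhaustive baseline: score all O(n^2) substrings and keep the k largest.

    Assumes every character of `text` has an entry in `probs`.
    Returns (score, start, end) tuples with `end` exclusive, best first.
    """
    scored = []
    n = len(text)
    for start in range(n):
        counts = dict.fromkeys(probs, 0)  # running character counts for text[start:end]
        for end in range(start, n):
            counts[text[end]] += 1
            length = end - start + 1
            score = chi_square_from_counts(counts, length, probs)
            scored.append((score, start, end + 1))
    return heapq.nlargest(k, scored)


if __name__ == "__main__":
    # Hypothetical example: a fair two-letter alphabet and a string with a long run of 'a'.
    uniform = {"a": 0.5, "b": 0.5}
    for score, start, end in top_k_substrings("abababaaaaaaab", uniform, 3):
        print(f"{score:.3f}  [{start}:{end}]  {'abababaaaaaaab'[start:end]}")
```

Under these assumptions, the highest-scoring substring of the example string is the run of seven a's at positions 6 to 12, with a chi-square value of 7. The quadratic number of candidate substrings is what makes the exhaustive approach impractical for long strings and motivates faster approximate methods.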