Assessing the Statistical Significance of Overrepresented Oligonucleotides
WABI '01 Proceedings of the First International Workshop on Algorithms in Bioinformatics
Finding surprising patterns in a time series database in linear time and space
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Mining statistically significant substrings using the chi-square statistic
Proceedings of the VLDB Endowment
Hi-index | 0.00 |
Given the vast reservoirs of sequence data stored worldwide, efficient mining of string databases such as intrusion detection systems, player statistics, texts, proteins, etc. has emerged as a great challenge. Searching for an unusual pattern within long strings of data has emerged as a requirement for diverse applications. Given a string, the problem then is to identify the substrings that differs the most from the expected or normal behavior, i.e., the substrings that are statistically significant (i.e., less likely to occur due to chance alone). To this end, we use the chi-square measure and propose two heuristics for retrieving the top-k substrings with the largest chi-square measure. We show that the algorithms outperform other competing algorithms in the runtime, while maintaining a high approximation ratio of more than 0.96.