Pattern discovery in DNA sequences is one of the most challenging tasks in molecular biology and computer science. Its main goal is to identify sequences with important biological functions hidden in large volumes of genomic data. Several methods and techniques have been proposed and implemented in this field. However, to reduce computational time and complexity, most of them either focus on finding short DNA patterns or require pattern lengths to be specified explicitly in advance. Scientists need to find longer patterns without specifying pattern lengths in advance, while still achieving good performance.

In this paper, we propose a pattern discovery algorithm called Pattern Discovery with Confidence (PDC). Based on biological studies, we propose a new measure that can identify overrepresented patterns in DNA sequences. Using this measure, the PDC algorithm narrows the search space by checking dependency along the pattern, and so extends each pattern as far as possible without restricting or specifying its length in advance. Experimental tests demonstrate that this approach can find long, interesting patterns within a reasonable computation time.
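
The abstract does not define the confidence measure or the exact extension procedure, so the following is only a minimal sketch of the general idea of growing a pattern while a dependency check holds, not the PDC algorithm itself. It assumes an association-rule style confidence, count(pattern + base) / count(pattern), and the names `count_occurrences`, `extend_patterns`, `seed_len`, `min_support`, and `min_conf` are hypothetical choices made for illustration.

```python
from collections import Counter


def count_occurrences(sequences, pattern):
    """Count occurrences of `pattern` across all sequences, overlaps included."""
    total = 0
    for seq in sequences:
        start = seq.find(pattern)
        while start != -1:
            total += 1
            start = seq.find(pattern, start + 1)
    return total


def extend_patterns(sequences, seed_len=2, min_support=2, min_conf=0.8,
                    alphabet="ACGT"):
    """Grow frequent seeds to the right while an ASSUMED confidence test holds.

    Assumption (not taken from the paper): the confidence of extending
    `pattern` by `base` is count(pattern + base) / count(pattern).
    No pattern length is fixed in advance; a pattern stops growing when
    no extension passes both thresholds.
    """
    if min_support < 1:
        raise ValueError("min_support must be at least 1")

    # Seeds: every substring of length seed_len that is frequent enough.
    seeds = Counter()
    for seq in sequences:
        for i in range(len(seq) - seed_len + 1):
            seeds[seq[i:i + seed_len]] += 1
    frontier = [p for p, c in seeds.items() if c >= min_support]

    maximal = set()
    while frontier:
        next_frontier = []
        for pattern in frontier:
            base_count = count_occurrences(sequences, pattern)
            extended = False
            for base in alphabet:
                candidate = pattern + base
                cand_count = count_occurrences(sequences, candidate)
                # Keep the extension only if it is frequent and strongly
                # dependent on the pattern it extends.
                if (cand_count >= min_support
                        and cand_count / base_count >= min_conf):
                    next_frontier.append(candidate)
                    extended = True
            if not extended:
                maximal.add(pattern)  # cannot be grown any further
        frontier = next_frontier

    # Longest patterns first; ties broken alphabetically.
    return sorted(maximal, key=lambda p: (-len(p), p))


if __name__ == "__main__":
    print(extend_patterns(["ACGT", "ACGT", "ACGT"], min_support=3, min_conf=1.0))
```

Because this sketch extends patterns only to the right, suffixes of a long pattern (here `CGT` and `GT`) are reported alongside it; a fuller implementation would presumably filter or merge them, but the abstract gives no detail on that step.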