We develop new methods for finding transcription start sites (TSS) of RNA Polymerase II binding genes in genomic DNA sequences. Employing Support Vector Machines with advanced sequence kernels, we achieve drastically higher prediction accuracies than state-of-the-art methods.

Motivation: Protein-coding genes are among the most important features of genomic DNA. While it is of great value to identify those genes and the proteins they encode, it is also crucial to understand how their transcription is regulated. To this end, one has to identify the corresponding promoters and the transcription factor binding sites they contain. TSS finders can be used to locate potential promoters. They may also be combined with other signal and content detectors to resolve entire gene structures.

Results: We have developed a novel kernel-based method, called ARTS, that accurately recognizes transcription start sites in human. Support Vector Machines would otherwise be too computationally expensive for this task; their application was made possible by efficient training and evaluation techniques based on suffix tries. In a carefully designed experimental study, we compare our TSS finder to state-of-the-art methods from the literature: McPromoter, Eponine and FirstEF. For given false positive rates within a reasonable range, we consistently achieve considerably higher true positive rates. For instance, ARTS finds about 35% true positives at a false positive rate of 1/1000, where the other methods find about half as many (18%).

Availability: Datasets, model selection results, whole genome predictions, and additional experimental results are available at

Contact: Gunnar.Raetsch@tuebingen.mpg.de
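The comparison in the Results paragraph reports the true positive rate each method reaches at a fixed false positive rate (for example 1/1000). A minimal sketch of how such a figure could be computed from classifier scores is given here. The function name `tpr_at_fpr`, its arguments, and the toy data are illustrative assumptions and are not taken from the ARTS evaluation code.

```python
def tpr_at_fpr(scores, labels, max_fpr):
    """Return the highest true positive rate reachable while keeping the
    false positive rate at or below max_fpr.

    scores: classifier outputs, higher means "more likely a TSS".
    labels: 1 for true TSS positions, 0 for decoy positions.
    (Hypothetical helper for illustration only.)
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    # Sort by decreasing score and sweep the decision threshold downwards.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    best_tpr = 0.0
    i = 0
    while i < len(ranked):
        # Examples with tied scores cross the threshold together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        if fp / n_neg > max_fpr:
            break  # lowering the threshold further exceeds the FPR budget
        best_tpr = tp / n_pos
        i = j
    return best_tpr


# Toy data (made up): three positives and three negatives.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_labels = [1, 1, 0, 1, 0, 0]
print(tpr_at_fpr(toy_scores, toy_labels, max_fpr=0.0))
```

On genome-scale data the same sweep would be run with `max_fpr=1/1000`, giving one number per TSS finder that can be compared directly, as in the 35% versus 18% figures quoted above.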