Term-weighting approaches in automatic text retrieval
Information Processing and Management: an International Journal
C4.5: programs for machine learning
C4.5: programs for machine learning
Mining association rules between sets of items in large databases
SIGMOD '93 Proceedings of the 1993 ACM SIGMOD international conference on Management of data
Combinatorial pattern discovery for scientific data: some preliminary results
SIGMOD '94 Proceedings of the 1994 ACM SIGMOD international conference on Management of data
The nature of statistical learning theory
The nature of statistical learning theory
Making large-scale support vector machine learning practical
Advances in kernel methods
Growing decision trees on support-less association rules
Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
FreeSpan: frequent pattern-projected sequential pattern mining
Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Enlarging the Margins in Perceptron Decision Trees
Machine Learning
SPADE: an efficient algorithm for mining frequent sequences
Machine Learning
Mining long sequential patterns in a noisy environment
Proceedings of the 2002 ACM SIGMOD international conference on Management of data
A Tutorial on Support Vector Machines for Pattern Recognition
Data Mining and Knowledge Discovery
Text Categorization with Suport Vector Machines: Learning with Many Relevant Features
ECML '98 Proceedings of the 10th European Conference on Machine Learning
ICDE '95 Proceedings of the Eleventh International Conference on Data Engineering
Fast Algorithms for Mining Association Rules in Large Databases
VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases
EP '98 Proceedings of the 7th International Conference on Evolutionary Programming VII
Sequential PAttern mining using a bitmap representation
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Heterogeneous Learner for Web Page Classification
ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
Frequent-subsequence-based prediction of outer membrane proteins
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Scalable sequential pattern mining for biological sequences
Proceedings of the thirteenth ACM international conference on Information and knowledge management
Applying sequential rules to protein localization prediction
Computers & Mathematics with Applications
Kernel-based learning for biomedical relation extraction
Journal of the American Society for Information Science and Technology
g-MARS: Protein Classification Using Gapped Markov Chains and Support Vector Machines
PRIB '08 Proceedings of the Third IAPR International Conference on Pattern Recognition in Bioinformatics
An Automatic Video Text Detection, Localization and Extraction Approach
Advanced Internet Based Systems and Applications
Classifying proteins using gapped Markov feature pairs
Neurocomputing
The forecasting model based on modified SVRM and PSO penalizing Gaussian noise
Expert Systems with Applications: An International Journal
Multi-appliance recognition system with hybrid SVM/GMM classifier in ubiquitous smart home
Information Sciences: an International Journal
Hi-index | 0.00 |
We study the localization prediction of membrane proteins for two families of medically important disease-causing bacteria, called Gram-Negative and Gram-Positive bacteria. Each such bacterium has its cell surrounded by several layers of membranes. Identifying where proteins are located in a bacterial cell is of primary research interest for antibiotic and vaccine drug design. This problem has three requirements: First, with any subsequence of amino acid residues being potentially a dimension, it has an extremely high dimensionality, few being irrelevant. Second, the prediction of a target localization site must have a high precision in order to be useful to biologists, i.e., at least 90 percent or even 95 percent, while recall is as high as possible. Achieving such a precision is made harder by the fact that target sequences are often much fewer than background sequences. Third, the rationale of prediction should be understandable to biologists for taking actions. Meeting all these requirements presents a significant challenge in that a high dimensionality requires a complex model that is often hard to understand. The support vector machine (SVM) model has an outstanding performance in a high-dimensional space, therefore, it addresses the first two requirements. However, the SVM model involves many features in a single kernel function, therefore, it does not address the third requirement. We address all three requirements by integrating the SVM model with a rule-based model, where the understandable if-then rules capture "major structures驴 and the elaborated SVM model captures "subtle structures.驴 Importantly, the integrated model preserves the precision/recall performance of SVM and, at the same time, exposes major structures in a form understandable to the human user. We focus on searching for high quality rules and partitioning the prediction between rules and SVM so as to achieve these properties. We evaluate our method on several membrane localization problems. The purpose of this paper is not improving the precision/recall of SVM, but is manifesting the rationale of a SVM classifier through partitioning the classification between if-then rules and the SVM classifier and preserving the precision/recall of SVM.