We consider the problem of classification in noisy, high-dimensional, and class-imbalanced protein datasets. To design a complete classification system, we use a three-stage machine learning framework consisting of a feature selection stage, a method for addressing noise and class imbalance, and a method for combining biologically related tasks through prior-knowledge-based clustering. In the first stage, we employ Fisher's permutation test as a feature selection filter. Comparisons with alternative criteria show that it may be favorable for typical protein datasets. In the second stage, noise and class imbalance are addressed by minority-class over-sampling, majority-class under-sampling, and ensemble learning. The performance of logistic regression models, decision trees, and neural networks is systematically evaluated. The experimental results show that in many cases ensembles of logistic regression classifiers may outperform more expressive models owing to their robustness to noise and to the low sample density in a high-dimensional feature space. For large datasets, however, ensembles of neural networks may be the best solution. In the third stage, we use prior knowledge to partition unlabeled data such that the class distributions among non-overlapping clusters differ significantly. In our experiments, training classifiers specialized to the class distribution of each cluster further reduced the classification error.
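As an illustration of the first-stage filter, a two-sample permutation test can be sketched as follows. This is a minimal sketch, not the paper's implementation: it assumes binary labels and uses the absolute difference of class means as the per-feature test statistic; the function name, the significance threshold `alpha`, and the permutation count are our own illustrative choices.

```python
import numpy as np

def permutation_test_filter(X, y, n_perm=1000, alpha=0.05, seed=None):
    """Select features via a two-sample permutation test.

    For each feature, the observed statistic is the absolute difference
    between class means. Labels are permuted n_perm times to build a
    null distribution, and features whose p-value falls below alpha
    are retained. (Illustrative sketch; thresholds are assumptions.)
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Observed statistic per feature: |mean(class 1) - mean(class 0)|.
    obs = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))

    # Count how often a label permutation produces an equal or larger statistic.
    exceed = np.zeros(X.shape[1])
    for _ in range(n_perm):
        perm = rng.permutation(y)
        stat = np.abs(X[perm == 1].mean(axis=0) - X[perm == 0].mean(axis=0))
        exceed += stat >= obs

    # Add-one smoothing gives a valid p-value even with zero exceedances.
    pvals = (exceed + 1.0) / (n_perm + 1.0)
    return np.flatnonzero(pvals <= alpha), pvals
```

Because each feature is tested independently against a label-permutation null, the filter makes no model assumptions, which is one reason such tests suit noisy, high-dimensional protein data; in practice a multiple-testing correction (e.g. Bonferroni) would be applied to `alpha` when the feature count is large.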