Comparative analysis of the use of chemoinformatics-based and substructure-based descriptors for quantitative structure-activity relationship (QSAR) modeling

Authors:
Thashmee Karunaratne;Henrik Boström;Ulf Norinder
Affiliations:
Department of Computer and Systems Sciences, Stockholm University, Kista, Sweden;Department of Computer and Systems Sciences, Stockholm University, Kista, Sweden;AstraZeneca Research and Development, Södertälje, Sweden and Department of Pharmacy, Uppsala University, Uppsala, Sweden and Department of Computational Chemistry, H. Lundbeck A/S, Valby ...
Venue:
Intelligent Data Analysis
Year:
2013

Abstract

Quantitative structure-activity relationship (QSAR) models have gained popularity in the pharmaceutical industry owing to their potential to substantially decrease drug development costs by reducing the need for expensive laboratory and clinical tests. QSAR modeling consists of two fundamental steps: descriptor discovery and model building. Descriptor discovery methods are either based on chemical domain knowledge (chemoinformatics-based) or purely data-driven (substructure-based). The two families of methods have been developed largely independently, and as a consequence evaluations involving both types of descriptor discovery method are rarely seen. In this study, a comparative analysis of chemoinformatics-based and substructure-based approaches is presented. Two chemoinformatics-based approaches, ECFI and SELMA, are compared to five approaches for substructure discovery, namely CP, graphSig, MFI, MoFa and SUBDUE, using 18 QSAR datasets. The empirical investigation shows that one of the chemoinformatics-based approaches, ECFI, results in significantly more accurate models than all other methods when each method is used on its own. Results from combining descriptor sets are also presented: adding ECFI descriptors to any other descriptor set improves the predictive performance of that set, while models built on ECFI descriptors can in many cases also be improved by adding descriptors generated by the other methods.
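
To make the idea of combining descriptor sets concrete, the following is a minimal sketch, not the authors' pipeline. It assumes each compound is already described by a set of feature names from two sources (toy stand-ins for fingerprint bits and mined substructures), turns each source into a binary descriptor matrix, and combines the two by concatenating columns. All function names, feature names, and the `min_support` threshold are hypothetical; none of ECFI, SELMA, CP, graphSig, MFI, MoFa or SUBDUE is implemented here.

```python
from collections import Counter


def frequent_features(feature_sets, min_support):
    """Return, in sorted order, the features present in at least min_support compounds."""
    counts = Counter(f for fs in feature_sets for f in set(fs))
    return sorted(f for f, c in counts.items() if c >= min_support)


def to_matrix(feature_sets, vocabulary):
    """Build a binary descriptor matrix: one row per compound, one column per feature."""
    return [[1 if f in fs else 0 for f in vocabulary] for fs in feature_sets]


def combine(matrix_a, matrix_b):
    """Combine two descriptor sets by concatenating their columns, row by row."""
    if len(matrix_a) != len(matrix_b):
        raise ValueError("descriptor sets must describe the same compounds")
    return [row_a + row_b for row_a, row_b in zip(matrix_a, matrix_b)]


# Toy, made-up data for four compounds (assumption: not taken from the paper).
fingerprint_bits = [{"fp1", "fp7"}, {"fp1"}, {"fp7", "fp9"}, {"fp1", "fp9"}]
mined_fragments = [{"C-N", "C=O"}, {"C-N"}, {"C=O"}, {"C-N", "C=O", "C-Cl"}]

fp_vocab = frequent_features(fingerprint_bits, min_support=2)
frag_vocab = frequent_features(mined_fragments, min_support=2)

X_fp = to_matrix(fingerprint_bits, fp_vocab)
X_frag = to_matrix(mined_fragments, frag_vocab)
X_combined = combine(X_fp, X_frag)
```

The combined matrix could then be passed to any standard learner in the model-building step; which learner is used is a separate choice that this sketch leaves open.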