Statistical distribution of chemical fingerprints

Authors:
S. Joshua Swamidass; Pierre Baldi
Affiliations:
Department of Computer Science, Institute for Genomics and Bioinformatics, University of California, Irvine, CA; Department of Computer Science, Institute for Genomics and Bioinformatics, University of California, Irvine, CA
Venue:
WILF'05 Proceedings of the 6th international conference on Fuzzy Logic and Applications
Year:
2005

Abstract

Binary fingerprints are binary vectors used to represent chemical molecules by recording the presence or absence of particular substructures, such as labeled paths in the 2D graph of bonds. Complete fingerprints are often reduced to a compressed format, typically of dimension n = 512 or n = 1024, by a simple congruence (modulo) operation. The statistical properties of complete and compressed fingerprint representations are important because fingerprints are used to rapidly search large databases and to develop statistical machine learning methods in chemoinformatics. Here we present an empirical and mathematical analysis of the distribution of complete and compressed fingerprints. In particular, we derive formulas that provide good approximations for the expected number of bits set to one in a compressed fingerprint, given its uncompressed version, and vice versa.
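
As a rough illustration of the compression step and of the kind of estimate described above, the following Python sketch folds a fingerprint by taking bit indices modulo n and computes a textbook occupancy approximation. The abstract does not state the authors' formulas, so the expressions here are an assumption: they treat the m bits set in the uncompressed fingerprint as landing uniformly and independently among the n compressed positions, which gives an expected count of n(1 - (1 - 1/n)^m) and, by inversion, an estimate of m from an observed count k. The function names are hypothetical.

```python
import math


def fold(on_bits, n):
    """Compress a fingerprint to length n with a modulo (congruence) operation.

    on_bits is an iterable of indices of the bits set to one in the
    uncompressed fingerprint. The result is the set of indices set to one
    in the compressed fingerprint.
    """
    return {i % n for i in on_bits}


def expected_compressed_bits(m, n):
    """Approximate expected number of one-bits after folding m one-bits into n positions.

    Assumption (not taken from the paper): each one-bit lands uniformly and
    independently in one of the n positions, the classical occupancy model.
    """
    return n * (1.0 - (1.0 - 1.0 / n) ** m)


def estimated_uncompressed_bits(k, n):
    """Invert the occupancy approximation: estimate m from k observed one-bits.

    Only defined for k < n, since a fully saturated fingerprint carries no
    information about m under this model.
    """
    if not 0 <= k < n:
        raise ValueError("k must satisfy 0 <= k < n")
    return math.log(1.0 - k / n) / math.log(1.0 - 1.0 / n)


if __name__ == "__main__":
    n = 512
    # Three distinct substructure indices that collide after folding.
    print(fold({3, 515, 1027}, n))
    # Expected occupancy when 300 bits are set before compression.
    print(expected_compressed_bits(300, n))
```

Under this model the two estimates are exact inverses of each other, so feeding the output of expected_compressed_bits back into estimated_uncompressed_bits recovers m up to floating-point error. Real fingerprints need not satisfy the uniformity assumption, so the numbers should be read as a first-order guide only.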