Fast approximate duplicate detection for 2D-NMR spectra

  • Authors:
  • Björn Egert;Steffen Neumann;Alexander Hinneburg

  • Affiliations:
  • Leibniz Institute of Plant Biochemistry, Department of Stress and Developmental Biology, Germany;Leibniz Institute of Plant Biochemistry, Department of Stress and Developmental Biology, Germany;Institute of Computer Science, Martin-Luther-University of Halle-Wittenberg, Germany

  • Venue:
  • DILS'07 Proceedings of the 4th international conference on Data integration in the life sciences
  • Year:
  • 2007

Abstract

2D nuclear magnetic resonance (NMR) spectroscopy is a powerful analytical method for elucidating the chemical structure of molecules. In contrast to 1D-NMR spectra, 2D-NMR spectra correlate the chemical shifts of ¹H and ¹³C simultaneously. Curating or merging large spectral libraries requires duplicate detection that is both robust and fast. We propose a definition of duplicates with the robustness properties that 2D-NMR experiments require. A major gain in runtime performance over previously proposed heuristics is achieved by mapping the spectra to simple discrete objects, and we propose several data transformations suited to this task. To compensate for slight variations in the mapped spectra, we use hash functions chosen according to the locality-sensitive hashing scheme and identify duplicates by hash collisions.
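The abstract does not spell out the transformations or hash functions used, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a spectrum is given as a list of (¹H ppm, ¹³C ppm) peak positions, maps each peak to a cell of a fixed grid (a hypothetical choice of "simple discrete object"), and uses min-hashing with banding as one possible locality-sensitive scheme. The names `discretize`, `minhash_signature`, `candidate_duplicates` and the step sizes are assumptions made for illustration.

```python
import hashlib
import math
from collections import defaultdict


def discretize(peaks, h_step=0.1, c_step=1.0):
    """Map (1H ppm, 13C ppm) peaks to a set of integer grid cells.

    The grid step sizes are illustrative assumptions, not values from the paper.
    """
    return {(math.floor(h / h_step), math.floor(c / c_step)) for h, c in peaks}


def _cell_hash(seed, cell):
    """Deterministic 64-bit hash of a grid cell, parameterised by a seed."""
    data = f"{seed}:{cell[0]}:{cell[1]}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(cells, n_hashes=16):
    """Min-hash signature: similar cell sets agree in many positions."""
    if not cells:
        raise ValueError("spectrum has no peaks")
    return tuple(min(_cell_hash(seed, c) for c in cells) for seed in range(n_hashes))


def candidate_duplicates(spectra, n_hashes=16, band_size=2):
    """Return pairs of spectrum names whose signatures collide in any band."""
    buckets = defaultdict(set)
    for name, peaks in spectra.items():
        sig = minhash_signature(discretize(peaks), n_hashes)
        for start in range(0, n_hashes, band_size):
            buckets[(start, sig[start:start + band_size])].add(name)

    pairs = set()
    for names in buckets.values():
        ordered = sorted(names)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs.add((first, second))
    return pairs
```

Under these assumptions, two spectra whose peaks fall into the same grid cells always collide, and spectra with mostly overlapping cells collide with high probability. A fixed grid can still separate two nearly identical peaks that straddle a cell boundary, which is one reason a real system would need more carefully designed transformations than this sketch.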