Fast approximate duplicate detection for 2D-NMR spectra

Authors:
Björn Egert;Steffen Neumann;Alexander Hinneburg
Affiliations:
Leibniz Institute of Plant Biochemistry, Department of Stress and Developmental Biology, Germany;Leibniz Institute of Plant Biochemistry, Department of Stress and Developmental Biology, Germany;Institute of Computer Science, Martin-Luther-University of Halle-Wittenberg, Germany
Venue:
DILS'07 Proceedings of the 4th international conference on Data integration in the life sciences
Year:
2007

Abstract

2D nuclear magnetic resonance (2D-NMR) spectroscopy is a powerful analytical method for elucidating the chemical structure of molecules. In contrast to 1D-NMR spectra, 2D-NMR spectra correlate the chemical shifts of 1H and 13C simultaneously. To curate or merge large spectral libraries, robust and fast duplicate detection is needed. We propose a definition of duplicates that has the robustness properties required for 2D-NMR experiments. A major gain in runtime performance over previously proposed heuristics is achieved by mapping the spectra to simple discrete objects, and we propose several data transformations suited to this task. To compensate for slight variations in the mapped spectra, we use hashing functions that follow the locality-sensitive hashing scheme and identify duplicates by hash collisions.
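
The general idea of mapping spectra to discrete objects and finding duplicates through hash collisions can be illustrated with a minimal sketch. It is not the transformation or hash family used in the paper, which the abstract does not specify. As assumptions for illustration only, each spectrum is taken to be a list of (1H ppm, 13C ppm) peak positions, the discrete object is the set of grid cells the peaks fall into, and the locality-sensitive hash is a banded min-hash over that set. All function names, grid step sizes, and band settings are hypothetical.

```python
import hashlib
import math
from collections import defaultdict
from itertools import combinations


def discretize(peaks, h_step=0.05, c_step=0.5):
    """Map (1H ppm, 13C ppm) peaks to a set of integer grid cells.

    The step sizes are illustrative assumptions, not values from the paper.
    """
    return frozenset(
        (math.floor(h / h_step), math.floor(c / c_step)) for h, c in peaks
    )


def _cell_hash(cell, seed):
    """Deterministic 64-bit hash of one grid cell under a given seed."""
    data = f"{seed}:{cell[0]}:{cell[1]}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(cells, num_hashes):
    """Min-hash signature: the smallest hash value per seeded hash function."""
    return tuple(min(_cell_hash(c, s) for c in cells) for s in range(num_hashes))


def candidate_duplicates(spectra, bands=16, rows=2):
    """Return pairs of spectrum names that collide in at least one band."""
    buckets = defaultdict(list)
    for name, peaks in spectra.items():
        cells = discretize(peaks)
        if not cells:  # a spectrum without peaks has no signature
            continue
        sig = minhash_signature(cells, bands * rows)
        for b in range(bands):
            # Each band of the signature acts as one hash-table key.
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(name)

    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(set(names)), 2))
    return pairs
```

Calling `candidate_duplicates` with a dictionary that maps spectrum names to peak lists returns the name pairs that share at least one bucket. Only those pairs would then need an exact comparison, which is where the runtime gain over all-pairs comparison comes from. A fixed grid is the simplest possible discretization: two peaks that differ only slightly can still land in neighbouring cells, so a practical transformation would need to account for such boundary effects.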