An evaluation of text retrieval methods for similarity search of multi-dimensional NMR-spectra

Authors:
Alexander Hinneburg;Andrea Porzel;Karina Wolfram
Affiliations:
Institute of Computer Science, Martin-Luther-University of Halle-Wittenberg, Germany;Leibniz Institute of Plant Biochemistry, Germany;Leibniz Institute of Plant Biochemistry, Germany
Venue:
BIRD'07 Proceedings of the 1st international conference on Bioinformatics research and development
Year:
2007

Abstract

Searching and mining nuclear magnetic resonance (NMR) spectra of naturally occurring substances is an important task in the investigation of new, potentially useful chemical compounds. Multi-dimensional NMR-spectra are relational objects like documents, but consist of continuous multi-dimensional points called peaks instead of words. We develop several mappings from continuous NMR-spectra to discrete text-like data. With the help of these mappings, any text retrieval method can be applied. We evaluate the performance of two retrieval methods, namely the standard vector space model and probabilistic latent semantic indexing (PLSI). PLSI learns hidden topics in the data, which in the case of 2D-NMR data is interesting in its own right. Additionally, we develop and evaluate a simple direct similarity function that can detect duplicates of NMR-spectra. Our experiments show that the vector space model as well as PLSI, both of which were designed for text data created by humans, can effectively handle the mapped NMR-data originating from natural products. Moreover, PLSI is able to find meaningful "topics" in the NMR-data.
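
The idea of mapping continuous peaks to text-like data and then applying a text retrieval method can be illustrated with a minimal sketch. The abstract does not specify the mappings used in the paper, so everything below is an assumption made for illustration: the fixed grid mapping, the cell widths (h_width, c_width), the token format, the smoothed TF-IDF weighting, and the three example spectra with made-up peak positions. It shows one plausible way to turn 2D peaks into "words" and compare spectra with cosine similarity in a vector space model.

```python
import math
from collections import Counter


def peaks_to_words(peaks, h_width=0.5, c_width=5.0):
    """Map each (1H ppm, 13C ppm) peak to the token of its grid cell.

    The grid widths are illustrative assumptions, not values from the paper.
    """
    return [
        f"h{math.floor(h / h_width)}_c{math.floor(c / c_width)}"
        for h, c in peaks
    ]


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) from lists of tokens."""
    n = len(docs)
    # Document frequency: number of spectra containing each token.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF keeps every weight strictly positive.
        vectors.append(
            {w: tf[w] * (math.log((1 + n) / (1 + df[w])) + 1.0) for w in tf}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b[w] for w, weight in a.items() if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical spectra with made-up peak positions (1H ppm, 13C ppm).
spectra = {
    "A": [(1.20, 22.1), (3.65, 61.0), (7.10, 128.4)],
    "B": [(1.25, 22.8), (3.70, 61.9), (7.30, 129.0)],  # A with small shifts
    "C": [(0.90, 14.0), (2.30, 34.5), (5.40, 130.2)],  # unrelated peaks
}

names = list(spectra)
docs = [peaks_to_words(spectra[name]) for name in names]
vec_a, vec_b, vec_c = tfidf_vectors(docs)

sim_ab = cosine(vec_a, vec_b)  # A and B fall into the same grid cells
sim_ac = cosine(vec_a, vec_c)  # A and C share no grid cell
print(f"sim(A, B) = {sim_ab:.3f}, sim(A, C) = {sim_ac:.3f}")
```

In this sketch, spectra A and B map to the same set of tokens despite their small peak shifts, while C shares no token with A. A hard grid has an obvious weakness: two peaks that lie close together but on opposite sides of a cell boundary receive different tokens, so the choice of mapping matters for retrieval quality.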