On optimizing the non-metric similarity search in tandem mass spectra by clustering

Authors:
Jiří Novák;David Hoksza;Jakub Lokoč;Tomáš Skopal
Affiliations:
Siret Research Group, Faculty of Mathematics and Physics, Charles University in Prague, Prague, Czech Republic (all authors)
Venue:
ISBRA'12 Proceedings of the 8th international conference on Bioinformatics Research and Applications
Year:
2012

Citing 9
Cited 1

M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
Similarity Search: The Metric Space Approach (Advances in Database Systems)
A fast coarse filtering method for peptide identification by mass spectrometry. Bioinformatics
Speeding up tandem mass spectrometry database search: metric embeddings and fast near neighbor search. Bioinformatics
Unified framework for fast exact and approximate search in dissimilarity spaces. ACM Transactions on Database Systems (TODS)
NM-Tree: Flexible Approximate Similarity Search in Metric and Non-metric Spaces. DEXA '08 Proceedings of the 19th international conference on Database and Expert Systems Applications
An inverted index for mass spectra similarity query and comparison with a metric-space method: case study. Proceedings of the Third International Conference on SImilarity Search and APplications
Non-metric similarity search of tandem mass spectra including posttranslational modifications. Journal of Discrete Algorithms
Survey of clustering algorithms. IEEE Transactions on Neural Networks

SimTandem: similarity search in tandem mass spectra. SISAP'12 Proceedings of the 5th international conference on Similarity Search and Applications

Abstract

Tandem mass spectrometry is a well-known technique for identifying protein sequences from an "in vitro" sample. To identify sequences from the spectra captured by a spectrometer, similarity search in a database of hypothetical mass spectra is often used; the hypothetical spectra are generated from a database of known protein sequences. Since the number of sequences in these databases grows rapidly over time, several approaches have been proposed for indexing databases of mass spectra. In this paper, we improve an approach based on non-metric similarity search, in which the M-tree and the TriGen algorithm are employed for fast approximate search. We show that preprocessing the mass spectra by clustering speeds up sequence identification more than 100× compared with a sequential scan of the entire database. Moreover, when the protein candidates are refined by a sequential scan in a postprocessing step, the whole approach achieves precision similar to that of a sequential scan over the entire database (over 90%).
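
The general idea of clustering first and then refining candidates by a sequential scan can be illustrated with a small sketch. This is not the method from the paper, which relies on the M-tree and TriGen; the sketch swaps in a plain greedy leader clustering, and the spectrum representation, the `peak_dissimilarity` function, and the `radius`, `tol`, and `n_clusters` parameters are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# Assumption: a spectrum is just a tuple of fragment m/z values.
Spectrum = Tuple[float, ...]
Dist = Callable[[Spectrum, Spectrum], float]
Cluster = Tuple[Spectrum, List[Spectrum]]  # (leader, members)


def peak_dissimilarity(a: Spectrum, b: Spectrum, tol: float = 0.5) -> float:
    """Hypothetical dissimilarity: 1 - (peaks of a matched in b / larger peak count).

    It is not guaranteed to be symmetric or to satisfy the triangle
    inequality, so it is a non-metric measure.
    """
    if not a or not b:
        return 1.0
    shared = sum(1 for x in a if any(abs(x - y) <= tol for y in b))
    return 1.0 - shared / max(len(a), len(b))


def leader_clusters(db: List[Spectrum], dist: Dist, radius: float) -> List[Cluster]:
    """Greedy leader clustering: a spectrum joins the first leader within radius."""
    clusters: List[Cluster] = []
    for s in db:
        for leader, members in clusters:
            if dist(leader, s) <= radius:
                members.append(s)
                break
        else:
            # No leader is close enough, so s starts a new cluster.
            clusters.append((s, [s]))
    return clusters


def search(query: Spectrum, clusters: List[Cluster], dist: Dist,
           n_clusters: int = 1) -> Spectrum:
    """Filter by comparing the query to leaders only, then scan the candidates."""
    ranked = sorted(clusters, key=lambda c: dist(query, c[0]))
    candidates = [m for _, members in ranked[:n_clusters] for m in members]
    # Refinement step: sequential scan restricted to the candidate set.
    return min(candidates, key=lambda m: dist(query, m))


db = [
    (100.0, 200.0, 300.0),
    (100.1, 200.1, 300.2),
    (500.0, 600.0, 700.0),
    (500.2, 600.1, 700.3),
]
clusters = leader_clusters(db, peak_dissimilarity, radius=0.5)
best = search((500.1, 600.0, 700.2), clusters, peak_dissimilarity)
print(len(clusters), best)
```

In this toy setting the query is compared against two leaders and two cluster members instead of all four database spectra. On a large database the saving depends on how well the clusters separate, and the result is approximate whenever the true nearest spectrum lies outside the inspected clusters.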