The method of N-grams in large-scale clustering of DNA texts

Authors:
Z. Volkovich; V. Kirzhner; A. Bolshoy; E. Nevo; A. Korol
Affiliations:
Department of Software Engineering, ORT Braude College, P.O. Box 78, Karmiel, 20101, Israel; Institute of Evolution, University of Haifa, Mount Carmel, Haifa 31905, Israel; Institute of Evolution, University of Haifa, Mount Carmel, Haifa 31905, Israel; Institute of Evolution, University of Haifa, Mount Carmel, Haifa 31905, Israel; Institute of Evolution, University of Haifa, Mount Carmel, Haifa 31905, Israel
Venue:
Pattern Recognition
Year:
2005

Abstract

This paper addresses techniques for clustering texts by comparing their vocabularies of N-grams. In contrast to the regular N-gram approach, the proposed method counts imperfect occurrences of N-grams in a text, allowing up to a given number of mismatches. We demonstrate that this approach substantially improves the resolving capacity of the N-gram method for DNA texts. Additionally, we discuss a scheme for the joint use of different types of clustering techniques to verify partition quality.
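
The idea of counting imperfect occurrences can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define the mismatch rule or the comparison measure, so the sketch assumes that a mismatch is a single-letter substitution (Hamming distance), that the alphabet is ACGT, and that two vocabularies are compared by an L1 distance between normalized counts. The function names `imperfect_counts` and `profile_distance` are hypothetical.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def imperfect_counts(text: str, n: int, max_mismatches: int) -> dict[str, int]:
    """For every possible N-gram over ACGT, count the length-n windows of
    `text` that match it with at most `max_mismatches` substitutions.
    (Assumption: mismatches are measured by Hamming distance.)"""
    windows = [text[i:i + n] for i in range(len(text) - n + 1)]
    return {
        gram: sum(hamming(gram, w) <= max_mismatches for w in windows)
        for gram in ("".join(p) for p in product("ACGT", repeat=n))
    }


def profile_distance(p: dict[str, int], q: dict[str, int]) -> float:
    """L1 distance between two count profiles after normalizing each to
    frequencies. (Assumption: the abstract does not name a measure.)"""
    total_p = sum(p.values()) or 1
    total_q = sum(q.values()) or 1
    return sum(abs(p[g] / total_p - q[g] / total_q) for g in p)
```

With `max_mismatches=0` the counts reduce to ordinary N-gram counting, so the same function covers both the regular and the imperfect variant. A matrix of pairwise `profile_distance` values between sequences could then be passed to any standard clustering procedure.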