Inferring Correlation Between Database Queries: Analysis of Protein Sequence Patterns

Authors: R. Guigó; T. F. Smith
Venue: IEEE Transactions on Pattern Analysis and Machine Intelligence
Year: 1993

Abstract

Given a subset P of a database, this paper addresses the problem of finding the query phi over a given database attribute whose extension is closest to P. In the particular case outlined here, P is the set of sequences in a protein sequence database that match a given protein sequence pattern, and phi is a query over the database annotation. Ideally, phi describes a biological function. If the extension of phi is very similar to P, an association between the pattern and the biological function described by the query may be inferred. An algorithm is developed that searches the query space efficiently when negation is not considered. Since the query language is a first-order language, the query space may be mapped into a set algebra in which a measure of stochastic dependence (an asymptotic approximation of the correlation coefficient) serves as the measure of set similarity. The algorithm uses the algebraic properties of this measure to reduce the time required to search the query space. A prototype implementation of the algorithm has been tested on different collections of protein sequence patterns.
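
To make the idea of scoring a query by the similarity of its extension to P concrete, the following is a minimal sketch, not the authors' algorithm. It assumes the ordinary Pearson (phi) correlation between set-membership indicators in place of the asymptotic approximation mentioned in the abstract, restricts queries to single annotation keywords, and searches them exhaustively without the algebraic pruning the paper describes. The toy records, keywords, and function names (set_correlation, best_keyword_query) are hypothetical.

```python
from math import sqrt


def set_correlation(p, q, n):
    """Pearson (phi) correlation between membership in p and membership in q,
    where p and q are subsets of a universe of n records.
    Assumption: this stands in for the paper's asymptotic measure."""
    a, b, both = len(p), len(q), len(p & q)
    denom = sqrt(a * (n - a) * b * (n - b))
    if denom == 0:
        return 0.0  # one set is empty or covers the whole universe
    return (n * both - a * b) / denom


def best_keyword_query(annotations, pattern_matches):
    """Exhaustively score every single-keyword query (no negation, no
    conjunctions) and return the (keyword, score) pair with the highest
    correlation to the set of pattern matches."""
    n = len(annotations)
    keywords = set().union(*annotations.values())
    best = None
    for kw in sorted(keywords):
        # Extension of the query: all records annotated with this keyword.
        extension = {rec for rec, kws in annotations.items() if kw in kws}
        score = set_correlation(pattern_matches, extension, n)
        if best is None or score > best[1]:
            best = (kw, score)
    return best


# Hypothetical toy database: record id -> annotation keywords.
ANNOTATIONS = {
    "seq1": {"kinase", "atp-binding"},
    "seq2": {"kinase", "membrane"},
    "seq3": {"kinase", "atp-binding"},
    "seq4": {"protease"},
    "seq5": {"membrane"},
    "seq6": {"protease", "membrane"},
}
# Hypothetical set P: records matching some sequence pattern.
PATTERN_MATCHES = {"seq1", "seq2", "seq3"}

if __name__ == "__main__":
    print(best_keyword_query(ANNOTATIONS, PATTERN_MATCHES))
```

In this toy data the keyword "kinase" has exactly the same extension as P, so it receives the maximum score of 1, while keywords whose extensions are disjoint from P score negatively. A realistic query space also includes combinations of terms, which is where an exhaustive scan becomes expensive and a pruned search such as the one described in the abstract would matter.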