Validating associations in biological databases

Authors:
Francisco M. Couto;Mário J. Silva;Pedro M. Coutinho
Affiliations:
University of Lisboa;University of Lisboa;Centre National de la Recherche Scientifique
Venue:
CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Year:
2006

Citing 5
Cited 0

Java language reference

Java language reference
Classifying biological articles using web resources

Proceedings of the 2004 ACM symposium on Applied computing
Untangling text data mining

ACL '99 Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics
Semantic similarity over the gene ontology: family correlation and selecting disjunctive ancestors

Proceedings of the 14th ACM international conference on Information and knowledge management
Using information content to evaluate semantic similarity in a taxonomy

IJCAI'95 Proceedings of the 14th international joint conference on Artificial intelligence - Volume 1

Quantified Score

Hi-index	0.00

Visualization

Abstract

Erroneous data can often be found in databases, and detecting it is normally a non-trivial task. For example, To cope with the large amount of biological sequences being produced, a significant number of genes and proteins have been annotated by automated tools. A protein annotation is an association between a protein and a term describing its role. These tools have produced a significant number of misannotations that are now present in biological databases. This paper proposes a new method for automatically scoring associations by comparing them to preexisting curated associations. An association is a pair that links two entities. The score can be used to filter incorrect or uncommon associations.We evaluated the method using the automated protein annotations submitted to BioCreAtIvE, an international evaluation of state-of-the-art text-mining systems in Biology. The method scored each of these annotations and those scored below a certain threshold were discarded. The results have shown a small trade-off in recall for a large improvement in precision. For example, we were able to discard 44.6%, 66.8% and 81% of the misannotations, maintaining 96.9%, 84.2%, and 47.8% of the correct annotations, respectively. Moreover, we were able to outperform each individual submission to BioCreAtIvE by proper adjustment of the threshold.