An Entity Resolution Framework for Deduplicating Proteins

Authors:
Lucas Lochovsky;Thodoros Topaloglou
Affiliations:
Department of Computer Science, University of Toronto;Department of Computer Science, University of Toronto
Venue:
DILS '08 Proceedings of the 5th international workshop on Data Integration in the Life Sciences
Year:
2008

Abstract

An important prerequisite to successfully integrating protein data is detecting duplicate records spread across different databases. In this paper, we describe a new framework for protein entity resolution, called PERF, which deduplicates protein mentions using a wide range of protein attributes. A mention refers to any recorded information about a protein, whether it is derived from a database, a high-throughput study, or literature text mining, among others. PERF can be easily extended to deduplicate protein-protein interactions (PPIs) as well. This framework translates mentions into instances of a reference schema to facilitate mention comparisons. PERF also uses "virtual attribute dependencies" to "enhance" mentions with additional attribute values. PERF computes a likelihood measure based upon the textual value similarity of mention attributes. A prototype implementation of the framework was tested, and these tests indicate that PERF can clearly separate duplicate mentions from non-duplicate mentions.
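
The pipeline outlined in the abstract (translate mentions into a reference schema, enhance them with derived attribute values, then score textual similarity) can be illustrated with a minimal Python sketch. This is not PERF's implementation. The schema attributes, the `to_schema`, `enhance`, and `similarity` helpers, the lookup-table form of the dependencies, and the use of `difflib.SequenceMatcher` with an unweighted mean are all assumptions made for illustration; the abstract does not specify the actual schema or likelihood measure.

```python
from difflib import SequenceMatcher

# Hypothetical reference-schema attributes; the paper's schema is not given here.
SCHEMA = ("name", "organism", "sequence")


def to_schema(raw, mapping):
    """Translate a source-specific record into the reference schema."""
    return {
        attr: raw[src]
        for src, attr in mapping.items()
        if src in raw and attr in SCHEMA
    }


def enhance(mention, dependencies):
    """Fill in missing attributes from known ones.

    A stand-in for "virtual attribute dependencies": each entry maps
    (known_attr, target_attr) to a lookup table of known -> target values.
    """
    enriched = dict(mention)
    for (known_attr, target_attr), lookup in dependencies.items():
        if target_attr not in enriched and known_attr in enriched:
            value = lookup.get(enriched[known_attr])
            if value is not None:
                enriched[target_attr] = value
    return enriched


def similarity(a, b):
    """Mean textual similarity over attributes present in both mentions."""
    shared = [k for k in SCHEMA if k in a and k in b]
    if not shared:
        return 0.0
    scores = [
        SequenceMatcher(None, a[k].lower(), b[k].lower()).ratio()
        for k in shared
    ]
    return sum(scores) / len(scores)


# Illustrative mentions from two hypothetical sources with different field names.
m1 = to_schema(
    {"protein_name": "Tumor protein p53", "species": "Homo sapiens"},
    {"protein_name": "name", "species": "organism"},
)
m2 = to_schema({"title": "tumor protein P53"}, {"title": "name"})

# Assumed dependency: the protein name determines the organism.
deps = {("name", "organism"): {"tumor protein P53": "Homo sapiens"}}
m2 = enhance(m2, deps)

print(similarity(m1, m2))
```

In this sketch the second mention gains an `organism` value through the dependency table, so the two mentions are compared on two attributes rather than one. A real system would presumably weight attributes differently and choose a threshold separating duplicates from non-duplicates.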