An Entity Resolution Framework for Deduplicating Proteins

  • Authors:
  • Lucas Lochovsky; Thodoros Topaloglou

  • Affiliations:
  • Department of Computer Science, University of Toronto; Department of Computer Science, University of Toronto

  • Venue:
  • DILS '08: Proceedings of the 5th International Workshop on Data Integration in the Life Sciences
  • Year:
  • 2008

Abstract

An important prerequisite to successfully integrating protein data is detecting duplicate records spread across different databases. In this paper, we describe a new framework for protein entity resolution, called PERF, which deduplicates protein mentions using a wide range of protein attributes. A mention refers to any recorded information about a protein, whether it is derived from a database, a high-throughput study, or literature text mining, among others. PERF can be easily extended to deduplicate protein-protein interactions (PPIs) as well. This framework translates mentions into instances of a reference schema to facilitate mention comparisons. PERF also uses "virtual attribute dependencies" to "enhance" mentions with additional attribute values. PERF computes a likelihood measure based upon the textual value similarity of mention attributes. A prototype implementation of the framework was tested, and these tests indicate that PERF can clearly separate duplicate mentions from non-duplicate mentions.
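
The pipeline the abstract outlines (translate mentions into a reference schema, then score pairs by the textual similarity of their attributes) can be illustrated with a minimal sketch. This is not PERF's implementation: the attribute names, the source names `db_a` and `db_b`, the field mappings, the helper functions, and the choice of `difflib.SequenceMatcher` with an unweighted mean are all assumptions made for illustration. Virtual attribute dependencies are left out because the abstract does not describe how they work.

```python
from difflib import SequenceMatcher

# Hypothetical reference schema; these attribute names are illustrative only.
REFERENCE_ATTRIBUTES = ("name", "organism", "sequence")

# Hypothetical mappings from each source's field names to the reference schema.
FIELD_MAPS = {
    "db_a": {"protein_name": "name", "species": "organism", "seq": "sequence"},
    "db_b": {"title": "name", "organism": "organism", "aa_sequence": "sequence"},
}


def to_reference(source, record):
    """Translate a source record (a mention) into a reference-schema instance."""
    mapping = FIELD_MAPS[source]
    return {mapping[key]: value for key, value in record.items() if key in mapping}


def attribute_similarity(a, b):
    """Case-insensitive textual similarity in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def duplicate_score(m1, m2):
    """Mean similarity over attributes both mentions carry; 0.0 if none are shared."""
    shared = [attr for attr in REFERENCE_ATTRIBUTES if attr in m1 and attr in m2]
    if not shared:
        return 0.0
    return sum(attribute_similarity(m1[a], m2[a]) for a in shared) / len(shared)


# Two mentions of the same protein from different sources, plus an unrelated one.
a = to_reference("db_a", {"protein_name": "Tumor protein p53", "species": "Homo sapiens"})
b = to_reference("db_b", {"title": "tumor protein P53", "organism": "homo sapiens"})
c = to_reference("db_b", {"title": "Hemoglobin subunit beta", "organism": "Mus musculus"})

print(duplicate_score(a, b))  # the two mentions differ only in letter case
print(duplicate_score(a, c))  # lower score for the unrelated mention
```

A threshold on such a score would then separate likely duplicates from non-duplicates; where that threshold sits, and how attributes should be weighted, would have to come from the paper's own evaluation rather than from this sketch.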