An important prerequisite to successfully integrating protein data is detecting duplicate records spread across different databases. In this paper, we describe a new framework for protein entity resolution, called PERF, which deduplicates protein mentions using a wide range of protein attributes. A mention refers to any recorded information about a protein, whether it is derived from a database, a high-throughput study, or literature text mining, among others. PERF can be easily extended to deduplicate protein-protein interactions (PPIs) as well. This framework translates mentions into instances of a reference schema to facilitate mention comparisons. PERF also uses "virtual attribute dependencies" to "enhance" mentions with additional attribute values. PERF computes a likelihood measure based upon the textual value similarity of mention attributes. A prototype implementation of the framework was tested, and these tests indicate that PERF can clearly separate duplicate mentions from non-duplicate mentions.
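To make the idea of scoring mentions by the textual similarity of their attributes concrete, the following is a minimal sketch and not PERF's actual likelihood measure, which the abstract does not specify. It assumes that mentions have already been translated into a common schema and are represented as dictionaries of string attributes. The function names `attribute_similarity` and `mention_similarity`, the unweighted average, and the toy records are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def attribute_similarity(a: str, b: str) -> float:
    """Case-insensitive textual similarity of two attribute values, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def mention_similarity(m1: dict, m2: dict) -> float:
    """Unweighted mean similarity over attributes that both mentions populate.

    This averaging rule is an assumption for illustration; it is not the
    likelihood measure defined by PERF.
    """
    shared = [k for k in m1 if k in m2 and m1[k] and m2[k]]
    if not shared:
        return 0.0  # nothing to compare
    total = sum(attribute_similarity(m1[k], m2[k]) for k in shared)
    return total / len(shared)


# Toy mentions (hypothetical values, not taken from any database).
a = {"name": "Tumor protein p53", "organism": "Homo sapiens", "gene": "TP53"}
b = {"name": "tumor protein P53", "organism": "homo sapiens", "gene": "TP53"}
c = {"name": "Hemoglobin subunit beta", "organism": "Mus musculus", "gene": "HBB"}

score_ab = mention_similarity(a, b)  # differ only in letter case
score_ac = mention_similarity(a, c)  # different protein
```

In this sketch the pair `a`, `b` scores higher than the pair `a`, `c`, and a threshold on the score would then separate likely duplicates from non-duplicates. A real system would probably weight attributes differently and handle missing values with more care.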