Automated identification of protein classification and detection of annotation errors in protein databases using statistical approaches

  • Authors:
  • Kang Ning;Hon Nian Chua

  • Affiliations:
  • School of Computing, National University of Singapore, Singapore;National University of Singapore

  • Venue:
  • KDLL'06 Proceedings of the 2006 international conference on Knowledge Discovery in Life Science Literature
  • Year:
  • 2006

Abstract

Because of the importance of proteins in the life sciences, biologists have spent the past few decades working to elucidate their structures, functions, and expression profiles in order to understand their roles in living cells. Protein databases are now widely used by biologists, so it is critical that the information researchers work with be as accurate as possible. However, these databases are growing rapidly, and existing protein databases are already known to contain annotation errors. In this paper, we investigate why protein databases contain mis-annotated sequence data. Then, using statistical approaches, we derive a method to automatically filter database entries and assess their reliability. This is important for providing accurate information to researchers, and it will help reduce further annotation errors that propagate from existing mis-annotated sequence data. Our initial experiments confirm our theoretical findings and show that our methods can effectively detect mis-annotated sequence data.