Identification of gene function using prediction by partial matching (PPM) language models

Authors:
Malika Mahoui;William John Teahan;Arvind Kumar Thirumalaiswamy Sekhar;Satyasaibabu Chilukuri
Affiliations:
IUPUI, Indianapolis, IN, USA;University of Wales, Bangor, Wales, United Kingdom;Dow AgroSciences, Indianapolis, IN, USA;IUPUI, Indianapolis, IN, USA
Venue:
Proceedings of the 17th ACM conference on Information and knowledge management
Year:
2008

Abstract

In this paper, we describe the use of text encoding and prediction by partial matching (PPM) language modeling to identify gene functions within abstracts of biomedical papers. The National Center for Biotechnology Information maintains "GeneRIF", a collection of the best possible functional representations for a subset of abstracts from PubMed. We use GeneRIF to evaluate our technique. We discuss the methodology adopted to construct the models that enable the Text Mining Toolkit to distinguish between gene functions and the rest of the abstract (non-gene functions). We also describe the similarity-based approach we apply to the list of automatically annotated functions to select the gene function most likely to be representative of the paper. The results indicate that our combined approach to identifying gene functions in scientific abstracts performs very well on both precision and recall, and therefore presents exciting opportunities for extracting other entities embedded in scientific text.
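
The model-based labeling described in the abstract can be illustrated with a minimal sketch: train one character-level model per class, then assign a sentence to the class whose model encodes it in the fewest bits. This is not the Text Mining Toolkit's API and it does not implement PPM's escape mechanism. As a simplification, it substitutes a fixed-order character model with add-one smoothing. The names `CharModel` and `classify`, the model order, and the toy training sentences are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


class CharModel:
    """Fixed-order character model with add-one smoothing.

    A simplified stand-in for a PPM model (assumption: no escape
    mechanism and no blending of context orders).
    """

    def __init__(self, order=2, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size
        # Maps a context string to counts of the characters that followed it.
        self.counts = defaultdict(Counter)

    def train(self, text):
        # Pad with spaces so the first characters also have a context.
        padded = " " * self.order + text
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            self.counts[context][padded[i]] += 1

    def cost_bits(self, text):
        """Return the number of bits needed to encode text under this model."""
        padded = " " * self.order + text
        bits = 0.0
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            seen = self.counts.get(context)
            count = seen[padded[i]] if seen else 0
            total = sum(seen.values()) if seen else 0
            probability = (count + 1) / (total + self.alphabet_size)
            bits -= math.log2(probability)
        return bits


def classify(sentence, models):
    """Return the label of the model that encodes the sentence most cheaply."""
    return min(models, key=lambda label: models[label].cost_bits(sentence))


if __name__ == "__main__":
    # Toy training data, invented for illustration only.
    function_model = CharModel()
    for s in ["The protein regulates apoptosis in T cells.",
              "This gene is required for DNA repair."]:
        function_model.train(s)

    other_model = CharModel()
    for s in ["Samples were collected from forty patients.",
              "Statistical analysis was performed using ANOVA."]:
        other_model.train(s)

    models = {"function": function_model, "non-function": other_model}
    print(classify("The gene regulates DNA repair in cells.", models))
```

A real system would train each model on far more text. The sketch only shows the decision rule: the lower the encoding cost under a class model, the better the sentence fits that class.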