Substring selection for biomedical document classification

Authors:
Bo Han;Zoran Obradovic;Zhang-Zhi Hu;Cathy H. Wu;Slobodan Vucetic
Affiliations:
Center for Information Science and Technology, Temple University, Philadelphia, PA 19122, USA;Center for Information Science and Technology, Temple University, Philadelphia, PA 19122, USA;Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, Washington, DC 20007, USA;Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, Washington, DC 20007, USA;Center for Information Science and Technology, Temple University, Philadelphia, PA 19122, USA
Venue:
Bioinformatics
Year:
2006

Abstract

Motivation: Attribute selection is a critical step in the development of document classification systems. As a standard practice, words are stemmed and the most informative stems are used as attributes in classification. Owing to the high complexity of biomedical terminology, general-purpose stemming algorithms are often conservative and can also remove informative stems. This can reduce accuracy, especially when the number of labeled documents is small. To address this issue, we propose an algorithm that omits stemming and instead uses the most discriminative substrings as attributes.

Results: The approach was tested on five annotated sets of abstracts from iProLINK that report experimental evidence for five types of protein post-translational modifications. The experiments showed that Naive Bayes and support vector machine classifiers perform consistently better when using the proposed attribute selection [area under the ROC curve (AUC) in the range 0.92 to 0.97] than when using attributes obtained by the Porter stemmer algorithm (AUC in the range 0.86 to 0.93). The proposed approach is particularly useful when labeled datasets are small.

Contact: vucetic@ist.temple.edu

Supplementary Information: The supplementary data are available from www.ist.temple.edu/PIRsupplement
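
The idea of using discriminative substrings instead of stems can be illustrated with a minimal sketch. The abstract does not state how substrings are scored or what lengths are considered, so everything below is an assumption made for illustration: the function names (`substrings`, `information_gain`, `select_substrings`), the length bounds of 3 to 6 characters, the use of information gain as the discriminativeness measure, and the toy documents. It is not the authors' algorithm.

```python
import math
from collections import Counter


def substrings(text, min_len=3, max_len=6):
    """Return the set of distinct character substrings of a document.

    The length bounds are illustrative assumptions, not values from the paper.
    """
    text = text.lower()
    return {
        text[i:i + n]
        for n in range(min_len, max_len + 1)
        for i in range(len(text) - n + 1)
    }


def entropy(pos, neg):
    """Binary entropy (in bits) of a set with pos positive and neg negative items."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(n_pos, n_neg, df_pos, df_neg):
    """Information gain of the binary attribute 'substring occurs in document'.

    df_pos and df_neg are the numbers of positive and negative documents
    that contain the substring.
    """
    n = n_pos + n_neg
    with_s = df_pos + df_neg
    without_s = n - with_s
    conditional = (
        with_s / n * entropy(df_pos, df_neg)
        + without_s / n * entropy(n_pos - df_pos, n_neg - df_neg)
    )
    return entropy(n_pos, n_neg) - conditional


def select_substrings(docs, labels, k=5, min_len=3, max_len=6):
    """Rank all substrings by information gain and return the top k.

    Ties are broken alphabetically so the result is deterministic.
    """
    doc_sets = [substrings(d, min_len, max_len) for d in docs]
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    df_pos, df_neg = Counter(), Counter()
    for subs, y in zip(doc_sets, labels):
        (df_pos if y else df_neg).update(subs)
    scores = {
        s: information_gain(n_pos, n_neg, df_pos[s], df_neg[s])
        for s in set(df_pos) | set(df_neg)
    }
    return sorted(scores, key=lambda s: (-scores[s], s))[:k]


# Toy data (hypothetical): 1 = reports a modification, 0 = does not.
docs = [
    "phosphorylation of serine residues",
    "tyrosine phosphorylated by kinase",
    "protein structure determined by NMR",
    "gene expression in yeast cells",
]
labels = [1, 1, 0, 0]

if __name__ == "__main__":
    print(select_substrings(docs, labels, k=5))
```

In this toy example, substrings such as "phos" occur in both positive documents ("phosphorylation", "phosphorylated") without any stemming step, which is the kind of shared fragment a general-purpose stemmer may fail to expose. The selected substrings could then serve as binary attributes for a Naive Bayes or support vector machine classifier.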