Comparing a rule-based versus statistical system for automatic categorization of MEDLINE documents according to biomedical specialty

  • Authors:
  • Susanne M. Humphrey, Aurélie Névéol, Allen Browne, Julien Gobeil, Patrick Ruch, Stéfan J. Darmoni

  • Affiliations:
  • U.S. National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894 (retired)
  • U.S. National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894
  • U.S. National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894
  • Medical Informatics Service, University and University Hospitals of Geneva, CH-1211 Geneva 14, Switzerland
  • BiTeM Group, Information Science Department, University of Applied Science, Geneva, 7 Drize, 1227 Carouge, Switzerland
  • CISMeF Group, Rouen University Hospital & GCSIS, LITIS EA 4108, Institute of BioMedical Research, University of Rouen, 1 rue de Germont, 76031 Rouen Cedex, France

  • Venue:
  • Journal of the American Society for Information Science and Technology
  • Year:
  • 2009

Abstract

Automatic document categorization is an important research problem in Information Science and Natural Language Processing. Many applications, including Word Sense Disambiguation and Information Retrieval in large collections, can benefit from such categorization. This paper focuses on automatic categorization of documents from the biomedical literature into broad discipline-based categories. Two different systems are described and contrasted: CISMeF, which uses rules based on human indexing of the documents with the Medical Subject Headings (MeSH) controlled vocabulary to assign metaterms (MTs), and Journal Descriptor Indexing (JDI), which is based on human categorization of about 4,000 journals and on statistical associations between journal descriptors (JDs) and textwords in the documents. We evaluate and compare the performance of these systems against a gold standard of manually assigned categories for 100 MEDLINE documents, using six measures selected from trec_eval. The results show that performance is comparable for five of the measures and that JDI is superior for the sixth. We conclude that these results favor JDI, given the significantly greater intellectual overhead involved in human indexing and in maintaining a rule base for mapping MeSH terms to MTs. We also note a JDI variant that associates JDs with MeSH indexing rather than with textwords; it may be worthwhile to investigate whether this statistical JDI method and the rule-based CISMeF could be combined and evaluated to determine whether they complement one another. © 2009 Wiley Periodicals, Inc.
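
To illustrate the kind of word-to-descriptor association that JDI relies on, the sketch below ranks journal descriptors (JDs) for a document by averaging word-JD association weights over the document's textwords. The descriptor names, weights, tokenization, and averaging scheme are hypothetical placeholders for illustration only, not the trained JDI model or its actual scoring function.

    # Minimal sketch of a JDI-style statistical ranking of journal descriptors (JDs).
    # The word-JD weights below are hypothetical placeholders; real JDI weights are
    # derived from MEDLINE training data and journal-level categorization.
    from collections import defaultdict

    WORD_JD_WEIGHTS = {
        "myocardial": {"Cardiology": 0.92, "Pediatrics": 0.03},
        "infarction": {"Cardiology": 0.88, "Neurology": 0.10},
        "seizure":    {"Neurology": 0.95, "Pediatrics": 0.21},
    }

    def rank_journal_descriptors(text):
        """Score each JD by averaging word-JD association weights over the
        document's textwords, then return JDs ranked by score (highest first)."""
        tokens = [t.lower().strip(".,;:()") for t in text.split()]
        totals = defaultdict(float)
        matched = 0
        for token in tokens:
            weights = WORD_JD_WEIGHTS.get(token)
            if weights is None:
                continue
            matched += 1
            for jd, weight in weights.items():
                totals[jd] += weight
        scores = {jd: s / matched for jd, s in totals.items()} if matched else {}
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)

    if __name__ == "__main__":
        doc = "Acute myocardial infarction following a febrile seizure."
        for jd, score in rank_journal_descriptors(doc):
            print(f"{jd}\t{score:.2f}")

A ranked list of this kind is the sort of system output that the paper's evaluation compares against the manually assigned gold-standard categories using the selected trec_eval measures.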