A balanced approach to health information evaluation: A vocabulary-based naïve Bayes classifier and readability formulas

  • Authors:
  • Gondy Leroy;Trudi Miller;Graciela Rosemblat;Allen Browne

  • Affiliations:
  • School of Information Systems and Technology, Claremont Graduate University, 130 E. Ninth Street, Claremont, CA 91730;School of Information Systems and Technology, Claremont Graduate University, 130 E. Ninth Street, Claremont, CA 91730;Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, Bethesda, MD 20894;Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, Bethesda, MD 20894

  • Venue:
  • Journal of the American Society for Information Science and Technology
  • Year:
  • 2008

Abstract

Since millions seek health information online, it is vital for this information to be comprehensible. Most studies use readability formulas, which ignore vocabulary, and conclude that online health information is too difficult. We developed a vocabulary-based, naïve Bayes classifier to distinguish between three difficulty levels in text. It proved 98% accurate in a 250-document evaluation. We compared our classifier with readability formulas for 90 new documents with different origins and asked representative human evaluators, an expert and a consumer, to judge each document. Average readability grade levels for educational and commercial pages were 10th grade or higher, too difficult according to current literature. In contrast, the classifier showed that 70-90% of these pages were written at an intermediate, appropriate level, indicating that vocabulary usage is frequently appropriate in text considered too difficult by readability formula evaluations. The expert considered the pages more difficult for a consumer than the consumer did. © 2008 Wiley Periodicals, Inc.
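
To make the idea of a vocabulary-based naïve Bayes classifier with three difficulty levels concrete, the following is a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the features, smoothing, or training corpus, so the whitespace tokenizer, the add-one (Laplace) smoothing, the class name `VocabularyNaiveBayes`, the label names, and the six toy training sentences are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

PUNCTUATION = ".,;:!?()\"'"


def tokenize(text):
    """Lowercase, split on whitespace, and trim surrounding punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]


class VocabularyNaiveBayes:
    """Multinomial naive Bayes over word counts (hypothetical sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # assumed add-one smoothing
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def fit(self, documents, labels):
        for text, label in zip(documents, labels):
            tokens = tokenize(text)
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        # Words never seen in training carry no evidence and are skipped.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            denominator = total_words + self.alpha * vocab_size
            for t in tokens:
                count = self.word_counts[label][t]
                score += math.log((count + self.alpha) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for illustration only.
train_docs = [
    "a cold makes your nose run and your throat hurt",
    "rest and drink water when you feel sick",
    "hypertension raises the risk of stroke and heart disease",
    "diabetes treatment includes insulin and regular blood glucose checks",
    "angiotensin receptor antagonists attenuate glomerular hyperfiltration",
    "cytokine mediated apoptosis modulates myocardial ischemia reperfusion injury",
]
train_labels = [
    "easy", "easy",
    "intermediate", "intermediate",
    "difficult", "difficult",
]

model = VocabularyNaiveBayes().fit(train_docs, train_labels)
print(model.predict("drink water and rest your throat"))
print(model.predict("insulin and blood glucose"))
print(model.predict("myocardial ischemia and apoptosis"))
```

Unlike a readability formula, which typically depends on sentence and word length, a classifier of this kind bases its decision only on which words appear, which is the contrast the abstract draws. A realistic version would need a far larger labeled corpus than the toy sentences used here.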