A balanced approach to health information evaluation: A vocabulary-based naïve Bayes classifier and readability formulas

Authors:
Gondy Leroy;Trudi Miller;Graciela Rosemblat;Allen Browne
Affiliations:
School of Information Systems and Technology, Claremont Graduate University, 130 E. Ninth Street, Claremont, CA 91730;School of Information Systems and Technology, Claremont Graduate University, 130 E. Ninth Street, Claremont, CA 91730;Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, Bethesda, MD 20894;Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, Bethesda, MD 20894
Venue:
Journal of the American Society for Information Science and Technology
Year:
2008

Abstract

Since millions seek health information online, it is vital for this information to be comprehensible. Most studies use readability formulas, which ignore vocabulary, and conclude that online health information is too difficult. We developed a vocabulary-based, naïve Bayes classifier to distinguish between three difficulty levels in text. It proved 98% accurate in a 250-document evaluation. We compared our classifier with readability formulas for 90 new documents with different origins and asked representative human evaluators, an expert and a consumer, to judge each document. Average readability grade levels for educational and commercial pages were 10th grade or higher, too difficult according to current literature. In contrast, the classifier showed that 70-90% of these pages were written at an intermediate, appropriate level, indicating that vocabulary usage is frequently appropriate in text considered too difficult by readability formula evaluations. The expert considered the pages more difficult for a consumer than the consumer did. © 2008 Wiley Periodicals, Inc.
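
The abstract does not describe the classifier's features, smoothing, or training corpus, so the following is only a minimal sketch of the general technique it names: a naïve Bayes classifier that assigns one of three difficulty levels from vocabulary alone. It assumes a multinomial model with add-one smoothing; the class name `VocabularyNaiveBayes`, the labels `easy`, `intermediate`, and `difficult`, and the three toy training snippets are all hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class VocabularyNaiveBayes:
    """Multinomial naive Bayes over word counts with add-one smoothing.

    Illustrative only: the paper's actual features and training setup
    are not given in the abstract.
    """

    def fit(self, documents, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter(labels)        # label -> number of documents
        for text, label in zip(documents, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Start from the log prior of the difficulty level.
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                if word not in self.vocabulary:
                    continue  # skip words never seen in training
                # Add the smoothed log likelihood of this word for the level.
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical toy training data, one snippet per difficulty level.
train_docs = [
    "a cold makes your nose run and your throat sore",
    "the flu virus can cause fever cough and muscle aches",
    "influenza pathogenesis involves hemagglutinin mediated viral entry",
]
train_labels = ["easy", "intermediate", "difficult"]

model = VocabularyNaiveBayes().fit(train_docs, train_labels)
prediction = model.predict("hemagglutinin mediated viral entry")
print(prediction)
```

Because the score depends only on which words appear, a long sentence built from common words can still be labeled as easy. This is the contrast the abstract draws with readability formulas, which ignore vocabulary.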