Automatically estimating the incidence of symptoms recorded in GP free text notes

Authors:
Rob Koeling; A. Rosemary Tate; John A. Carroll
Affiliations:
University of Sussex, Brighton, United Kingdom; University of Sussex, Brighton, United Kingdom; University of Sussex, Brighton, United Kingdom
Venue:
Proceedings of the first international workshop on Managing interoperability and complexity in health systems
Year:
2011

Abstract

The UK General Practice Research Database (GPRD) is a valuable source of information for health services research. It contains coded data supplemented by free text (physicians' notes and letters). However, due to the difficulty of extracting useful information and the cost of anonymisation, this text is seldom utilised in epidemiological research. We annotated the records of 344 women in the year prior to a diagnosis of ovarian cancer and developed a method for automatically detecting mentions of symptoms in text. We estimated the incidence of five commonly presenting symptoms using: (1) coded symptoms, (2) codes augmented by symptoms automatically extracted from text, and (3) a 'gold standard' dataset of codes and text tagged by three clinically trained annotators. The estimates of incidence of each symptom increased by at least 40% when coded information was enhanced using the manually tagged free text. Our automatic method extracted a significant proportion of this extra information. Our straightforward approach should be extremely useful for medical researchers who wish to validate studies based on codes, or to accurately assess symptoms, using information that can be automatically extracted from unanonymised free text.
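
The comparison of the three estimates described in the abstract can be illustrated with a minimal sketch. It assumes that incidence is taken as the proportion of the cohort with at least one record of a symptom, which the abstract does not state explicitly. The function names, patient identifiers, and counts below are hypothetical and are not taken from the study.

```python
from typing import Set


def incidence(patients_with_symptom: Set[str], cohort_size: int) -> float:
    """Proportion of the cohort with at least one record of the symptom."""
    if cohort_size <= 0:
        raise ValueError("cohort_size must be positive")
    return len(patients_with_symptom) / cohort_size


def relative_increase(base: float, enhanced: float) -> float:
    """Relative change of an enhanced estimate over a baseline estimate."""
    if base == 0:
        raise ValueError("baseline estimate must be non-zero")
    return (enhanced - base) / base


# Hypothetical toy cohort of 10 patients (not data from the paper).
cohort_size = 10

# Patients with the symptom recorded as a code.
coded = {"p1", "p2", "p3", "p4", "p5"}
# Patients with the symptom found in free text by an automatic extractor.
auto_text = {"p4", "p6"}
# Patients with the symptom found in free text by human annotators.
gold_text = {"p4", "p6", "p7"}

# (1) codes only, (2) codes plus automatic text, (3) codes plus annotated text.
inc_coded = incidence(coded, cohort_size)
inc_auto = incidence(coded | auto_text, cohort_size)
inc_gold = incidence(coded | gold_text, cohort_size)

# Share of the extra patients in the gold standard that the automatic
# method also recovers (both measured relative to codes alone).
extra_gold = (coded | gold_text) - coded
extra_auto = ((coded | auto_text) - coded) & extra_gold
extra_captured = len(extra_auto) / len(extra_gold)

print(f"codes only:        {inc_coded:.2f}")
print(f"codes + auto text: {inc_auto:.2f}")
print(f"codes + gold text: {inc_gold:.2f}")
print(f"increase (gold vs codes): {relative_increase(inc_coded, inc_gold):.0%}")
print(f"extra information captured automatically: {extra_captured:.0%}")
```

Taking the set union per patient avoids double counting a patient whose symptom appears both as a code and in free text, which seems the natural reading of "codes augmented by symptoms automatically extracted from text".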