Document sublanguage clustering to detect medical specialty in cross-institutional clinical texts

Authors:
Kristina Doing-Harris;Olga Patterson;Sean Igo;John Hurdle
Affiliations:
University of Utah Health Sciences Center, Salt Lake City, UT, USA;VA SLC Health Care, Salt Lake City, UT, USA;University of Utah Health Sciences Center, Salt Lake City, UT, USA;University of Utah Health Sciences Center, Salt Lake City, UT, USA
Venue:
Proceedings of the 7th international workshop on Data and text mining in biomedical informatics
Year:
2013

Citing 4
Cited 1

The structure of science information

Journal of Biomedical Informatics - Special issue: Sublanguage
Two biomedical sublanguages: a description based on the theories of Zellig Harris

Journal of Biomedical Informatics - Special issue: Sublanguage
Domain adaptation for statistical classifiers

Journal of Artificial Intelligence Research
Linguistic structure prediction with the sparseptron

XRDS: Crossroads, The ACM Magazine for Students - Scientific Computing

DTMBIO 2013: international workshop on data and text mining in biomedical informatics

Proceedings of the 22nd ACM international conference on Conference on information & knowledge management

Abstract

This paper reports on a set of studies designed to identify sublanguages in documents so that domain-specific processing can be applied across institutions. Psychological evidence indicates that humans use context-specific linguistic information when they read. Natural Language Processing (NLP) pipelines are likewise successful within specific domains (i.e., contexts). To limit the number of domain-specific NLP systems required, a natural focus is on sublanguages, which are identified by shared lexical and semantic features.[1] Patterson and Hurdle[2] developed a sublanguage identification system that functioned well for 12 clinical specialties at the University of Utah. The current work compares sublanguages across institutions. A new document corpus from the University of Pittsburgh (UPitt) was processed with the clinical NLP pipeline, and each new document was assigned to a cluster based on its minimum cosine distance to a Utah cluster centroid. The UPitt documents were divided into a nine-group specialty corpus. Across institutions, five of the nine specialty groups fell within the expected clusters. We find that clustering encounters difficulty for three reasons: documents with mixed sublanguages, naming convention differences across institutions, and document types that are used across specialties. The findings indicate that clinical specialty sublanguages can be identified across institutions.
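The assignment step described in the abstract, placing a new document in the cluster whose centroid lies at minimum cosine distance, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the three-dimensional feature vectors, and the specialty labels below are all hypothetical, and the abstract does not specify how documents are turned into feature vectors.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: a zero vector is treated as maximally dissimilar.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def assign_to_nearest_centroid(doc_vector, centroids):
    """Return (label, distance) for the centroid at minimum cosine distance."""
    return min(
        ((label, cosine_distance(doc_vector, c)) for label, c in centroids.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical centroids standing in for clusters learned at one institution.
centroids = {
    "cardiology": [0.9, 0.1, 0.0],
    "radiology": [0.1, 0.8, 0.3],
}

# Hypothetical feature vector for a document from another institution.
new_doc = [0.7, 0.2, 0.1]

label, distance = assign_to_nearest_centroid(new_doc, centroids)
print(label, round(distance, 3))
```

Cosine distance ignores vector length, so a long note and a short note with the same proportions of features land at the same distance from a centroid. A document with mixed sublanguages would sit at a similar distance from several centroids, which is one way the difficulty noted in the abstract could show up.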