Document sublanguage clustering to detect medical specialty in cross-institutional clinical texts

  • Authors:
  • Kristina Doing-Harris;Olga Patterson;Sean Igo;John Hurdle

  • Affiliations:
  • University of Utah Health Sciences Center, Salt Lake City, UT, USA;VA SLC Health Care, Salt Lake City, UT, USA;University of Utah Health Sciences Center, Salt Lake City, UT, USA;University of Utah Health Sciences Center, Salt Lake City, UT, USA

  • Venue:
  • Proceedings of the 7th international workshop on Data and text mining in biomedical informatics
  • Year:
  • 2013

Abstract

This paper reports on a set of studies designed to identify sublanguages in documents so that domain-specific processing can be applied across institutions. Psychological evidence indicates that humans use context-specific linguistic information when they read. Natural Language Processing (NLP) pipelines are successful within specific domains (i.e., contexts). To limit the number of domain-specific NLP systems, a natural focus would be on sublanguages. Sublanguages are identified by shared lexical and semantic features.[1] Patterson and Hurdle[2] developed a sublanguage identification system that functioned well for 12 clinical specialties at the University of Utah. The current work compares sublanguages across institutions. We processed a new document corpus from the University of Pittsburgh (UPitt) with a clinical NLP pipeline and assigned each new document to the cluster whose Utah centroid was at the minimum cosine distance. The UPitt documents were divided into a nine-group specialty corpus. Across institutions, five of the nine specialty groups fell within the expected clusters. We find that clustering encounters difficulty due to documents with mixed sublanguages, differences in naming conventions across institutions, and document types that are used across specialties. The findings indicate that clinical specialty sublanguages can be identified across institutions.
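
The assignment step described in the abstract, placing each new document in the cluster whose centroid is at minimum cosine distance, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `assign_to_nearest_centroid`, the toy vectors, and the use of NumPy are assumptions made for illustration, and the feature vectors stand in for whatever lexical and semantic features the pipeline produces. The sketch also assumes that no vector is all zeros.

```python
import numpy as np


def assign_to_nearest_centroid(doc_vectors, centroids):
    """Return, for each document, the index of the closest centroid by
    cosine distance, plus the full distance matrix.

    Hypothetical helper: rows of both inputs are feature vectors and are
    assumed to be non-zero.
    """
    docs = np.asarray(doc_vectors, dtype=float)
    cents = np.asarray(centroids, dtype=float)

    # Scale every row to unit length so a dot product equals cosine similarity.
    docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    cents_unit = cents / np.linalg.norm(cents, axis=1, keepdims=True)

    # Cosine distance = 1 - cosine similarity; shape (n_docs, n_centroids).
    distances = 1.0 - docs_unit @ cents_unit.T

    # Pick the centroid with the smallest distance for each document.
    return distances.argmin(axis=1), distances


# Toy data (invented): two existing cluster centroids and two new documents.
centroids = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 1.0]])
new_docs = np.array([[2.0, 0.1, 0.0],
                     [0.0, 3.0, 2.0]])

labels, distances = assign_to_nearest_centroid(new_docs, centroids)
print(labels)
```

Because cosine distance ignores vector length, a long document and a short one with the same feature proportions land in the same cluster, which is one reason this measure is a common choice for text.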