Anaphoric reference in clinical reports: Characteristics of an annotated corpus

  • Authors:
  • Wendy W. Chapman; Guergana K. Savova; Jiaping Zheng; Melissa Tharp; Rebecca Crowley

  • Affiliations:
  • University of California, San Diego, Division of Biomedical Informatics, 9500 Gilman Drive #0505, La Jolla, CA 92093, United States; Children's Hospital Boston and Harvard Medical School, Boston, MA 02114, United States; University of Massachusetts Amherst, 140 Governors Drive, Amherst, MA 01003-9264, United States; University of California, San Diego, Division of Biomedical Informatics, 9500 Gilman Drive #0505, La Jolla, CA 92093, United States; University of Pittsburgh, Department of Biomedical Informatics, Pittsburgh, PA 15260, United States

  • Venue:
  • Journal of Biomedical Informatics
  • Year:
  • 2012

Abstract

Motivation: Expressions that refer to a real-world entity already mentioned in a narrative are often considered anaphoric. For example, in the sentence "The pain comes and goes," the expression "the pain" is probably referring to a previous mention of pain. Interpretation of meaning involves resolving the anaphoric reference: deciding which expression in the text is the correct antecedent of the referring expression, also called an anaphor. We annotated a set of 180 clinical reports (surgical pathology, radiology, discharge summaries, and emergency department) from two institutions to indicate all anaphor-antecedent pairs.

Objective: The objective of this study is to describe the characteristics of the corpus in terms of the frequency of anaphoric relations, the syntactic and semantic nature of the members of the pairs, and the types of anaphoric relations that occur. Understanding how anaphoric reference is exhibited in clinical reports is critical to developing reference resolution algorithms and to identifying peculiarities of clinical text that may alter the features and methodologies that will be successful for automated anaphora resolution.

Results: We found that anaphoric reference is prevalent in all types of clinical reports, that annotations of noun phrases, semantic type, and section headings may be especially important for automated resolution of anaphoric reference, and that separate modules for reference resolution may be required for different report types, different institutions, and different types of anaphors. Accurate resolution will probably require extensive domain knowledge, especially for pathology and radiology reports with more part/whole and set/subset relations.

Conclusion: We hope researchers will leverage the annotations in this corpus to develop automated algorithms and will add to the annotations to generate a more extensive corpus.