A word clustering approach to domain adaptation: effective parsing of biomedical texts

  • Authors:
  • Marie Candito;Enrique Henestroza Anguiano;Djamé Seddah

  • Affiliations:
  • Alpage (Univ. Paris Diderot & INRIA), Paris, France;Alpage (Univ. Paris Diderot & INRIA), Paris, France;Alpage (Univ. Paris Diderot & INRIA), Paris, France, and Univ. Paris Sorbonne, Paris, France

  • Venue:
  • IWPT '11 Proceedings of the 12th International Conference on Parsing Technologies
  • Year:
  • 2011

Abstract

We present a simple and effective way to perform out-of-domain statistical parsing by drastically reducing lexical data sparseness in a PCFG-LA architecture. We replace terminal symbols with unsupervised word clusters acquired from a large newspaper corpus augmented with biomedical target-domain data. The resulting clusters are effective in bridging the lexical gap between source-domain and target-domain vocabularies. Our experiments combine known self-training techniques with unsupervised word clustering and produce promising results, achieving an error reduction of 21% on a new evaluation set for biomedical text with manual bracketing annotations.
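
As a rough illustration of the terminal-replacement idea described in the abstract, the sketch below maps each token to a cluster identifier before it would be handed to a parser. It is not the authors' implementation: the cluster map `TOY_CLUSTERS`, the identifiers in it, the `C_UNK` fallback, and the function name `to_cluster_terminals` are all hypothetical, and in the paper's setting the clusters would be induced without supervision from newspaper text augmented with biomedical text.

```python
from typing import Dict, List

# Hypothetical word-to-cluster map with toy values (assumption, not from the paper).
# A real map would be induced by unsupervised clustering over a large corpus.
TOY_CLUSTERS: Dict[str, str] = {
    "drug": "C0110",        # assumed to be frequent in the source (newspaper) domain
    "inhibitor": "C0110",   # assumed to be frequent in the target (biomedical) domain
    "blocks": "C1010",
    "suppresses": "C1010",
}

# Hypothetical fallback symbol for words that have no cluster.
UNKNOWN = "C_UNK"


def to_cluster_terminals(
    tokens: List[str],
    clusters: Dict[str, str],
    unknown: str = UNKNOWN,
) -> List[str]:
    """Replace each token with its cluster id, or with `unknown` if it has none."""
    return [clusters.get(tok.lower(), unknown) for tok in tokens]


source_sentence = to_cluster_terminals(["The", "drug", "blocks"], TOY_CLUSTERS)
target_sentence = to_cluster_terminals(["The", "inhibitor", "suppresses"], TOY_CLUSTERS)

print(source_sentence)
print(target_sentence)
```

In this toy setup the two sentences reduce to the same sequence of terminal symbols, which is the sense in which shared clusters can bridge source-domain and target-domain vocabularies: a grammar trained on cluster terminals never needs to have seen the target-domain word itself.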