Finding new terminology in very large corpora

  • Authors:
  • Joachim Wermter; Udo Hahn

  • Affiliations:
  • Jena University Language and Information Engineering (JULIE) Lab, Jena, Germany; Jena University Language and Information Engineering (JULIE) Lab, Jena, Germany

  • Venue:
  • Proceedings of the 3rd International Conference on Knowledge Capture
  • Year:
  • 2005

Abstract

Most technical and scientific terms are complex, multi-word noun phrases, but certainly not all noun phrases are technical or scientific terms. Specific terminology can be distinguished from common, non-specific noun phrases based on the observation that terms show a much lower degree of distributional variation than non-specific noun phrases. We formalize this limited paradigmatic modifiability of terms and then test the corresponding algorithm on bigram, trigram and quadgram noun phrases extracted from a 104-million-word biomedical text corpus. Using an existing biomedical terminology, curated community-wide, as the evaluation gold standard, we show that our algorithm significantly outperforms standard term identification measures and therefore qualifies as a high-performance building block for any terminology identification system. We also provide empirical evidence that the superiority of our approach, beyond a 10-million-word threshold, is essentially domain- and corpus-size-independent.
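
The abstract does not give the formula, so the following is only a rough sketch of how "limited paradigmatic modifiability" might be scored, not the authors' exact definition. It assumes that an n-gram is more term-like when replacing any subset of its word slots rarely yields other attested n-grams. The function name `modifiability_score`, the product-of-ratios form, and the toy counts are all illustrative assumptions.

```python
from collections import Counter
from itertools import combinations


def modifiability_score(ngram, corpus_counts):
    """Hypothetical modifiability score (an assumption, not the paper's formula).

    For every non-empty set of slots treated as wildcards, take the ratio
    count(ngram) / count(all n-grams agreeing with ngram on the other slots),
    and multiply the ratios. Values near 1 mean the slots are rarely varied.
    """
    n = len(ngram)
    freq = corpus_counts[ngram]
    if freq == 0:
        return 0.0
    score = 1.0
    for k in range(1, n + 1):
        for slots in combinations(range(n), k):
            # Total frequency of same-length n-grams that match `ngram`
            # on every position that is NOT a wildcard slot.
            pattern_freq = sum(
                count
                for other, count in corpus_counts.items()
                if len(other) == n
                and all(other[i] == ngram[i] for i in range(n) if i not in slots)
            )
            score *= freq / pattern_freq
    return score


# Toy bigram counts, invented purely for illustration.
toy_counts = Counter({
    ("stem", "cell"): 6,
    ("new", "cell"): 2,
    ("new", "method"): 2,
})

# Rank candidates from most to least term-like under this sketch.
ranking = sorted(toy_counts, key=lambda g: modifiability_score(g, toy_counts), reverse=True)
```

On these toy counts, the fixed expression ("stem", "cell") ranks above ("new", "cell"), whose modifier slot varies freely. The linear scan per pattern is fine for a demonstration; a corpus of the size mentioned in the abstract would need pre-aggregated pattern counts.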