Word Sense Disambiguation in biomedical ontologies with term co-occurrence analysis and document clustering

  • Authors:
  • Bill Andreopoulos;Dimitra Alexopoulou;Michael Schroeder

  • Affiliations:
  • Biotechnological Centre, Technischen Universitat Dresden, Germany.;Biotechnological Centre, Technischen Universitat Dresden, Germany.;Biotechnological Centre, Technischen Universitat Dresden, Germany

  • Venue:
  • International Journal of Data Mining and Bioinformatics
  • Year:
  • 2008

Abstract

With more and more genomes being sequenced, a lot of effort is devoted to their annotation with terms from controlled vocabularies such as the Gene Ontology. Manual annotation based on relevant literature is tedious, but automation of this process is difficult. One particularly challenging problem is word sense disambiguation. Terms such as 'development' can refer to developmental biology or to the more general sense. Here, we present two approaches to address this problem by using term co-occurrences and document clustering. To evaluate our method we defined a corpus of 331 documents on development and developmental biology. Term co-occurrence analysis achieves an F-measure of 77%. Additionally, applying document clustering improves precision to 82%. We applied the same approach to disambiguate 'nucleus', 'transport', and 'spindle', and we achieved consistent results. Thus, our method is a viable approach towards the automation of literature-based genome annotation.
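
The abstract names term co-occurrence and the F-measure without giving details, so the following is only a minimal illustrative sketch, not the authors' method. It assumes a simple scheme in which each sense has a profile of co-occurring terms built from example documents, and a new document is assigned the sense whose profile it overlaps most. The sense labels, the toy documents, and the function names (`cooccurrence_profile`, `disambiguate`, `f_measure`) are all hypothetical.

```python
from collections import Counter


def cooccurrence_profile(documents):
    """Count, per term, the number of example documents it appears in."""
    counts = Counter()
    for tokens in documents:
        counts.update(set(tokens))
    return counts


def disambiguate(tokens, profiles):
    """Return the sense whose co-occurrence profile best overlaps the document.

    The score is the share of a sense's profile mass covered by the
    document's terms (an assumed scoring rule, chosen for simplicity).
    """
    scores = {}
    for sense, profile in profiles.items():
        total = sum(profile.values()) or 1
        scores[sense] = sum(profile[t] for t in set(tokens)) / total
    return max(scores, key=scores.get)


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy training documents for two senses of 'development'.
training = {
    "developmental_biology": [
        ["embryo", "development", "gene", "morphogenesis"],
        ["development", "embryo", "cell", "differentiation"],
    ],
    "general": [
        ["software", "development", "method"],
        ["development", "tool", "software", "pipeline"],
    ],
}
profiles = {sense: cooccurrence_profile(docs) for sense, docs in training.items()}
predicted = disambiguate(["embryo", "development", "cell"], profiles)
```

In this toy setup the ambiguous term itself appears in both profiles, so the decision rests on the surrounding terms such as 'embryo' and 'cell'. A real system would need a much larger corpus and a more careful weighting of terms.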