SimSem: fast approximate string matching in relation to semantic category disambiguation

  • Authors:
  • Pontus Stenetorp;Sampo Pyysalo;Jun'ichi Tsujii

  • Affiliations:
  • The University of Tokyo, Tokyo, Japan;The University of Tokyo, Tokyo, Japan;Microsoft Research Asia, Beijing, People's Republic of China

  • Venue:
  • BioNLP '11 Proceedings of BioNLP 2011 Workshop
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

In this study we investigate the merits of fast approximate string matching to address challenges relating to spelling variants and to utilise large-scale lexical resources for semantic class disambiguation. We integrate string matching results into machine learning-based disambiguation through the use of a novel set of features that represent the distance of a given textual span to the closest match in each of a collection of lexical resources. We collect lexical resources for a multitude of semantic categories from a variety of biomedical domain sources. The combined resources, containing more than twenty million lexical items, are queried using a recently proposed fast and efficient approximate string matching algorithm that allows us to query large resources without severely impacting system performance. We evaluate our results on six corpora representing a variety of disambiguation tasks. While the integration of approximate string matching features is shown to substantially improve performance on one corpus, results are modest or negative for others. We suggest possible explanations and future research directions. Our lexical resources and implementation are made freely available for research purposes at: http://github.com/ninjin/simsem