Thesaurus extension using web search engines

  • Authors:
  • Robert Meusel;Mathias Niepert;Kai Eckert;Heiner Stuckenschmidt

  • Affiliations:
  • KR & KM Research Group, University of Mannheim, Germany;KR & KM Research Group, University of Mannheim, Germany;KR & KM Research Group, University of Mannheim, Germany;KR & KM Research Group, University of Mannheim, Germany

  • Venue:
  • ICADL'10 Proceedings of the role of digital libraries in a time of global change, and 12th international conference on Asia-Pacific digital libraries
  • Year:
  • 2010

Abstract

Maintaining and extending large thesauri is an important challenge facing digital libraries and IT businesses alike. In this paper we describe a method that builds on and extends existing methods from the areas of thesaurus maintenance, natural language processing, and machine learning to (a) extract a set of novel candidate concepts from text corpora and (b) generate a small ranked list of suggestions for the position of these concepts in an existing thesaurus. Based on a modification of the standard tf-idf term weighting, we extract relevant concept candidates from a document corpus. We then apply a pattern-based machine learning approach to content extracted from web search engine snippets to determine the type of relation between the candidate terms and existing thesaurus concepts. The approach is evaluated in a large-scale experiment using the MeSH and WordNet thesauri as testbeds.
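
For orientation, the candidate-extraction step can be illustrated with a minimal sketch of the standard tf-idf weighting. The abstract does not specify the authors' modification, so this is plain textbook tf-idf, not their method. The function names, the whitespace tokenizer, and the rule of keeping each term's best per-document weight are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(corpus):
    """Return, per document, a dict mapping term -> standard tf-idf weight."""
    n_docs = len(corpus)
    # Naive whitespace tokenization (an assumption; real systems do more).
    tokenized = [doc.lower().split() for doc in corpus]

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores


def top_candidates(corpus, known_concepts, k=5):
    """Rank terms absent from the thesaurus by their best tf-idf weight."""
    best = {}
    for doc_scores in tf_idf_scores(corpus):
        for term, weight in doc_scores.items():
            if term not in known_concepts:
                best[term] = max(best.get(term, 0.0), weight)
    # Sort by descending weight, breaking ties alphabetically.
    return sorted(best, key=lambda t: (-best[t], t))[:k]


corpus = [
    "aspirin treats headache",
    "aspirin treats fever",
    "ibuprofen treats fever",
]
known = {"aspirin", "treats"}  # hypothetical existing thesaurus concepts
print(top_candidates(corpus, known, k=2))
```

Terms that occur in every document (here "treats") receive a weight of zero, while terms concentrated in few documents rank highest, which is why tf-idf is a common starting point for proposing new concept candidates.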