An efficient algorithm for building a distributional thesaurus (and other Sketch Engine developments)

  • Authors:
  • Pavel Rychlý;Adam Kilgarriff

  • Affiliations:
  • Masaryk University, Brno, Czech Republic;Lexical Computing Ltd, Brighton, UK

  • Venue:
  • ACL '07 Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

Gorman and Curran (2006) argue that thesaurus generation for billion+-word corpora is problematic as the full computation takes many days. We present an algorithm with which the computation takes under two hours. We have created, and made publicly available, thesauruses based on large corpora for (at time of writing) seven major world languages. The development is implemented in the Sketch Engine (Kilgarriff et al., 2004). Another innovative development in the same tool is the presentation of the grammatical behaviour of a word against the background of how all other words of the same word class behave. Thus, the English noun constraint occurs 75% in the plural. Is this a salient lexical fact? To form a judgement, we need to know the distribution for all nouns. We use histograms to present the distribution in a way that is easy to grasp.