Distributional lexical semantics for stop lists

  • Authors:
  • Neil Cooke; Lee Gillam

  • Affiliations:
  • University of Surrey; University of Surrey

  • Venue:
  • IRSG'08 Proceedings of the 2008 BCS-IRSG conference on Corpus Profiling
  • Year:
  • 2008

Abstract

In this paper, we consider techniques that lead naturally towards using distributional lexical semantics for the automatic construction of corpus-specific stop word lists. We propose and evaluate a method for deriving stop words from collocation and frequency information, and from comparisons of distributions within and across samples. The method is tested against the Enron email corpus and the MuchMore Springer Bilingual Corpus of medical abstracts. We identify some of the data cleansing challenges posed by the Enron corpus, and in particular how these relate to the profile of a corpus. We further consider how the behaviours of subsamples of such a corpus can and should be investigated, to ascertain whether the lexical semantic techniques employed might identify and classify variations in the contextual use of keywords, helping towards content separation in "unclean" collections. The challenge here is separating keywords that occur in the same or very similar contexts, which may be conceived of as a "pragmatic difference". Such work may also be applicable to initiatives that focus on constructing (clean) corpora from the web, deriving knowledge resources from wikis, and finding key information within other textual social media.
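
The abstract describes the method only at a high level. As a minimal sketch of the general idea, not the authors' algorithm, the following Python fragment ranks stop-word candidates by combining overall frequency with how evenly a term is spread across subsamples of the collection. The tokenizer, the round-robin subsampling scheme, and the evenness score (normalised entropy of per-sample relative frequencies) are illustrative assumptions rather than details taken from the paper.

```python
from collections import Counter
import math
import re

def tokenize(text):
    # Simplistic lowercase word tokenizer, for illustration only.
    return re.findall(r"[a-z']+", text.lower())

def candidate_stop_words(documents, n_samples=10, top_k=50):
    """Rank terms as stop-word candidates.

    Intuition: corpus-specific stop words are highly frequent AND
    distributed evenly across subsamples, whereas content words tend
    to cluster in particular documents or topics.
    """
    # Partition the collection into roughly equal subsamples (round-robin).
    samples = [documents[i::n_samples] for i in range(n_samples)]
    sample_counts = [Counter(t for doc in s for t in tokenize(doc))
                     for s in samples]
    totals = Counter()
    for counts in sample_counts:
        totals.update(counts)

    scores = {}
    for term, freq in totals.items():
        if freq < n_samples:
            continue  # too rare to judge its distribution reliably
        # Share of the term's occurrences falling in each subsample.
        probs = [counts[term] / freq for counts in sample_counts]
        # Normalised entropy: 1.0 means a perfectly even spread.
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        evenness = entropy / math.log(n_samples)
        # Frequent and evenly spread terms score highest.
        scores[term] = math.log(freq) * evenness
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Usage: pass raw document strings, e.g. email bodies or abstracts.
# stop_list = candidate_stop_words(email_bodies, n_samples=10, top_k=100)
```

Comparing the per-sample distributions, rather than using frequency alone, is what makes the resulting list corpus-specific: a term like "enron" would be frequent and evenly spread in the Enron collection, and so surface as a stop-word candidate there but not in a medical corpus.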