In this paper, we consider techniques that lead naturally towards using distributional lexical semantics for the automatic construction of corpus-specific stop word lists. We propose and evaluate a method for identifying stop words based on collocation, frequency information, and comparisons of distributions within and across samples. The method is tested against the Enron email corpus and the MuchMore Springer Bilingual Corpus of medical abstracts. We identify some of the data cleansing challenges posed by the Enron corpus, and in particular how these necessarily relate to the profile of a corpus. We further consider how the behaviour of subsamples of such a corpus can and should be investigated, to ascertain whether the lexical semantic techniques employed might identify and classify variations in the contextual use of keywords, which could help towards content separation in "unclean" collections. The challenge here is separating keywords that occur in the same or very similar contexts, which may be conceived as a "pragmatic difference". Such work may also be applicable to initiatives that focus on constructing (clean) corpora from the web, deriving knowledge resources from wikis, and finding key information within other textual social media.
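The abstract does not give the exact formulation of the method, but one common way to combine frequency information with a comparison of distributions across samples is to score each term by its overall frequency weighted by how evenly it spreads over subsamples (normalized entropy of the term's per-sample distribution). The sketch below illustrates that general idea only; the function name, scoring formula, and toy data are illustrative assumptions, not the authors' algorithm.

```python
import math
from collections import Counter

def stopword_candidates(samples, top_n=5):
    """Rank terms as stop word candidates: high overall frequency
    combined with an even spread across subsamples (high normalized
    entropy of the term's per-sample occurrence distribution).
    Illustrative sketch only, not the paper's actual method."""
    counts = [Counter(doc) for doc in samples]
    total = Counter()
    for c in counts:
        total.update(c)

    scores = {}
    for term, freq in total.items():
        # Fraction of this term's occurrences falling in each subsample.
        dist = [c[term] / freq for c in counts if c[term] > 0]
        if len(counts) > 1:
            # Normalized entropy: 1.0 means a perfectly even spread.
            h = -sum(p * math.log(p) for p in dist) / math.log(len(counts))
        else:
            h = 1.0
        # Stop-word-like terms are both frequent and evenly spread.
        scores[term] = freq * h
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy subsamples standing in for slices of a larger corpus.
samples = [
    "the cat sat on the mat".split(),
    "the dog and the cat played".split(),
    "a dog chased the ball near the gate".split(),
]
print(stopword_candidates(samples, top_n=3))  # 'the' ranks first
```

Terms appearing in only one subsample get zero normalized entropy and so score zero, which captures the intuition that content words cluster in particular documents while function words appear everywhere.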