In this paper, we consider techniques that lead naturally towards using distributional lexical semantics for the automatic construction of corpus-specific stop word lists. We propose and evaluate a method for identifying stop words based on collocation, frequency information, and comparisons of distributions within and across samples. The method is tested against the Enron email corpus and the MuchMore Springer Bilingual Corpus of medical abstracts. We identify some of the data cleansing challenges posed by the Enron corpus, and in particular how these necessarily relate to the profile of a corpus. We further consider how the behaviour of subsamples of such a corpus can and should be investigated, to ascertain whether the lexical semantic techniques employed might identify and classify variations in the contextual use of keywords, which could help towards content separation in "unclean" collections. The challenge here is separating keywords that occur in the same or very similar contexts, which may be conceived as a "pragmatic difference". Such work may also be applicable to initiatives that focus on constructing (clean) corpora from the web, deriving knowledge resources from wikis, and finding key information within other textual social media.
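The abstract does not give the exact formulation of the method, but one common way to combine frequency information with a comparison of distributions across samples is to score each term by its overall frequency weighted by how evenly it spreads over subsamples (normalized entropy of the term's per-sample distribution). The sketch below illustrates that general idea only; the function name, scoring formula, and toy data are illustrative assumptions, not the authors' algorithm.

```python
import math
from collections import Counter

def stopword_candidates(samples, top_n=5):
    """Rank terms as stop word candidates: high overall frequency
    combined with an even spread across subsamples (high normalized
    entropy of the term's per-sample occurrence distribution).
    Illustrative sketch only, not the paper's actual method."""
    counts = [Counter(doc) for doc in samples]
    total = Counter()
    for c in counts:
        total.update(c)

    scores = {}
    for term, freq in total.items():
        # Fraction of this term's occurrences falling in each subsample.
        dist = [c[term] / freq for c in counts if c[term] > 0]
        if len(counts) > 1:
            # Normalized entropy: 1.0 means a perfectly even spread.
            h = -sum(p * math.log(p) for p in dist) / math.log(len(counts))
        else:
            h = 1.0
        # Stop-word-like terms are both frequent and evenly spread.
        scores[term] = freq * h
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy subsamples standing in for slices of a larger corpus.
samples = [
    "the cat sat on the mat".split(),
    "the dog and the cat played".split(),
    "a dog chased the ball near the gate".split(),
]
print(stopword_candidates(samples, top_n=3))  # 'the' ranks first
```

Terms appearing in only one subsample get zero normalized entropy and so score zero, which captures the intuition that content words cluster in particular documents while function words appear everywhere.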