Towards Modernised and Web-Specific Stoplists for Web Document Analysis

Authors:
Mark P. Sinka;David W. Corne
Affiliations:
-;-
Venue:
WI '03 Proceedings of the 2003 IEEE/WIC International Conference on Web Intelligence
Year:
2003

Citing 0
Cited 3

Evolving better stoplists for document clustering and web intelligence
Design and application of hybrid intelligent systems

A lattice-based approach to query-by-example spoken document retrieval
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval

Automatic extraction of domain-specific stopwords from labeled documents
ECIR'08 Proceedings of the IR research, 30th European conference on Advances in information retrieval

Abstract

Research areas such as text classification and document clustering underpin many issues in web intelligence. A fundamental tool in document clustering is a list of 'stop' words (a stoplist), which identifies frequent words that are unlikely to assist in classification and are hence removed during pre-processing. Current stoplists are both outdated in light of fluctuations in word usage and innocent of 'web-specific' stop words, which calls into question their applicability to web-based tasks. We explore this by developing new word-entropy-based stoplists: one derived from random web pages, and one derived from the BankSearch dataset. We evaluate these against other stoplists using accuracies obtained from unsupervised clustering experiments. We find that existing stoplists perform well, but are sometimes outperformed by our new stoplists, especially on hard classification tasks.
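
The abstract does not give the entropy formula the authors used, so the following is only an illustrative sketch of one common way to build a word-entropy-based stoplist, not the paper's method. It assumes that a word's entropy is the Shannon entropy of its occurrence distribution across documents, that documents are tokenised by lowercasing and splitting on whitespace, and that the stoplist is simply the highest-entropy words. The function names `word_entropies` and `entropy_stoplist` are hypothetical.

```python
import math
from collections import Counter


def word_entropies(documents):
    """Map each word to the Shannon entropy (in bits) of its distribution over documents.

    Assumption: p(d | w) = count of w in document d / total count of w in the corpus.
    A word spread evenly over many documents gets high entropy; a word confined
    to a single document gets entropy 0.
    """
    # Naive tokenisation (an assumption): lowercase, split on whitespace.
    per_doc_counts = [Counter(doc.lower().split()) for doc in documents]

    totals = Counter()
    for counts in per_doc_counts:
        totals.update(counts)

    entropies = {}
    for word, total in totals.items():
        h = 0.0
        for counts in per_doc_counts:
            c = counts.get(word, 0)
            if c:
                p = c / total
                h -= p * math.log2(p)
        entropies[word] = h
    return entropies


def entropy_stoplist(documents, size):
    """Return the `size` highest-entropy words, ties broken alphabetically."""
    entropies = word_entropies(documents)
    ranked = sorted(entropies, key=lambda w: (-entropies[w], w))
    return ranked[:size]


if __name__ == "__main__":
    docs = [
        "the cat sat on the mat",
        "the dog ate the bone",
        "the bank closed the loan",
    ]
    print(entropy_stoplist(docs, 1))
```

Under these assumptions, a word such as "the" that occurs equally often in every document receives the maximum possible entropy, while topic-bearing words that occur in only one document receive zero and stay out of the stoplist.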