Evolving better stoplists for document clustering and web intelligence

Authors:
Mark P. Sinka;David W. Corne
Affiliations:
Department of Computer Science, University of Reading, Reading, RG6 6AY, UK;Department of Computer Science, University of Exeter, UK and Department of Computer Science, University of Reading, Reading, RG6 6AY, UK
Venue:
Design and application of hybrid intelligent systems
Year:
2003

Abstract

Text classification, document clustering and similar document analysis tasks are currently the subject of significant global research, since they underpin web intelligence, web mining, search engine design, and so forth. A fundamental tool in such tasks is a list of so-called 'stop' words, known as a 'stoplist'. A stoplist is a collection of 'noise' words that tend to appear frequently in documents but are believed to carry no usable information that would aid learning tasks; the words in the stoplist are therefore removed from the documents concerned before processing begins. It is well known that the results of document classification experiments (for example) are invariably considerably improved when a stoplist is employed. Stoplists in regular use are, however, rather outdated. We have explored this claim in recent work, which produced new stoplists based on word entropy over modern collections of documents. In this work we introduce the notion of optimising a stoplist, and use stochastic search in conjunction with k-means clustering to converge on stoplists that lead to better performance on certain tasks than any previously published stoplist.
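
The idea of optimising a stoplist with stochastic search and k-means can be illustrated with a minimal sketch. The abstract does not specify the search operator, the fitness measure, or the document representation, so everything below is an assumption made for illustration: a simple hill climber that toggles one word at a time, raw term-count vectors compared by cosine similarity, and cluster purity against known labels as the fitness. The function names (`vectorise`, `kmeans`, `purity`, `fitness`, `evolve_stoplist`) are hypothetical and are not taken from the paper.

```python
import math
import random
from collections import Counter

def vectorise(docs, stoplist):
    """Term-count vectors with stoplist words removed (assumed representation)."""
    return [Counter(w for w in d.lower().split() if w not in stoplist) for d in docs]

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def kmeans(vectors, k, rng, iters=10):
    """Plain k-means with cosine similarity; returns a cluster index per document."""
    centroids = [Counter(v) for v in rng.sample(vectors, k)]
    assign = [0] * len(vectors)
    for _ in range(iters):
        assign = [max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                total = Counter()
                for m in members:
                    total.update(m)  # summed counts; cosine ignores the scale
                centroids[c] = total
    return assign

def purity(assign, labels, k):
    """Fraction of documents that carry the majority label of their cluster."""
    hits = 0
    for c in range(k):
        counts = Counter(l for a, l in zip(assign, labels) if a == c)
        if counts:
            hits += counts.most_common(1)[0][1]
    return hits / len(labels)

def fitness(docs, labels, stoplist, k, seeds=(0, 1, 2)):
    """Mean purity over a few fixed k-means seeds (assumed fitness measure)."""
    vectors = vectorise(docs, stoplist)
    scores = [purity(kmeans(vectors, k, random.Random(s)), labels, k) for s in seeds]
    return sum(scores) / len(scores)

def evolve_stoplist(docs, labels, k, steps=200, seed=0):
    """Hill climber: toggle one random word in or out, keep if not worse."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d.lower().split()})
    best, best_fit = set(), fitness(docs, labels, set(), k)
    for _ in range(steps):
        cand = set(best)
        cand.symmetric_difference_update({rng.choice(vocab)})
        cand_fit = fitness(docs, labels, cand, k)
        if cand_fit >= best_fit:
            best, best_fit = cand, cand_fit
    return best, best_fit
```

With a labelled collection, `evolve_stoplist(docs, labels, k)` returns a candidate stoplist together with its fitness. Because the fitness uses fixed seeds, the returned score can never fall below that of the empty stoplist. A real experiment would need a much larger corpus, a held-out evaluation set, and probably a population-based search rather than this single-candidate climber.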