Evolving better stoplists for document clustering and web intelligence

  • Authors:
  • Mark P. Sinka; David W. Corne

  • Affiliations:
  • Department of Computer Science, University of Reading, Reading, RG6 6AY, UK; Department of Computer Science, University of Exeter, UK and Department of Computer Science, University of Reading, Reading, RG6 6AY, UK

  • Venue:
  • Design and application of hybrid intelligent systems
  • Year:
  • 2003

Abstract

Text classification, document clustering and similar document analysis tasks are currently the subject of significant global research, since such areas underpin web intelligence, web mining, search engine design, and so forth. A fundamental tool in such document analysis tasks is a list of so-called 'stop' words, called a 'stoplist'. A stoplist is a specific collection of so-called 'noise' words, which tend to appear frequently in documents but are believed to carry no usable information that would aid learning tasks; the words in the stoplist are therefore removed from the documents concerned before processing begins. It is well known that the results of document classification experiments (for example) are invariably and considerably improved when a stoplist is employed. Current stoplists in regular use are, however, rather outdated. We have explored this claim in recent work that produced new stoplists based on word-entropy over modern collections of documents. In this work we introduce the notion of optimising a stoplist, and use stochastic search in conjunction with k-means clustering to converge on stoplists that lead to better performance on certain tasks than any previously published.
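
As an illustration of the word-entropy idea mentioned in the abstract, the following is a minimal Python sketch and not the authors' implementation. It assumes one common formulation, in which a word's entropy is computed over its distribution across documents, so that words spread evenly over the collection score highest and become stoplist candidates. The function names (`word_entropy`, `build_stoplist`, `remove_stopwords`), the whitespace tokenisation, and the toy documents are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def word_entropy(docs):
    """Return {word: entropy of the word's distribution over documents}.

    Assumed formulation: p(d | w) = count of w in d / count of w overall,
    H(w) = -sum_d p(d | w) * log2 p(d | w).
    """
    per_doc = [Counter(doc.lower().split()) for doc in docs]
    totals = Counter()
    for counts in per_doc:
        totals.update(counts)

    entropy = {}
    for word, total in totals.items():
        h = 0.0
        for counts in per_doc:
            c = counts.get(word, 0)
            if c:
                p = c / total
                h -= p * math.log2(p)
        entropy[word] = h
    return entropy


def build_stoplist(docs, size):
    """Pick the `size` highest-entropy words (ties broken alphabetically)."""
    ent = word_entropy(docs)
    ranked = sorted(ent, key=lambda w: (-ent[w], w))
    return set(ranked[:size])


def remove_stopwords(doc, stoplist):
    """Tokenise on whitespace and drop every token found in the stoplist."""
    return [t for t in doc.lower().split() if t not in stoplist]


# Toy collection, invented for this example.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market fell on the news",
]
stoplist = build_stoplist(docs, 1)
filtered = remove_stopwords(docs[0], stoplist)
```

In this toy collection, "the" occurs equally often in every document and so receives the highest entropy, while words confined to a single document score zero. The optimisation step described in the abstract would go further, perturbing such a list stochastically and keeping changes that improve k-means clustering quality; that step is not shown here.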