Text classification, document clustering and similar document analysis tasks are currently the subject of significant global research, since they underpin web intelligence, web mining, search engine design, and so forth. A fundamental tool in such tasks is a list of so-called 'stop' words, known as a 'stoplist'. A stoplist is a collection of 'noise' words that tend to appear frequently in documents but are believed to carry no usable information that would aid learning tasks; the words in the stoplist are therefore removed from the documents concerned before processing begins. It is well known that the results of document classification experiments (for example) are invariably considerably improved when a stoplist is employed. Current stoplists in regular use are, however, rather outdated. We have explored this claim in recent work which produced new stoplists based on word-entropy over modern collections of documents. In this work we introduce the notion of optimising a stoplist, and use stochastic search in conjunction with k-means clustering to converge on stoplists which lead to better performance on certain tasks than any previously published.
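To make the idea of a word-entropy stoplist concrete, the following is a minimal sketch in Python. It assumes one common definition of word entropy (the Shannon entropy of a word's occurrence distribution across documents); the abstract does not state the exact formula used, and the function names `word_entropy` and `candidate_stoplist` are hypothetical, chosen only for illustration. Words spread evenly over many documents score high and are treated as stoplist candidates.

```python
import math
from collections import Counter


def word_entropy(docs):
    """Map each word to the entropy of its distribution across docs.

    Assumed definition (not taken from the source):
    p_i = count of the word in doc i / count of the word in all docs,
    H = -sum(p_i * log2(p_i)).
    """
    # Naive tokenisation: lowercase and split on whitespace.
    counts = [Counter(doc.lower().split()) for doc in docs]
    totals = Counter()
    for c in counts:
        totals.update(c)

    entropy = {}
    for word, total in totals.items():
        h = 0.0
        for c in counts:
            if word in c:
                p = c[word] / total
                h -= p * math.log2(p)
        entropy[word] = h
    return entropy


def candidate_stoplist(docs, size):
    """Return the `size` highest-entropy words (ties broken alphabetically)."""
    ent = word_entropy(docs)
    return sorted(ent, key=lambda w: (-ent[w], w))[:size]


docs = ["the cat sat", "the dog ran", "the bird flew"]
print(candidate_stoplist(docs, 1))
```

In this toy collection, "the" occurs once in every document and so has the highest entropy, while each remaining word occurs in a single document and has entropy zero. The optimisation step described in the abstract (stochastic search guided by k-means clustering) is not covered by this sketch.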