The BankSearch web document dataset: investigating unsupervised clustering and category similarity

  • Authors: Mark P. Sinka; David W. Corne
  • Affiliations: Department of Computer Science, University of Reading, P.O. Box 225, Whiteknights, Reading RG6 6AY, UK; Department of Computer Science, Harrison Building, University of Exeter, Exeter EX4 4QF, UK
  • Venue: Journal of Network and Computer Applications - Special issue on computational intelligence on the internet
  • Year: 2005

Abstract

Targeting useful and relevant information on the internet is a highly complicated research area, which is served in part by research into document clustering. A foundational aspect of such research (proven repeatedly in other research disciplines) is the use of standard datasets, against which different techniques can be properly benchmarked and assessed. We argue that, so far in this broad area of research, as many datasets have been used as research papers have been written, which prevents confident reasoning about the relative performance of techniques reported in different publications. We describe a solution to this problem: the compilation of the BankSearch dataset, a proposed standard dataset suitable for a wide range of web-intelligence-related research activities. At the time of writing, this dataset has already become a popular download in the Statlib archive, and is in use for benchmarking a variety of document processing and web search techniques. Herein we also use the dataset in experiments to investigate certain issues in unsupervised web document clustering. Our main interest is how unsupervised clustering performance varies with the relative 'distance' between the categories inherent in the data, and how this is affected by the use of stemming and stoplists. These issues relate to, among other things, the design of useful search engines. We use simple K-means clustering, and find, unsurprisingly, that performance improves as categories become more distant. However, we also find that very close categories can be distinguished with fair accuracy, and there are interesting results concerning the use of stemming. Stop-word removal is confirmed as universally helpful, but stemming is not always to be recommended on 'distant' categories.
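The kind of experiment the abstract describes (K-means over web documents, with stop-word removal and stemming switched on or off) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' setup: the six toy documents stand in for BankSearch pages, `STOPLIST` is a tiny hypothetical stoplist, `crude_stem` is a naive suffix stripper standing in for a real stemmer, and centroids are seeded deterministically from the first k documents.

```python
import math
from collections import Counter, defaultdict

# Tiny hypothetical stoplist; real stoplists contain hundreds of words.
STOPLIST = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with"}

# Toy documents from two made-up categories (finance, sport), interleaved.
DOCS = [
    "the bank offers loans and mortgage rates",
    "football teams playing in the league final",
    "mortgage rates and loans for banking customers",
    "the league final with football players",
    "banking loans with low interest rates",
    "players and teams in the football league",
]

def crude_stem(word):
    """Naive suffix stripping; a stand-in for a real stemmer, not Porter's algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def vectorize(text, use_stoplist=True, use_stemming=False):
    """Turn a document into a unit-length sparse term-frequency vector."""
    tokens = text.lower().split()
    if use_stoplist:
        tokens = [t for t in tokens if t not in STOPLIST]
    if use_stemming:
        tokens = [crude_stem(t) for t in tokens]
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {term: c / norm for term, c in counts.items()}

def sq_dist(u, v):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in set(u) | set(v))

def mean(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}

def kmeans(vectors, k, max_iter=20):
    """Plain K-means; centroids start at the first k vectors (an assumption)."""
    centroids = [dict(v) for v in vectors[:k]]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean(members)
    return labels

def cluster(docs, k, use_stoplist=True, use_stemming=False):
    """Vectorize the documents with the chosen preprocessing, then cluster."""
    return kmeans([vectorize(d, use_stoplist, use_stemming) for d in docs], k)

if __name__ == "__main__":
    for stem in (False, True):
        print("stemming" if stem else "no stemming", cluster(DOCS, 2, use_stemming=stem))
```

Toggling `use_stoplist` and `use_stemming` reproduces, in miniature, the kind of comparison the paper makes. The toy data says nothing about the paper's actual findings; evaluating the effect of category 'distance' would require the real dataset and its category labels.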