The BankSearch web document dataset: investigating unsupervised clustering and category similarity

Authors:
Mark P. Sinka; David W. Corne
Affiliations:
Department of Computer Science, University of Reading, P.O. Box 225, Whiteknights, Reading RG6 6AY, UK; Department of Computer Science, Harrison Building, University of Exeter, Exeter EX4 4QF, UK
Venue:
Journal of Network and Computer Applications - Special issue on computational intelligence on the internet
Year:
2005

Abstract

Targeting useful and relevant information on the internet is a highly complicated research area, which is served in part by research into document clustering. A foundational aspect of such research (proven over and over again in other research disciplines) is the use of standard datasets, against which different techniques can be properly benchmarked and assessed. We argue that, so far in this broad area of research, as many datasets have been used as research papers written, thus preventing confident reasoning about the relative performance of different techniques used in different publications. We describe a solution to this problem with the compilation of the BankSearch dataset, a proposed standard dataset suitable for a wide range of web-intelligence related research activities. At the time of writing, this dataset has already become a popular download in the Statlib archive, and is in use for benchmarking of a variety of document processing and web search techniques. Herein we also use the dataset in experiments to investigate certain issues in unsupervised web document clustering. Our main interest is how unsupervised clustering performance varies with the relative 'distance' between the categories inherent in the data, and how this is affected by the use of stemming and stoplists. These issues relate to, among other things, the design of useful search engines. We use simple K-means clustering, and find, unsurprisingly, that performance improves as categories become more distant. However, we also find that very close categories can be distinguished with fair accuracy, and there are interesting results concerning the use of stemming. Stop-word removal is confirmed as universally helpful, but stemming is not always to be recommended on 'distant' categories.
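
The experimental setup described in the abstract (K-means over document vectors, with stop-word removal and stemming as switchable preprocessing steps) can be illustrated with a minimal sketch. This is not the authors' implementation: the four toy documents, the tiny `STOPLIST`, the `crude_stem` suffix stripper (a stand-in for a real stemmer such as Porter's), and the fixed seed indices are all hypothetical choices made so that the example is small and deterministic.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stoplist for illustration only (not the one used in the paper).
STOPLIST = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with"}


def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer (assumption)."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text, use_stoplist=True, use_stemming=False):
    """Lowercase, split on non-letters, then optionally filter and stem."""
    words = re.findall(r"[a-z]+", text.lower())
    if use_stoplist:
        words = [w for w in words if w not in STOPLIST]
    if use_stemming:
        words = [crude_stem(w) for w in words]
    return words


def vectorize(docs, **opts):
    """Build unit-length term-frequency vectors over a shared vocabulary."""
    token_lists = [tokenize(d, **opts) for d in docs]
    vocab = sorted({w for toks in token_lists for w in toks})
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        vec = [float(counts[w]) for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def kmeans(vectors, k, init_indices, max_iter=100):
    """Plain K-means; seeds are given by document index to keep runs deterministic."""
    centroids = [list(vectors[i]) for i in init_indices]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each document goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy documents from two 'distant' categories (banking vs. sport); made up for this sketch.
docs = [
    "The bank offers a mortgage loan with a fixed interest rate",
    "Interest rates on savings accounts and loans at the bank",
    "The football team scored a goal in the final match",
    "A late goal won the match for the home team",
]

labels_plain = kmeans(vectorize(docs), k=2, init_indices=[0, 2])
labels_stemmed = kmeans(vectorize(docs, use_stemming=True), k=2, init_indices=[0, 2])
print(labels_plain)    # [0, 0, 1, 1]
print(labels_stemmed)  # [0, 0, 1, 1]
```

On this toy input both settings recover the two categories, so the sketch only shows where the stoplist and stemming switches enter the pipeline; it says nothing about the accuracy differences reported in the abstract, which would need the BankSearch data and a proper stemmer to reproduce.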