An evaluation of phrasal and clustered representations on a text categorization task
SIGIR '92 Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval
Automated learning of decision rules for text categorization
ACM Transactions on Information Systems (TOIS)
Little words can make a big difference for text classification
SIGIR '95 Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval
Web document clustering: a feasibility demonstration
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Partitioning-based clustering for Web document categorization
Decision Support Systems - Special issue on WITS '97
Document Categorization and Query Generation on the World Wide WebUsing WebACE
Artificial Intelligence Review - Special issue on data mining on the Internet
Document clustering using word clusters via the information bottleneck method
SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
A vector space model for automatic indexing
Communications of the ACM
Information Retrieval
A Comparative Study on Feature Selection in Text Categorization
ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
Automatic Web-Page Classification by Using Machine Learning Methods
WI '01 Proceedings of the First Asia-Pacific Conference on Web Intelligence: Research and Development
Machine learning in automated text categorisation
Machine learning in automated text categorisation
Designing evolving user profile in e-CRM with dynamic clustering of Web documents
Data & Knowledge Engineering
Pairwise-adaptive dissimilarity measure for document clustering
Information Sciences: an International Journal
Research of fast SOM clustering for text information
Expert Systems with Applications: An International Journal
Fast growing self organizing map for text clustering
ICONIP'11 Proceedings of the 18th international conference on Neural Information Processing - Volume Part II
Hi-index | 0.00 |
Targeting useful and relevant information on the internet is a highly complicated research area, which is served in part by research into document clustering. A foundational aspect of such research (proven over and over again in other research disciplines) is the use of standard datasets, against which different techniques can be properly benchmarked and assessed. We argue that, so far in this broad area of research, as many datasets have been used as research papers written, thus preventing confident reasoning about the relative performance of different techniques used in different publications. We describe a solution to this problem with the compilation of the BankSearch dataset, a proposed standard dataset suitable for a wide range of web-intelligence related research activities. At the time of writing, this dataset has already become a popular download in the Statlib archive, and is in use for benchmarking of a variety of document processing and web search techniques. Herein we also use the dataset in experiments to investigate certain issues in unsupervised web document clustering. Our main interest is how unsupervised clustering performance varies with the relative 'distance' between the categories inherent in the data, and how this is affected by the use of stemming and stoplists. These issues relate to, among other things, the design of useful search engines. We use simple K-means clustering, and find, unsurprisingly, that performance improves as categories become more distant. However, we also find that very close categories can be distinguished with fair accuracy, and there are interesting results concerning the use of stemming. Stop-word removal is confirmed as universally helpful, but stemming is not always to be recommended on 'distant' categories.