Scalable document classification

Authors:
Jae-Moon Lee;Rafael A. Calvo
Affiliations:
School of Information and Computer Engineering, Hansung University, Korea and Web Engineering Group, School of Electrical and Information Engineering, University of Sydney, Australia. E-mail: rafa ...;Web Engineering Group, School of Electrical and Information Engineering, University of Sydney, Australia. E-mail: rafa@ee.usyd.edu.au
Venue:
Intelligent Data Analysis
Year:
2005

Abstract

This paper describes the design and implementation of new naive Bayes and k-Nearest Neighbour methods that are highly scalable and efficient for document classification. Three methods for improving scalability are analysed: a change in the data representation, and therefore in the algorithms' implementation; a partitioning mechanism that breaks the problem down into smaller parts; and a buffering mechanism that improves memory efficiency for large datasets. The classifiers were tested on two Reuters datasets: ModApte, a popular but small benchmark, and RCV1, a new large collection of news stories. They were compared to more standard implementations of these methods, both experimentally and analytically.
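
The abstract does not give implementation details, so the following is only a minimal sketch of two of the ideas it names: a sparse data representation and buffered processing. It is not the authors' implementation. The class `SparseNaiveBayes`, the helper `batches`, and the smoothing parameter `alpha` are hypothetical names chosen for illustration. The sketch assumes a multinomial naive Bayes model with Laplace smoothing and documents supplied as lists of tokens.

```python
import math
from collections import Counter, defaultdict


class SparseNaiveBayes:
    """Multinomial naive Bayes that stores only non-zero term counts.

    Illustrative sketch only; not the method described in the paper.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # Laplace smoothing constant (assumed)
        self.doc_counts = Counter()               # class -> number of documents seen
        self.term_counts = defaultdict(Counter)   # class -> term -> count (sparse)
        self.total_terms = Counter()              # class -> total term occurrences
        self.vocab = set()                        # all terms seen so far

    def partial_fit(self, batch):
        """Update the counts from one buffer of (tokens, label) pairs."""
        for tokens, label in batch:
            self.doc_counts[label] += 1
            for term, n in Counter(tokens).items():
                self.term_counts[label][term] += n
                self.total_terms[label] += n
                self.vocab.add(term)

    def predict(self, tokens):
        """Return the class with the highest log-posterior score."""
        n_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        term_freqs = Counter(tokens)
        best_label, best_score = None, -math.inf
        for label, n in self.doc_counts.items():
            score = math.log(n / n_docs)  # log prior
            denom = self.total_terms[label] + self.alpha * vocab_size
            for term, k in term_freqs.items():
                # Counter returns 0 for unseen terms without storing them.
                likelihood = (self.term_counts[label][term] + self.alpha) / denom
                score += k * math.log(likelihood)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def batches(stream, size):
    """Yield lists of at most `size` items, holding one buffer in memory at a time."""
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:
        yield buf
```

Because the model consists only of additive counts, training can consume a document stream one buffer at a time, for example `for b in batches(stream, 1000): clf.partial_fit(b)`, without loading the whole collection into memory.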