Scalable document classification

  • Authors:
  • Jae-Moon Lee;Rafael A. Calvo

  • Affiliations:
  • School of Information and Computer Engineering, Hansung University, Korea and Web Engineering Group, School of Electrical and Information Engineering, University of Sydney, Australia. E-mail: rafa ...;Web Engineering Group, School of Electrical and Information Engineering, University of Sydney, Australia. E-mail: rafa@ee.usyd.edu.au

  • Venue:
  • Intelligent Data Analysis
  • Year:
  • 2005

Abstract

This paper describes the design and implementation of new naive Bayes and k-Nearest Neighbour methods that are highly scalable and efficient for document classification. Three methods for improving scalability are analysed: a change in the data representation, and therefore in the algorithms' implementation; a partitioning mechanism that breaks the problem down into smaller parts; and a buffering mechanism that improves memory efficiency for large datasets. The classifiers were tested on two Reuters datasets: ModApte, a popular but small benchmark, and RCV1, a new large collection of news stories. They were compared with more standard implementations of these methods, both experimentally and analytically.
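
The abstract does not give implementation details, so the following is only a minimal sketch of how a partitioned k-Nearest Neighbour scan might look, not the authors' method. It assumes documents are stored as sparse term-weight dictionaries and that the training set is scanned one fixed-size chunk at a time while a heap keeps only the k best matches seen so far. The names `cosine`, `partitions`, `knn_classify`, and the `partition_size` parameter are hypothetical and chosen for illustration.

```python
import heapq
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def partitions(docs, size):
    """Yield successive chunks of the training set (assumed partitioning scheme)."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def knn_classify(query, train, k=3, partition_size=2):
    """Label `query` by majority vote of its k most similar training documents,
    scanning the training set one partition at a time."""
    top = []  # min-heap of (similarity, index, label); never holds more than k items
    index = 0
    for chunk in partitions(train, partition_size):
        for vector, label in chunk:
            item = (cosine(query, vector), index, label)
            index += 1
            if len(top) < k:
                heapq.heappush(top, item)
            elif item > top[0]:
                # Replace the weakest of the current k candidates.
                heapq.heapreplace(top, item)
    votes = Counter(label for _, _, label in top)
    return votes.most_common(1)[0][0]
```

Because only the current chunk and the k candidates need to be held at once, the same loop could in principle read each partition from disk rather than from an in-memory list. That is an assumption about how partitioning and buffering might combine, not a claim about the paper's design.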