This paper describes the design and implementation of new naive Bayes and k-Nearest Neighbour methods that are highly scalable and efficient for document classification. Three methods for improving scalability are analysed: a change in the data representation, and therefore in the algorithms' implementation; a partitioning mechanism that breaks the problem down into smaller parts; and a buffering mechanism that improves memory efficiency for large datasets. The classifiers were tested on two Reuters datasets: ModApte, a popular but small benchmark, and RCV1, a new large collection of news stories. They were compared with more standard implementations of these methods, both experimentally and analytically.
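To make the ideas of a sparse data representation and buffered processing concrete, the following is a minimal sketch and not the paper's implementation. It assumes a multinomial naive Bayes model with add-one smoothing, stores only the terms that actually occur (per-class counters), and trains on fixed-size batches so that only one buffer of documents is held in memory at a time. The names `batches`, `SparseNaiveBayes`, and `partial_fit`, as well as the toy documents, are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter, defaultdict
from itertools import islice


def batches(docs, size):
    """Yield lists of at most `size` items, so only one buffer is in memory."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


class SparseNaiveBayes:
    """Multinomial naive Bayes over sparse term counts (illustrative only)."""

    def __init__(self):
        self.class_docs = Counter()              # documents seen per class
        self.class_tokens = Counter()            # total tokens per class
        self.term_counts = defaultdict(Counter)  # class -> term -> count
        self.vocab = set()

    def partial_fit(self, batch):
        """Update the counts from one buffer of (tokens, label) pairs."""
        for tokens, label in batch:
            self.class_docs[label] += 1
            for term, n in Counter(tokens).items():
                self.term_counts[label][term] += n
                self.class_tokens[label] += n
                self.vocab.add(term)

    def predict(self, tokens):
        """Return the class with the highest smoothed log-probability."""
        total_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        doc_terms = Counter(tokens)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            score = math.log(n_docs / total_docs)
            denom = self.class_tokens[label] + vocab_size
            for term, n in doc_terms.items():
                if term in self.vocab:  # terms never seen in training are skipped
                    count = self.term_counts[label][term]
                    score += n * math.log((count + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data, invented for this sketch (not drawn from ModApte or RCV1).
train = [
    ("wheat corn harvest".split(), "grain"),
    ("oil crude barrel".split(), "crude"),
    ("corn wheat export".split(), "grain"),
    ("crude oil price".split(), "crude"),
]

model = SparseNaiveBayes()
for batch in batches(train, size=2):
    model.partial_fit(batch)

print(model.predict("wheat harvest".split()))
```

Because training only accumulates counts, the batches could in principle come from a file or database cursor instead of an in-memory list; how the paper's partitioning and buffering mechanisms are actually realised is not stated in the abstract.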