A scalability analysis of classifiers in text categorization

Authors:
Yiming Yang;Jian Zhang;Bryan Kisiel
Affiliations:
Carnegie Mellon University, Pittsburgh, PA;Carnegie Mellon University, Pittsburgh, PA;Carnegie Mellon University, Pittsburgh, PA
Venue:
Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2003



Abstract

Real-world applications of text categorization often require a system to handle tens of thousands of categories defined over a large taxonomy. This paper addresses that scalability problem for a set of popular text categorization algorithms: Support Vector Machines (SVM), k-nearest neighbor (kNN), ridge regression, linear least squares fit, and logistic regression. By providing a formal analysis of the computational complexity of each classification method, followed by an investigation of how different classifiers are used in a hierarchical categorization setting, we show how the scalability of a method depends on the topology of the hierarchy and the category distributions. In addition, we obtain tight bounds on the complexities by using a power law to approximate category distributions over a hierarchy. Experiments with kNN and SVM classifiers on the OHSUMED corpus are reported as concrete examples.
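
As a rough illustration of why the hierarchy's topology and the category distribution matter for scalability, the following Python sketch compares a toy training-cost model for flat and hierarchical classification. It is not the paper's analysis. The cost model (training one binary classifier on n documents costs n**c), the balanced tree with evenly split documents, and the names `flat_cost`, `hierarchical_cost`, and `power_law_sizes` are all assumptions made for illustration.

```python
def flat_cost(n_docs, n_categories, c=1.5):
    """Toy cost of flat one-vs-rest training.

    Assumption: each of the n_categories binary classifiers is trained
    on all n_docs documents, and one training run costs n_docs ** c.
    """
    return n_categories * n_docs ** c


def hierarchical_cost(n_docs, branching, depth, c=1.5):
    """Toy cost of training local classifiers in a balanced tree.

    Assumption: documents are split evenly among the nodes of each level,
    and every internal node trains one binary classifier per child using
    only the documents that reach that node.
    """
    total = 0.0
    for level in range(depth):
        nodes = branching ** level          # internal nodes at this level
        docs_per_node = n_docs / nodes      # even split (assumed)
        total += nodes * branching * docs_per_node ** c
    return total


def power_law_sizes(n_docs, n_categories, alpha=1.0):
    """Category sizes proportional to rank ** -alpha, summing to n_docs.

    A hypothetical stand-in for a skewed category distribution; the
    exponent alpha is an assumed parameter, not a value from the paper.
    """
    weights = [rank ** -alpha for rank in range(1, n_categories + 1)]
    scale = n_docs / sum(weights)
    return [w * scale for w in weights]


if __name__ == "__main__":
    n_docs, branching, depth = 100_000, 10, 4
    n_categories = branching ** depth       # leaf categories in the tree
    print("flat:        ", flat_cost(n_docs, n_categories))
    print("hierarchical:", hierarchical_cost(n_docs, branching, depth))
    sizes = power_law_sizes(n_docs, n_categories)
    print("largest / smallest category:", sizes[0] / sizes[-1])
```

Under these assumptions, a superlinear training cost (c > 1) makes the top levels of the tree dominate the hierarchical total, while the flat scheme pays the full-corpus cost once per category. A skewed distribution such as the one produced by `power_law_sizes` would replace the even split in a more faithful model.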