Fast logistic regression for text categorization with variable-length n-grams

Authors:
Georgiana Ifrim;Gökhan Bakir;Gerhard Weikum
Affiliations:
Max-Planck Institute for Informatics, Saarbrücken, Germany;Google Switzerland GmbH, Zürich, Switzerland;Max-Planck Institute for Informatics, Saarbrücken, Germany
Venue:
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2008

Abstract

A common representation used in text categorization is the bag-of-words model (a.k.a. the unigram model). Learning with this representation typically involves some preprocessing, e.g. stopword removal and stemming, which results in one explicit tokenization of the corpus. In this work, we introduce a logistic regression approach in which learning involves automatic tokenization. This allows us to weaken the a priori knowledge required about the corpus and results in a tokenization with variable-length (word or character) n-grams as basic tokens. We accomplish this by solving logistic regression using gradient ascent in the space of all n-grams. We show that this can be done very efficiently using a branch-and-bound approach that chooses the maximum gradient ascent direction projected onto a single dimension (i.e., candidate feature). Although the space is very large, our method allows us to investigate variable-length n-gram learning. We demonstrate the efficiency of our approach compared to state-of-the-art classifiers used for text categorization, such as cyclic coordinate descent logistic regression and support vector machines.
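
The idea in the abstract, picking at each step the single n-gram with the largest gradient magnitude and pruning the search with a bound, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes binary presence features over word n-grams, labels in {0, 1}, and a fixed step size where the paper may use something more careful. The pruning bound is one derived here from the fact that any extension of an n-gram occurs in a subset of the documents containing that n-gram; it is not necessarily the bound used in the paper. All function names (`best_ngram`, `train`, `predict`) are hypothetical.

```python
import math
from collections import defaultdict

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def contains(doc, gram):
    n = len(gram)
    return any(tuple(doc[i:i + n]) == gram for i in range(len(doc) - n + 1))

def best_ngram(docs, labels, probs, max_len):
    """Depth-first search for the n-gram with the largest |gradient|."""
    resid = [y - p for y, p in zip(labels, probs)]  # per-document gradient term
    best = (0.0, None)                              # (gradient, n-gram)

    def expand(gram, positions):
        nonlocal best
        doc_ids = {d for d, _ in positions}         # binary presence (assumption)
        grad = sum(resid[d] for d in doc_ids)
        if abs(grad) > abs(best[0]):
            best = (grad, gram)
        # Any extension occurs in a subset of doc_ids, so its |gradient| cannot
        # exceed the larger of the positive-only and negative-only sums.
        pos = sum(resid[d] for d in doc_ids if labels[d] == 1)
        neg = -sum(resid[d] for d in doc_ids if labels[d] == 0)
        if max(pos, neg) <= abs(best[0]) or len(gram) >= max_len:
            return                                  # prune this branch
        extensions = defaultdict(list)
        for d, end in positions:
            if end + 1 < len(docs[d]):
                extensions[docs[d][end + 1]].append((d, end + 1))
        for tok, pos_list in extensions.items():
            expand(gram + (tok,), pos_list)

    unigrams = defaultdict(list)
    for d, doc in enumerate(docs):
        for i, tok in enumerate(doc):
            unigrams[tok].append((d, i))
    for tok, pos_list in unigrams.items():
        expand((tok,), pos_list)
    return best

def train(docs, labels, iters=20, step=1.0, max_len=3):
    """Greedy coordinate-wise gradient ascent on the logistic log-likelihood."""
    weights = {}
    scores = [0.0] * len(docs)
    for _ in range(iters):
        probs = [sigmoid(s) for s in scores]
        grad, gram = best_ngram(docs, labels, probs, max_len)
        if gram is None:
            break
        weights[gram] = weights.get(gram, 0.0) + step * grad  # fixed step (assumption)
        for d, doc in enumerate(docs):
            if contains(doc, gram):
                scores[d] += step * grad
    return weights

def predict(weights, doc):
    return sigmoid(sum(w for g, w in weights.items() if contains(doc, g)))

# Toy corpus in which no single word separates the classes but bigrams do.
docs = [["not", "good"], ["very", "good"], ["not", "bad"], ["very", "bad"]]
labels = [0, 1, 1, 0]
weights = train(docs, labels, iters=20, step=1.0, max_len=2)
```

On this toy corpus each unigram appears in one positive and one negative document, so its gradient partly cancels and the search settles on bigrams. A fixed step can overshoot for n-grams that occur in many documents, so a line search would be a natural replacement in anything beyond a toy setting.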