A common representation used in text categorization is the bag-of-words model (also known as the unigram model). Learning with this representation typically involves some preprocessing, e.g. stopword removal and stemming, which fixes one explicit tokenization of the corpus. In this work, we introduce a logistic regression approach in which learning tokenizes the corpus automatically. This weakens the a priori knowledge required about the corpus and yields a tokenization whose basic tokens are variable-length (word or character) n-grams. We accomplish this by solving logistic regression using gradient ascent in the space of all n-grams. We show that this can be done very efficiently using a branch-and-bound approach that selects the maximum gradient-ascent direction projected onto a single dimension (i.e., a candidate feature). Although the search space is very large, our method makes learning with variable-length n-grams tractable. We demonstrate the efficiency of our approach compared to state-of-the-art classifiers used for text categorization, such as cyclic coordinate-descent logistic regression and support vector machines.
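The core idea — scoring each candidate n-gram by the magnitude of its coordinate of the logistic-loss gradient, and pruning extensions of an n-gram whenever an upper bound on their score cannot beat the best found so far — can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes binary occurrence features over character n-grams, and the function and variable names (`best_ngram`, `margins`, `max_len`) are invented for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def best_ngram(docs, labels, margins, max_len=5):
    """Branch-and-bound search for the character n-gram whose (binary)
    occurrence feature has the largest absolute logistic-loss gradient.
    docs: list of strings; labels: +1/-1; margins: current y_i * f(x_i).
    Returns (ngram, gradient)."""
    # Each document's weight: large when it is misclassified or low-margin.
    w = [sigmoid(-m) for m in margins]
    best, best_val = None, 0.0
    # Seed the search with all single characters of the corpus.
    alphabet = sorted({c for d in docs for c in d})
    stack = list(alphabet)
    while stack:
        s = stack.pop()
        occ = [i for i, d in enumerate(docs) if s in d]
        if not occ:
            continue
        pos = sum(w[i] for i in occ if labels[i] > 0)
        neg = sum(w[i] for i in occ if labels[i] < 0)
        grad = pos - neg  # gradient of the loss w.r.t. this feature
        if abs(grad) > abs(best_val):
            best, best_val = s, grad
        # Any extension of s occurs in a subset of occ, so its gradient
        # magnitude is at most max(pos, neg): prune when that can't win.
        if max(pos, neg) > abs(best_val) and len(s) < max_len:
            stack.extend(s + c for c in alphabet)
    return best, best_val
```

The anti-monotone bound `max(pos, neg)` is what keeps the search over the exponentially large n-gram space feasible: once an n-gram's positive and negative occurrence mass are both below the incumbent's score, none of its extensions needs to be visited.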