Extracting key-substring-group features for text classification

Authors:
Dell Zhang;Wee Sun Lee
Affiliations:
University of London, London, UK;National University of Singapore, Singapore
Venue:
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2006

Citing 49
Cited 8

Text compression

Text compression
Scatter/Gather: a cluster-based approach to browsing large document collections

SIGIR '92 Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval
The nature of statistical learning theory

The nature of statistical learning theory
The power of amnesia: learning probabilistic automata with variable memory length

Machine Learning - Special issue on COLT '94
PAT-tree-based keyword extraction for Chinese information retrieval

Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval
Algorithms on strings, trees, and sequences: computer science and computational biology

Algorithms on strings, trees, and sequences: computer science and computational biology
Inductive learning algorithms and representations for text categorization

Proceedings of the seventh international conference on Information and knowledge management
A language modeling approach to information retrieval

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Automatic acquisition of phrasal knowledge for English-Chinese bilingual information retrieval

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Foundations of statistical natural language processing

Foundations of statistical natural language processing
A re-examination of text categorization methods

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Grouper: a dynamic clustering interface to Web search results

WWW '99 Proceedings of the eighth international conference on World Wide Web
An introduction to support Vector Machines: and other kernel-based learning methods

An introduction to support Vector Machines: and other kernel-based learning methods
BoosTexter: A Boosting-based Systemfor Text Categorization

Machine Learning - Special issue on information retrieval
IEPAD: information extraction based on pattern discovery

Proceedings of the 10th international conference on World Wide Web
Document language models, query models, and risk minimization for information retrieval

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
A statistical learning learning model of text classification for support vector machines

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Machine learning in automated text categorization

ACM Computing Surveys (CSUR)
AI Game Programming Wisdom

AI Game Programming Wisdom
Information Retrieval

Information Retrieval
Machine Learning

Machine Learning
Modern Information Retrieval

Modern Information Retrieval
Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition

Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition
Text genre classification with genre-revealing and subject-revealing features

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms

Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms
The Art of Computer Programming, 2nd Ed. (Addison-Wesley Series in Computer Science and Information

The Art of Computer Programming, 2nd Ed. (Addison-Wesley Series in Computer Science and Information
Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval

ECML '98 Proceedings of the 10th European Conference on Machine Learning
Text Categorization with Suport Vector Machines: Learning with Many Relevant Features

ECML '98 Proceedings of the 10th European Conference on Machine Learning
Discriminative Feature Selection via Multiclass Variable Memory Markov Model

ICML '02 Proceedings of the Nineteenth International Conference on Machine Learning
A Comparative Study on Feature Selection in Text Categorization

ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
The LCA Problem Revisited

LATIN '00 Proceedings of the 4th Latin American Symposium on Theoretical Informatics
On Machine Learning Methods for Chinese Document Categorization

Applied Intelligence
Text Mining: A New Frontier for Lossless Compression

DCC '99 Proceedings of the Conference on Data Compression
Text Categorization Using Compression Models

DCC '00 Proceedings of the Conference on Data Compression
Text classification using string kernels

The Journal of Machine Learning Research
Word sequence kernels

The Journal of Machine Learning Research
Augmenting Naive Bayes Classifiers with Statistical Language Models

Information Retrieval
Automatic text categorization in terms of genre and author

Computational Linguistics
MARSYAS: a framework for audio analysis

Organised Sound
Kernel Methods for Pattern Analysis

Kernel Methods for Pattern Analysis
Automatic detection of text genre

ACL '98 Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics
Word association norms, mutual information, and lexicography

ACL '89 Proceedings of the 27th annual meeting on Association for Computational Linguistics
An empirical study of smoothing techniques for language modeling

ACL '96 Proceedings of the 34th annual meeting on Association for Computational Linguistics
A study of smoothing methods for language models applied to information retrieval

ACM Transactions on Information Systems (TOIS)
Text categorization with many redundant features: using aggressive feature selection to make SVMs competitive with C4.5

ICML '04 Proceedings of the twenty-first international conference on Machine learning
Probability Product Kernels

The Journal of Machine Learning Research
Text classification with kernels on the multinomial manifold

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
On compression-based text classification

ECIR'05 Proceedings of the 27th European conference on Advances in Information Retrieval Research
A comparison of methods for multiclass support vector machines

IEEE Transactions on Neural Networks

Fast logistic regression for text categorization with variable-length n-grams

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Tensor Space Models for Authorship Identification

SETN '08 Proceedings of the 5th Hellenic conference on Artificial Intelligence: Theories, Models and Applications
A survey of modern authorship attribution methods

Journal of the American Society for Information Science and Technology
Author Identification Using a Tensor Space Representation

Proceedings of the 2008 conference on ECAI 2008: 18th European Conference on Artificial Intelligence
Multimodal subjectivity analysis of multiparty conversation

EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
A study of spam filtering using support vector machines

Artificial Intelligence Review
Feature subsumption for sentiment classification in multiple languages

PAKDD'10 Proceedings of the 14th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining - Volume Part II
Language independent semantic kernels for short-text classification

Expert Systems with Applications: An International Journal

Quantified Score

Hi-index	0.00

Visualization

Abstract

In many text classification applications, it is appealing to take every document as a string of characters rather than a bag of words. Previous research studies in this area mostly focused on different variants of generative Markov chain models. Although discriminative machine learning methods like Support Vector Machine (SVM) have been quite successful in text classification with word features, it is neither effective nor efficient to apply them straightforwardly taking all substrings in the corpus as features. In this paper, we propose to partition all substrings into statistical equivalence groups, and then pick those groups which are important (in the statistical sense) as features (named key-substring-group features) for text classification. In particular, we propose a suffix tree based algorithm that can extract such features in linear time (with respect to the total number of characters in the corpus). Our experiments on English, Chinese and Greek datasets show that SVM with key-substring-group features can achieve outstanding performance for various text classification tasks.