Distributional word clusters vs. words for text categorization

Authors:
Ron Bekkerman;Ran El-Yaniv;Naftali Tishby;Yoad Winter
Affiliations:
Department of Computer Science, Technion - Israel Institute of Technology, Haifa 32000, Israel;Department of Computer Science, Technion - Israel Institute of Technology, Haifa 32000, Israel;School of Computer Science and Engineering and Center for Neural Computation, The Hebrew University, Jerusalem 91904, Israel;Department of Computer Science, Technion - Israel Institute of Technology, Haifa 32000, Israel
Venue:
The Journal of Machine Learning Research
Year:
2003

Abstract

We study an approach to text categorization that combines distributional clustering of words and a Support Vector Machine (SVM) classifier. The word-cluster representation is computed using the recently introduced Information Bottleneck method, which generates a compact and efficient representation of documents. When combined with the classification power of the SVM, this method yields high performance in text categorization. This novel combination of SVM with word-cluster representation is compared with SVM-based categorization using the simpler bag-of-words (BOW) representation. The comparison is performed over three well-known datasets. On one of these datasets (20 Newsgroups), the method based on word clusters significantly outperforms the word-based representation in terms of categorization accuracy or representation efficiency. On the other two datasets (Reuters-21578 and WebKB), the word-based representation slightly outperforms the word-cluster representation. We investigate the potential reasons for this behavior and relate it to structural differences between the datasets.
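The idea behind a word-cluster representation can be illustrated with a small sketch. It is not the authors' algorithm: the clusters below are chosen by hand rather than found by an Information Bottleneck procedure, the word-category counts are toy data, and the SVM step is omitted. The function names (`mutual_information`, `cluster_joint`, `cluster_counts`) are hypothetical. The sketch only shows the quantity that such a clustering tries to preserve, namely the mutual information between the features and the categories, and how a document is mapped from word counts to cluster counts.

```python
from collections import Counter
from math import log2


def mutual_information(joint):
    """Return I(X;Y) in bits from a dict {(x, y): co-occurrence count}."""
    total = sum(joint.values())
    px, py = Counter(), Counter()
    for (x, y), n in joint.items():
        px[x] += n
        py[y] += n
    return sum(
        (n / total) * log2(n * total / (px[x] * py[y]))
        for (x, y), n in joint.items()
        if n > 0
    )


def cluster_joint(joint, word_to_cluster):
    """Aggregate word-category counts into cluster-category counts."""
    clustered = Counter()
    for (word, category), n in joint.items():
        clustered[(word_to_cluster[word], category)] += n
    return clustered


def cluster_counts(tokens, word_to_cluster):
    """Represent a tokenized document as counts over word clusters.

    Tokens that have no cluster assignment are dropped.
    """
    return Counter(word_to_cluster[t] for t in tokens if t in word_to_cluster)


# Toy word-category co-occurrence counts (assumed data, for illustration only).
word_category = {
    ("goal", "sports"): 4,
    ("match", "sports"): 4,
    ("stock", "finance"): 4,
    ("bond", "finance"): 4,
}

# Two hand-made clusterings: one groups words with similar category
# distributions, the other mixes words from different categories.
good = {"goal": "A", "match": "A", "stock": "B", "bond": "B"}
bad = {"goal": "A", "stock": "A", "match": "B", "bond": "B"}

mi_words = mutual_information(word_category)
mi_good = mutual_information(cluster_joint(word_category, good))
mi_bad = mutual_information(cluster_joint(word_category, bad))

# Cluster-count features for one document; "the" has no cluster and is dropped.
doc_features = cluster_counts(["goal", "match", "goal", "the"], good)

print(mi_words, mi_good, mi_bad)
print(doc_features)
```

In this toy setting the first clustering halves the number of features while keeping all the information the words carry about the category, whereas the second discards all of it. A vector such as `doc_features` would then be passed to a classifier in place of the word counts.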