On the use of linear programming for unsupervised text classification

Authors:
Mark Sandler
Affiliations:
Cornell University, Ithaca, NY
Venue:
Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Year:
2005

Citing 19

A multiple cause mixture model for unsupervised learning
Neural Computation

Matrix computations (3rd ed.)

Latent semantic indexing: a probabilistic analysis
PODS '98 Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems

Distributional clustering of words for text classification
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval

Making large-scale support vector machine learning practical
Advances in kernel methods

Unsupervised learning by probabilistic latent semantic analysis
Machine Learning

Spectral analysis of data
STOC '01 Proceedings of the thirty-third annual ACM symposium on Theory of computing

Introduction to Modern Information Retrieval

Text Categorization with Support Vector Machines: Learning with Many Relevant Features
ECML '98 Proceedings of the 10th European Conference on Machine Learning

Correlation Clustering
FOCS '02 Proceedings of the 43rd Symposium on Foundations of Computer Science

On the Eigenvalue Power Law
RANDOM '02 Proceedings of the 6th International Workshop on Randomization and Approximation Techniques

On the use of the singular value decomposition for text retrieval
Computational information retrieval

Enhanced word clustering for hierarchical text classification
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining

Latent Dirichlet allocation
The Journal of Machine Learning Research

Correlation Clustering: maximizing agreements via semidefinite programming
SODA '04 Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms

Using mixture models for collaborative filtering
STOC '04 Proceedings of the thirty-sixth annual ACM symposium on Theory of computing

Feature selection, L1 vs. L2 regularization, and rotational invariance
ICML '04 Proceedings of the twenty-first international conference on Machine learning

Spectral Analysis of Random Graphs with Skewed Degree Distributions
FOCS '04 Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science

Probabilistic latent semantic analysis
UAI'99 Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence

Cited 6

Hierarchical mixture models: a probabilistic analysis
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining

Using mixture models for collaborative filtering
Journal of Computer and System Sciences

Unsupervised Text Learning Based on Context Mixture Model with Dirichlet Prior
Advanced Web and Network Technologies, and Applications

Which clustering do you want? Inducing your ideal clustering with minimal feedback
Journal of Artificial Intelligence Research

Applying machine learning in accounting research
Expert Systems with Applications: An International Journal

Towards the taxonomy-oriented categorization of yellow pages queries
ACM Transactions on Internet Technology (TOIT)

Abstract

We propose a new algorithm for dimensionality reduction and unsupervised text classification. We use mixture models as the underlying process that generates the corpus and utilize a novel L1-norm-based approach introduced by Kleinberg and Sandler [19]. We show that our algorithm performs extremely well on large datasets, with peak accuracy approaching that of supervised learning based on Support Vector Machines (SVMs) with large training sets. The method is based on the same idea that underlies Latent Semantic Indexing (LSI): we find a good low-dimensional subspace of the feature space and project all documents into it. However, our projection minimizes a different error, and unlike LSI we build a basis that in many cases corresponds to the actual topics. We present the results of testing our algorithm on the abstracts of arXiv, an electronic repository of scientific papers, and on the 20 Newsgroups dataset, a small snapshot of 20 specific newsgroups.
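
To make the mixture-model view and the L1 error in the abstract concrete, here is a minimal toy sketch. It is not the paper's algorithm: the two topic distributions, the four-word vocabulary, the helper names (`term_frequencies`, `mix`, `l1_distance`, `fit_two_topic_mixture`) and the brute-force grid search are all assumptions made for illustration, and the grid search merely stands in for whatever optimization the paper actually uses. The sketch treats a document's term-frequency vector as a convex combination of known topic distributions, looks for the mixing weight with the smallest L1 error, and labels the document by its dominant topic.

```python
from typing import List, Tuple


def term_frequencies(counts: List[int]) -> List[float]:
    """Normalize raw word counts into an empirical word distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def mix(topics: List[List[float]], weights: List[float]) -> List[float]:
    """Word distribution of a document: convex combination of topic distributions."""
    n_words = len(topics[0])
    return [sum(w * t[j] for w, t in zip(weights, topics)) for j in range(n_words)]


def l1_distance(p: List[float], q: List[float]) -> float:
    """L1 norm of the difference between two vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def fit_two_topic_mixture(
    doc: List[float], topics: List[List[float]], steps: int = 100
) -> Tuple[float, float]:
    """Grid-search the weight w of topic 0 (topic 1 gets 1 - w) that
    minimizes the L1 error between the document and the mixture.
    Hypothetical stand-in for a real optimizer."""
    best_w, best_err = 0.0, l1_distance(doc, mix(topics, [0.0, 1.0]))
    for i in range(1, steps + 1):
        w = i / steps
        err = l1_distance(doc, mix(topics, [w, 1.0 - w]))
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err


# Assumed toy data: two topics over a four-word vocabulary.
TOPICS = [
    [0.5, 0.5, 0.0, 0.0],  # topic 0 uses only words 0 and 1
    [0.0, 0.0, 0.5, 0.5],  # topic 1 uses only words 2 and 3
]

doc = term_frequencies([3, 3, 1, 1])     # mostly topic-0 words
weight, error = fit_two_topic_mixture(doc, TOPICS)
label = 0 if weight >= 0.5 else 1        # classify by dominant topic
print(weight, error, label)
```

In this toy setting the topic basis is given in advance; in the unsupervised setting described in the abstract, the basis itself has to be found from the corpus, which this sketch does not attempt.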