Term-weighting approaches in automatic text retrieval. Information Processing and Management: An International Journal.
Pivoted document length normalization. SIGIR '96: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
Probabilistic latent semantic indexing. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
Text classification from labeled and unlabeled documents using EM. Machine Learning (special issue on information retrieval).
A vector space model for automatic indexing. Communications of the ACM.
The Journal of Machine Learning Research.
Learning a kernel matrix for nonlinear dimensionality reduction. ICML '04: Proceedings of the Twenty-First International Conference on Machine Learning.
Pattern Recognition and Machine Learning (Information Science and Statistics).
Extracting and composing robust features with denoising autoencoders. Proceedings of the 25th International Conference on Machine Learning.
Probabilistic dyadic data analysis with local and global consistency. ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning.
Feature hashing for large scale multitask learning. ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning.
Proceedings of the 18th ACM Conference on Information and Knowledge Management.
A statistical approach to mechanized encoding and searching of literary information. IBM Journal of Research and Development.
In text mining, information retrieval, and machine learning, text documents are commonly represented through variants of sparse Bag-of-Words (sBoW) vectors (e.g., TF-IDF [1]). Although simple and intuitive, sBoW-style representations suffer from their inherent over-sparsity and fail to capture word-level synonymy and polysemy. Especially when labeled data is limited (e.g., in document classification) or the documents are short (e.g., emails or abstracts), many features are rarely observed in the training corpus, which leads to overfitting and reduced generalization accuracy. In this paper we propose Dense Cohort of Terms (dCoT), an unsupervised algorithm that learns improved sBoW document features. dCoT explicitly models absent words by removing and reconstructing random subsets of words in the unlabeled corpus. In this way, dCoT learns to reconstruct frequent words from co-occurring infrequent words and maps the high-dimensional sparse sBoW vectors into a low-dimensional dense representation. We show that the feature removal can be marginalized out and that the reconstruction can be solved in closed form. We demonstrate empirically, on several benchmark datasets, that dCoT features significantly improve classification accuracy across a range of document classification tasks.
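To make the marginalized removal and closed-form reconstruction concrete, below is a minimal NumPy sketch of the idea as the abstract describes it: each input feature is dropped independently with probability p, the expectation over all corruptions is computed analytically (no corrupted copies are ever materialized), and a linear map that reconstructs the most frequent words is solved for by least squares. The function name dcot_closed_form, the dropout rate, the ridge term, and the tanh nonlinearity are illustrative assumptions (the nonlinearity is borrowed from related marginalized-denoising work), not details confirmed by the paper.

```python
import numpy as np

def dcot_closed_form(X, top_k, p=0.5, eps=1e-5):
    """Hypothetical sketch of dCoT-style marginalized denoising.

    X      : (d, n) term-by-document matrix (e.g. TF-IDF); dense here
             for simplicity, although sBoW data is normally sparse.
    top_k  : number of frequent words used as reconstruction targets,
             i.e. the dimensionality of the dense output representation.
    p      : probability with which each input feature is removed.
    eps    : small ridge term for numerical stability (our addition).
    """
    d, n = X.shape

    # Reconstruction targets: the top_k most frequent words
    # (ranked here by document frequency).
    freq = (X > 0).sum(axis=1)
    targets = np.argsort(-freq)[:top_k]
    Y = X[targets, :]                       # (top_k, n)

    # Append a constant bias feature that is never corrupted.
    Xb = np.vstack([X, np.ones((1, n))])    # (d+1, n)
    q = np.full(d + 1, 1.0 - p)             # survival probability per feature
    q[-1] = 1.0                             # the bias always survives

    # Expected second moments under independent feature dropout:
    # E[x~ x~^T]_ab = S_ab q_a q_b for a != b, and S_aa q_a on the diagonal.
    S = Xb @ Xb.T
    Q = S * np.outer(q, q)
    np.fill_diagonal(Q, np.diag(S) * q)
    # E[y x~^T]_ab = (Y Xb^T)_ab q_b (only the input side is corrupted).
    P = (Y @ Xb.T) * q

    # Closed-form least-squares solution W = P Q^{-1}; no sampling needed.
    W = P @ np.linalg.inv(Q + eps * np.eye(d + 1))

    # Dense low-dimensional codes for the training corpus.
    H = np.tanh(W @ Xb)                     # (top_k, n)
    return W, H
```

A new document x (a length-d sBoW vector) would then be embedded as np.tanh(W @ np.append(x, 1.0)); the resulting top_k-dimensional code is dense even when x itself contains mostly rare or previously unseen words.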