Learning to separate text content and style for classification

Authors:
Dell Zhang;Wee Sun Lee
Affiliations:
School of Computer Science and Information Systems, Birkbeck, University of London, London, UK;Department of Computer Science and Singapore-MIT Alliance, National University of Singapore, Singapore
Venue:
AIRS'06 Proceedings of the Third Asia conference on Information Retrieval Technology
Year:
2006

Citing 9
Cited 0

Learning to classify text from labeled and unlabeled documents

AAAI '98/IAAI '98 Proceedings of the fifteenth national/tenth conference on Artificial intelligence/Innovative applications of artificial intelligence
Text Classification from Labeled and Unlabeled Documents using EM

Machine Learning - Special issue on information retrieval
Model-based feedback in the language modeling approach to information retrieval

Proceedings of the tenth international conference on Information and knowledge management
Machine learning in automated text categorization

ACM Computing Surveys (CSUR)
Machine Learning

Machine Learning
Athena: Mining-Based Interactive Management of Text Database

EDBT '00 Proceedings of the 7th International Conference on Extending Database Technology: Advances in Database Technology
A Comparative Study on Feature Selection in Text Categorization

ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
Cross-training: learning probabilistic mappings between topics

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Separating Style and Content with Bilinear Models

Neural Computation

Quantified Score

Hi-index	0.00

Visualization

Abstract

Many text documents naturally have two kinds of labels. For example, we may label web pages from universities according to their categories, such as “student” or “faculty”, or according the source universities, such as “Cornell” or “Texas”. We call one kind of labels the content and the other kind the style. Given a set of documents, each with both content and style labels, we seek to effectively learn to classify a set of documents in a new style with no content labels into its content classes. Assuming that every document is generated using words drawn from a mixture of two multinomial component models, one content model and one style model, we propose a method named Cartesian EM that constructs content models and style models through Expectation Maximization and performs classification of the unknown content classes transductively. Our experiments on real-world datasets show the proposed method to be effective for style independent text content classification.