Summarization as feature selection for document categorization on small datasets

Authors:
Emmanuel Anguiano-Hernández;Luis Villaseñor-Pineda;Manuel Montes-y-Gómez;Paolo Rosso
Affiliations:
Laboratory of Language Technologies, Department of Computational Sciences, National Institute of Astrophysics, Optics and Electronics, Mexico;Laboratory of Language Technologies, Department of Computational Sciences, National Institute of Astrophysics, Optics and Electronics, Mexico;Laboratory of Language Technologies, Department of Computational Sciences, National Institute of Astrophysics, Optics and Electronics, Mexico;Natural Language Engineering Lab, ELiRF, Department of Information Systems and Computation, Polytechnic University of Valencia, Spain
Venue:
IceTAL'10 Proceedings of the 7th international conference on Advances in natural language processing
Year:
2010

Abstract

Most common feature selection techniques for document categorization are supervised and require large amounts of training data to accurately capture the descriptive and discriminative information of the defined categories. Because training sets are extremely small in many classification tasks, this paper explores the use of unsupervised extractive summarization as a feature selection technique for document categorization. Our experiments with training sets of different sizes indicate that text summarization is a competitive approach to feature selection and show that it is well suited to situations with small training sets, where it could clearly outperform the traditional information gain technique.
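
To make the idea concrete, the following is a minimal sketch of summarization used as unsupervised feature selection. The abstract does not say which summarizer the authors use, so the frequency-based sentence scorer here is only a stand-in, and the names `split_sentences`, `tokenize`, `summarize`, `select_features`, and the `ratio` parameter are all hypothetical. The point is only that no category labels are needed: each document is reduced to its top-ranked sentences, and the words of those sentences form the feature set.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercase alphabetic tokens only; a real system would also remove stop words.
    return re.findall(r"[a-z]+", sentence.lower())


def summarize(text, ratio=0.5):
    """Return the top-scoring sentences of `text`, in their original order.

    Stand-in scorer (assumption): a sentence's score is the mean
    in-document frequency of its tokens. No class labels are used.
    """
    sentences = split_sentences(text)
    if not sentences:
        return []
    freq = Counter(tok for s in sentences for tok in tokenize(s))

    def score(sentence):
        toks = tokenize(sentence)
        return sum(freq[t] for t in toks) / len(toks) if toks else 0.0

    keep = max(1, int(len(sentences) * ratio))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:keep])]


def select_features(documents, ratio=0.5):
    # The selected vocabulary is the set of words that survive summarization.
    vocab = set()
    for doc in documents:
        for sentence in summarize(doc, ratio):
            vocab.update(tokenize(sentence))
    return vocab


doc = ("The cat sat on the mat. The cat chased the cat toy. "
       "Stock prices fell sharply.")
features = select_features([doc], ratio=0.5)
print(sorted(features))
```

Because the selection looks only at each document's own text, it behaves the same whether the training set holds ten documents or ten thousand, which is the property the abstract contrasts with supervised criteria such as information gain.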