An extensive empirical study of feature terms selection for text summarization and categorization

Authors:
Suneetha Manne;S. Sameen Fatima
Affiliations:
VR Siddhartha Engineering College, Vijayawada;Osmania University, Hyderabad
Venue:
Proceedings of the Second International Conference on Computational Science, Engineering and Information Technology
Year:
2012

Citing 4
Cited 0

Automatic condensation of electronic publications by sentence selection

Information Processing and Management: an International Journal - Special issue: summarizing text
New Methods in Automatic Extracting

Journal of the ACM (JACM)
Automatic summarization of search engine hit lists

RANLPIR '00 Proceedings of the ACL-2000 workshop on Recent advances in natural language processing and information retrieval: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics - Volume 11
The automatic creation of literature abstracts

IBM Journal of Research and Development

Quantified Score

Hi-index	0.00

Visualization

Abstract

The ever-increasing availability of online textual data bases and the development of Internet have necessitated intensive research in the area of automatic text summarization within the Natural Language Processing (NLP) community. Researchers and students constantly face the problem when they work on a research project that, it is almost impossible to read most of the newly published papers. The goal of text summarization based on extraction approach is sentences selection. One of the methods to obtain the sentences is to assign some feature terms of sentences for the summary called ranking sentences and then select the best ones. Broad indexing and speedy search alone are not enough for effective retrieval. Categorized data are easy for user to browse if the data is well organized. In the first stage each document is prepared by preprocessing process: sentence segmentation, tokenization, stop word removal, case folding, lemmatization, and stemming. Then, we used important features, sentence filtering features, data compression features and finally calculate their score for each sentence. We proposed text summarization based on HMM tagger to improve the quality of the summary. By creating impressions the documents are also categorized. We compared our results with the Copernicus summarizer, Great summarizer and Microsoft Word 2007 summarizers etc. The proposed system is tested with four types' similarities: Cosine, Jaccard, Jaro-winkler and Sorenson similarities. The results show that the best quality for the summaries was obtained by feature terms method. Our text categorization approach is validated with Naïve Bayesian, Decision Tree Induction, KNN and SVM approaches.