An extensive empirical study of feature terms selection for text summarization and categorization

  • Authors:
  • Suneetha Manne;S. Sameen Fatima

  • Affiliations:
  • VR Siddhartha Engineering College, Vijayawada;Osmania University, Hyderabad

  • Venue:
  • Proceedings of the Second International Conference on Computational Science, Engineering and Information Technology
  • Year:
  • 2012

Abstract

The ever-increasing availability of online textual databases and the growth of the Internet have necessitated intensive research on automatic text summarization within the Natural Language Processing (NLP) community. Researchers and students working on a research project constantly face the problem that it is almost impossible to read most of the newly published papers. The goal of extraction-based text summarization is sentence selection. One way to obtain the sentences for the summary is to score each sentence using feature terms, rank the sentences by score, and then select the best ones. Broad indexing and speedy search alone are not enough for effective retrieval; categorized, well-organized data are easier for users to browse. In the first stage, each document is preprocessed: sentence segmentation, tokenization, stop word removal, case folding, lemmatization, and stemming. We then apply important features, sentence filtering features, and data compression features, and finally calculate a score for each sentence. We propose text summarization based on an HMM tagger to improve the quality of the summary. The documents are also categorized by creating impressions. We compare our results with those of other summarizers, including the Copernicus summarizer, the Great summarizer, and the Microsoft Word 2007 summarizer. The proposed system is tested with four similarity measures: cosine, Jaccard, Jaro-Winkler, and Sorensen. The results show that the best summary quality is obtained by the feature terms method. Our text categorization approach is validated against Naïve Bayesian, Decision Tree Induction, KNN, and SVM approaches.
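As an illustration of three of the similarity measures named in the abstract, the following minimal Python sketch computes cosine, Jaccard, and Sorensen similarity between two token lists, such as a generated summary and a reference summary. The abstract does not say how the authors computed these scores, so the function names, the simple tokenizer, and the choice of term-frequency vectors for cosine and token sets for Jaccard and Sorensen are all assumptions made here for illustration. Jaro-Winkler is a character-level measure and is omitted for brevity.

```python
import math
from collections import Counter

PUNCTUATION = ".,;:!?()\"'"


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation.

    A deliberately simple stand-in (assumption) for the fuller preprocessing
    pipeline the abstract describes.
    """
    tokens = (word.strip(PUNCTUATION) for word in text.lower().split())
    return [token for token in tokens if token]


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between the two term-frequency vectors."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(tokens_a, tokens_b):
    """Size of the shared vocabulary divided by size of the combined vocabulary."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def sorensen_similarity(tokens_a, tokens_b):
    """Sorensen (Dice) coefficient: twice the overlap over the summed set sizes."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    total = len(set_a) + len(set_b)
    return 2 * len(set_a & set_b) / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical example sentences, not data from the paper.
    candidate = tokenize("Text summarization selects sentences.")
    reference = tokenize("Summarization ranks sentences.")
    print("cosine  :", cosine_similarity(candidate, reference))
    print("jaccard :", jaccard_similarity(candidate, reference))
    print("sorensen:", sorensen_similarity(candidate, reference))
```

All three functions return a value between 0 and 1, where 1 means the two token lists use identical vocabulary (and, for cosine, proportional term frequencies). Because Jaccard and Sorensen work on sets, they ignore repeated words, while cosine is sensitive to them.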