Feature annotation for text categorization

  • Authors:
  • Yashodhara Haribhakta;Santosh Kalamkar;Parag Kulkarni

  • Affiliations:
  • College of Engineering, Pune Shivajinagar, Pune, Maharashtra, India;College of Engineering, Pune Shivajinagar, Pune, Maharashtra, India;College of Engineering, Pune Shivajinagar, Pune, Maharashtra, India

  • Venue:
  • Proceedings of the CUBE International Information Technology Conference
  • Year:
  • 2012

Abstract

In text categorization, feature extraction is one of the major strategies for making text classifiers more efficient and accurate. Quickly selecting a suitable feature extraction strategy from the many proposed in previous studies is difficult. In this paper, we propose an efficient entity extraction approach to feature extraction that contributes to accurate text categorization. The entities identified in the proposed approach are person names, organization names, locations and dates. We use the GATE tool to extract these entities. Once the entities are identified, we annotate each of them in the original text with parameters. Three measures are used for feature selection: term frequency (TF), information gain (IG) and chi-square (χ²). The effectiveness and accuracy of the entity-annotated features are judged by using them for classification and comparing the results against those obtained with non-annotated features. The experiments are performed on standard benchmark datasets such as the NSF Abstract datasets and Reuters-21578. The experimental results show that text categorization with the annotated features is more accurate than with non-annotated features on the NSF Abstract-Title dataset. For Reuters-21578, however, there was no significant improvement in classification accuracy.
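
The three feature selection measures named in the abstract can be illustrated with a minimal sketch. It uses the textbook definitions of term frequency, chi-square and information gain over a 2x2 term/category contingency table; it is not taken from the paper, and the paper's exact formulation may differ. The function names, the toy corpus, and tokens such as `ent_org` (standing in for an annotated entity feature) are hypothetical.

```python
import math
from collections import Counter


def term_frequency(docs):
    """Total occurrences of each term across tokenized documents."""
    tf = Counter()
    for tokens in docs:
        tf.update(tokens)
    return tf


def contingency(docs, labels, term, category):
    """Document counts (a, b, c, d) for one term and one category.

    a: in category, contains term    b: outside category, contains term
    c: in category, lacks term       d: outside category, lacks term
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1
        elif has_term:
            b += 1
        elif in_cat:
            c += 1
        else:
            d += 1
    return a, b, c, d


def chi_square(a, b, c, d):
    """Chi-square statistic of the 2x2 table (0.0 if a margin is empty)."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


def entropy(counts):
    """Shannon entropy in bits of a list of non-negative counts."""
    total = sum(counts)
    return -sum((k / total) * math.log2(k / total) for k in counts if k)


def information_gain(a, b, c, d):
    """Reduction in category entropy from knowing whether the term occurs."""
    n = a + b + c + d
    h_conditional = 0.0
    for cell in ((a, b), (c, d)):  # term present, then term absent
        m = sum(cell)
        if m:
            h_conditional += (m / n) * entropy(cell)
    return entropy([a + c, b + d]) - h_conditional


# Hypothetical toy corpus; "ent_*" tokens stand in for annotated entities.
docs = [
    ["ent_person", "met", "ent_org"],
    ["ent_org", "profit"],
    ["rain", "ent_location"],
    ["rain", "storm"],
]
labels = ["biz", "biz", "weather", "weather"]

table = contingency(docs, labels, "ent_org", "biz")
print(term_frequency(docs)["ent_org"], chi_square(*table), information_gain(*table))
```

In a selection step, terms would typically be ranked by one of these scores and only the top-ranked ones kept as features; the cutoff is a tuning choice that the abstract does not specify.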