In text categorization, feature extraction is one of the main strategies for making text classifiers more efficient and accurate. Quickly selecting a suitable feature extraction strategy from the many proposed in previous studies is difficult. In this paper, we propose an efficient entity extraction approach to feature extraction that contributes to accurate text categorization. The approach identifies four entity types: person name, organization name, location, and date. We use the GATE tool to extract these entities and then annotate each extracted entity in the original text with parameters. Three measures are used for feature selection: term frequency (TF), information gain (IG), and chi-square (χ²). The effectiveness and accuracy of the entity-annotated features are judged by using them for classification and comparing the results against those obtained with non-annotated features. Experiments are performed on standard benchmark datasets such as the NSF Abstract datasets and Reuters-21578. The experimental results show that text categorization with the annotated features is more accurate than with non-annotated features on the NSF Abstract-Title dataset. For Reuters-21578, however, there was no significant improvement in classification accuracy.
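The three feature selection measures named above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the `ORG:`/`LOC:` prefix used to mark annotated entities, and all function names are assumptions made for illustration, and the abstract does not specify the actual annotation format. IG and χ² are computed here from a 2x2 term/category document-count table, which is one common formulation.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (tokens, category). Entity tokens carry an
# assumed "TYPE:" prefix standing in for the paper's entity annotation.
docs = [
    (["ORG:acme", "profit", "rises"], "earn"),
    (["ORG:acme", "profit", "falls"], "earn"),
    (["LOC:paris", "wheat", "harvest"], "grain"),
    (["wheat", "profit", "export"], "grain"),
]

def term_frequency(docs):
    """Total number of occurrences of each term over all documents."""
    tf = Counter()
    for tokens, _label in docs:
        tf.update(tokens)
    return tf

def contingency(docs, term, category):
    """Document counts (a, b, c, d) for one term and one category:
    a = term present, in category;  b = term present, other category;
    c = term absent, in category;   d = term absent, other category."""
    a = b = c = d = 0
    for tokens, label in docs:
        present = term in tokens
        in_cat = label == category
        if present and in_cat:
            a += 1
        elif present:
            b += 1
        elif in_cat:
            c += 1
        else:
            d += 1
    return a, b, c, d

def chi_square(a, b, c, d):
    """Chi-square statistic of the 2x2 table (0.0 if a margin is empty)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def entropy(counts):
    """Shannon entropy in bits of a list of non-negative counts."""
    total = sum(counts)
    return -sum((k / total) * math.log2(k / total) for k in counts if k)

def information_gain(a, b, c, d):
    """H(category) minus H(category | term present or absent)."""
    n = a + b + c + d
    h_cond = 0.0
    for x, y in ((a, b), (c, d)):
        if x + y:
            h_cond += (x + y) / n * entropy([x, y])
    return entropy([a + c, b + d]) - h_cond

if __name__ == "__main__":
    tf = term_frequency(docs)
    for term in ("ORG:acme", "profit"):
        table = contingency(docs, term, "earn")
        print(term, tf[term], chi_square(*table), information_gain(*table))
```

In this toy corpus the annotated token `ORG:acme` occurs only in the `earn` documents, so it scores higher on both χ² and IG than the plain word `profit`, which appears in both categories. Features would then be ranked by the chosen measure and the top-scoring ones kept for classification.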