A comparative study on feature reduction approaches in Hindi and Bengali named entity recognition

Authors:
Sujan Kumar Saha; Pabitra Mitra; Sudeshna Sarkar
Affiliations:
Dept. of CSE, Birla Institute of Technology Mesra, Ranchi 835215, India; Dept. of CSE, Indian Institute of Technology Kharagpur, Kharagpur 721302, India; Dept. of CSE, Indian Institute of Technology Kharagpur, Kharagpur 721302, India
Venue:
Knowledge-Based Systems
Year:
2012

Abstract

Features used for named entity recognition (NER) are often high-dimensional. Such feature sets cause overfitting when training data is insufficient, and dimensionality reduction improves performance in these situations. A number of dimensionality reduction approaches exist, based on either feature selection or feature extraction. In this paper we present a comprehensive comparative study of different dimensionality reduction approaches applied to the NER task. To compare the approaches, we consider two Indian languages, Hindi and Bengali. NER accuracies achieved in these languages remain comparatively poor, primarily because annotated corpora are scarce. For both languages, dimensionality reduction is found to improve classifier performance. The effectiveness of several dimensionality reduction techniques is compared in detail.