Combining confidence score and mal-rule filters for automatic creation of Bangla error corpus: grammar checker perspective

  • Authors:
  • Bibekananda Kundu;Sutanu Chakraborti;Sanjay Kumar Choudhury

  • Affiliations:
  • Language Technology, Centre for Development of Advanced Computing, Kolkata, India and Department of Computer Science and Engineering, Indian Institute of Technology, Chennai, India;Department of Computer Science and Engineering, Indian Institute of Technology, Chennai, India;Language Technology, Centre for Development of Advanced Computing, Kolkata, India

  • Venue:
  • CICLing'12 Proceedings of the 13th international conference on Computational Linguistics and Intelligent Text Processing - Volume Part II
  • Year:
  • 2012

Abstract

This paper describes a novel approach to the automatic creation of a Bangla error corpus for training and evaluating grammar checker systems. The procedure begins with the automatic creation of a large number of erroneous sentences from a set of grammatically correct sentences. A statistical Confidence Score Filter has been implemented to select proper samples from the generated erroneous sentences, such that sentences with less probable word sequences get lower confidence scores and vice versa. A rule-based Mal-rule Filter with an HMM-based semi-supervised POS tagger has been used to collect the sentences having improper tag sequences. The combination of these two filters ensures the robustness of the proposed approach, such that no valid construction is selected into the synthetically generated error corpus. Though the present work focuses on the most frequent grammatical errors in Bangla written text, a detailed taxonomy of grammatical errors in Bangla is also presented here, with the aim of increasing the coverage of the error corpus in the future. The proposed approach is language independent and could easily be applied to create similar corpora in other languages.
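
The two-filter idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the toy English sentences, the adjacent-word-swap error generator, the add-one-smoothed bigram model standing in for the Confidence Score Filter, the dictionary lookup standing in for the HMM-based POS tagger, the three tag-pair patterns standing in for mal-rules, and the choice to keep a candidate only when both filters flag it.

```python
import math
from collections import Counter

def train_bigrams(sentences):
    """Count unigrams and bigrams over tokenized sentences with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def confidence_score(tokens, unigrams, bigrams):
    """Average add-one-smoothed bigram log-probability (lower = less probable)."""
    padded = ["<s>"] + tokens + ["</s>"]
    vocab_size = len(unigrams)
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total / (len(padded) - 1)

def swap_adjacent(tokens):
    """Hypothetical error generator: swap each pair of adjacent words once."""
    candidates = []
    for i in range(len(tokens) - 1):
        cand = tokens[:]
        cand[i], cand[i + 1] = cand[i + 1], cand[i]
        if cand != tokens:
            candidates.append(cand)
    return candidates

# Toy stand-in for a POS tagger (the paper uses an HMM-based semi-supervised tagger).
TOY_TAGS = {"the": "DET", "a": "DET", "dog": "NOUN", "cat": "NOUN",
            "sees": "VERB", "chases": "VERB"}

# Hypothetical mal-rules: tag bigrams treated as improper in this toy grammar.
MAL_RULES = {("DET", "VERB"), ("DET", "</s>"), ("NOUN", "NOUN")}

def violates_mal_rule(tokens):
    """True if any adjacent tag pair (including sentence end) matches a mal-rule."""
    tags = [TOY_TAGS.get(t, "UNK") for t in tokens] + ["</s>"]
    return any(pair in MAL_RULES for pair in zip(tags, tags[1:]))

def build_error_corpus(correct_sentences, threshold):
    """Keep generated candidates that score below the threshold AND match a mal-rule."""
    unigrams, bigrams = train_bigrams(correct_sentences)
    selected = []
    for tokens in correct_sentences:
        for cand in swap_adjacent(tokens):
            score = confidence_score(cand, unigrams, bigrams)
            if score < threshold and violates_mal_rule(cand):
                selected.append((cand, score))
    return selected

# Usage on toy data; the threshold value is arbitrary.
correct = [["the", "dog", "sees", "the", "cat"],
           ["a", "cat", "chases", "the", "dog"]]
error_corpus = build_error_corpus(correct, threshold=-1.7)
```

Requiring both filters to agree is one way to read the claim that no valid construction enters the corpus: a rare but grammatical word order would fail the mal-rule check, and a frequent word order that happens to match a tag pattern would fail the score check. In practice the language model would be trained on a much larger corpus than the seed sentences.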