Combining confidence score and mal-rule filters for automatic creation of Bangla error corpus: grammar checker perspective

Authors:
Bibekananda Kundu;Sutanu Chakraborti;Sanjay Kumar Choudhury
Affiliations:
Language Technology, Centre for Development of Advanced Computing, Kolkata, India and Department of Computer Science and Engineering, Indian Institute of Technology, Chennai, India;Department of Computer Science and Engineering, Indian Institute of Technology, Chennai, India;Language Technology, Centre for Development of Advanced Computing, Kolkata, India
Venue:
CICLing'12 Proceedings of the 13th international conference on Computational Linguistics and Intelligent Text Processing - Volume Part II
Year:
2012

Abstract

This paper describes a novel approach to the automatic creation of a Bangla error corpus for training and evaluating grammar checker systems. The procedure begins by automatically generating a large number of erroneous sentences from a set of grammatically correct sentences. A statistical Confidence Score Filter selects suitable samples from the generated sentences, assigning lower confidence scores to sentences with less probable word sequences and higher scores to those with more probable ones. A rule-based Mal-rule Filter, supported by an HMM-based semi-supervised POS tagger, collects the sentences that have improper tag sequences. Combining the two filters makes the approach robust, so that no valid construction is selected into the synthetically generated error corpus. Although the present work focuses on the most frequent grammatical errors in written Bangla text, a detailed taxonomy of grammatical errors in Bangla is also presented, with the aim of increasing the coverage of the error corpus in the future. The proposed approach is language independent and could readily be applied to create similar corpora in other languages.
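
The two-filter idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes an add-one-smoothed bigram model as the source of the confidence score, represents mal-rules as a set of forbidden adjacent POS tag pairs, takes POS tags as given instead of running an HMM tagger, and keeps a candidate only when both filters flag it. All function names, the tag set, and the threshold are hypothetical.

```python
import math
from collections import Counter


def train_bigram_counts(sentences):
    """Collect unigram and bigram counts from tokenized correct sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def confidence_score(tokens, unigrams, bigrams):
    """Average add-one-smoothed bigram log-probability (assumed scoring scheme)."""
    vocab_size = len(unigrams)
    padded = ["<s>"] + tokens + ["</s>"]
    pairs = list(zip(padded, padded[1:]))
    total = 0.0
    for prev, word in pairs:
        prob = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        total += math.log(prob)
    return total / len(pairs)


def violates_mal_rule(tags, mal_rules):
    """True if any adjacent POS tag pair matches a (hypothetical) mal-rule."""
    return any(pair in mal_rules for pair in zip(tags, tags[1:]))


def keep_as_error(tokens, tags, unigrams, bigrams, mal_rules, threshold):
    """Keep a candidate only if both filters agree that it is erroneous."""
    low_confidence = confidence_score(tokens, unigrams, bigrams) < threshold
    return low_confidence and violates_mal_rule(tags, mal_rules)


# Toy data: English stand-ins for the correct sentences, purely illustrative.
correct_sentences = [["the", "cat", "sleeps"], ["the", "dog", "sleeps"]]
unigrams, bigrams = train_bigram_counts(correct_sentences)
mal_rules = {("NOUN", "DET")}  # hypothetical rule: a noun directly before a determiner

candidate = ["cat", "the", "sleeps"]
candidate_tags = ["NOUN", "DET", "VERB"]
is_error = keep_as_error(candidate, candidate_tags, unigrams, bigrams, mal_rules, -1.5)
```

Requiring agreement between the two filters is one plausible reading of the stated goal that no valid construction enters the error corpus: a sentence that merely contains rare words but has a legal tag sequence is rejected, as is a sentence whose tag sequence looks suspicious but whose word sequence is still probable.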