Detection of language (model) errors

  • Authors:
  • K. Y. Hung; R. W. P. Luk; D. Yeung; K. F. L. Chung; W. Shu

  • Affiliations:
  • Hong Kong Polytechnic University, Hong Kong (all authors)

  • Venue:
  • EMNLP '00 Proceedings of the 2000 Joint SIGDAT conference on Empirical methods in natural language processing and very large corpora: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics - Volume 13
  • Year:
  • 2000


Abstract

Bigram language models are popular in many language processing applications, for both Indo-European and Asian languages. However, when a Chinese language model is applied in a novel domain, its accuracy drops significantly, from 96% to 78% in our evaluation. We apply pattern recognition techniques (i.e. Bayesian, decision tree, and neural network classifiers) to detect language model errors. We examined two general types of features: model-based and language-specific features. In our evaluation, the Bayesian classifier produced the best recall (80%) but low precision (60%). The neural network produced good recall (75%) and precision (80%), but both the Bayesian and neural network classifiers had a low skip ratio (65%). The decision tree classifier produced the best precision (81%) and skip ratio (76%), but the lowest recall (73%).
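The abstract's setting, using a bigram language model and flagging low-probability positions as candidate errors, can be illustrated with a minimal sketch. This is not the paper's method (the paper trains Bayesian, decision tree, and neural network classifiers on model-based and language-specific features); it only shows the underlying bigram-scoring idea. The function names, the add-alpha smoothing, and the probability threshold are all illustrative assumptions.

```python
from collections import defaultdict

def train_bigram(corpus):
    """Count unigrams and bigrams from a list of token sequences.
    (Illustrative; not the authors' training procedure.)"""
    uni = defaultdict(int)
    bi = defaultdict(int)
    for sent in corpus:
        toks = ["<s>"] + sent  # sentence-start marker
        for w in toks:
            uni[w] += 1
        for a, b in zip(toks, toks[1:]):
            bi[(a, b)] += 1
    return uni, bi

def bigram_prob(uni, bi, a, b, vocab_size, alpha=1.0):
    """Add-alpha smoothed estimate of P(b | a)."""
    return (bi[(a, b)] + alpha) / (uni[a] + alpha * vocab_size)

def flag_errors(uni, bi, sent, vocab_size, threshold=0.15):
    """Return token positions whose incoming bigram probability
    falls below the (assumed) threshold -- candidate model errors."""
    toks = ["<s>"] + sent
    return [i for i, (a, b) in enumerate(zip(toks, toks[1:]))
            if bigram_prob(uni, bi, a, b, vocab_size) < threshold]

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
uni, bi = train_bigram(corpus)
V = len(uni)
print(flag_errors(uni, bi, ["the", "cat", "sat"], V))    # in-domain: no flags
print(flag_errors(uni, bi, ["the", "zebra", "sat"], V))  # unseen bigram flagged
```

A classifier-based detector, as in the paper, would replace the raw threshold with a model trained on features of each position (e.g. the bigram probability itself plus language-specific cues), trading off recall, precision, and skip ratio as reported in the abstract.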