A corpus balancing method for language model construction

Authors:
Luis Villaseñor-Pineda;Manuel Montes-Y-Gómez;Manuel Alberto Pérez-Coutiño;Dominique Vaufreydaz
Affiliations:
Instituto Nacional de Astrofísica, Óptica y Electrónica, Mexico;Instituto Nacional de Astrofísica, Óptica y Electrónica, Mexico;Instituto Nacional de Astrofísica, Óptica y Electrónica, Mexico;Laboratoire CLIPS-IMAG, Université Joseph Fourier, France
Venue:
CICLing '03: Proceedings of the 4th International Conference on Computational Linguistics and Intelligent Text Processing
Year:
2003

Abstract

The language model is an important component of any speech recognition system. In this paper, we present a methodology for the lexical enrichment of corpora, aimed at the construction of statistical language models. The methodology consists of two steps: first, identifying the set of poorly represented words in a given training corpus; second, enriching that corpus through the repeated inclusion of selected text fragments containing these words. The first part of the paper describes the formal details of the methodology; the second part presents experiments and results that validate the method.
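
The two steps named in the abstract can be illustrated with a minimal sketch. The abstract does not state how a word is judged poorly represented or how fragments are selected, so the sketch makes its own assumptions: a word counts as poorly represented when its frequency falls below a threshold `min_count`, and fragments are drawn from an external pool `candidates`. The function names, the whitespace tokenizer, and the `max_rounds` cap are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; a stand-in for real preprocessing."""
    return text.lower().split()


def poorly_represented(corpus, vocabulary, min_count):
    """Return the vocabulary words seen fewer than min_count times in corpus.

    The frequency threshold is an assumed criterion, not the paper's.
    """
    counts = Counter(tok for sentence in corpus for tok in tokenize(sentence))
    return {word for word in vocabulary if counts[word] < min_count}


def enrich(corpus, candidates, min_count, max_rounds=10):
    """Repeatedly append candidate fragments containing under-represented words.

    Only words of the original training corpus are tracked, so new words
    brought in by the added fragments do not trigger further rounds.
    """
    vocabulary = {tok for sentence in corpus for tok in tokenize(sentence)}
    enriched = list(corpus)
    for _ in range(max_rounds):
        rare = poorly_represented(enriched, vocabulary, min_count)
        # Keep every fragment that mentions at least one rare word.
        selected = [frag for frag in candidates if rare & set(tokenize(frag))]
        if not selected:
            break
        enriched.extend(selected)
    return enriched
```

Under these assumptions, the loop stops once every word of the original vocabulary reaches the threshold, once no remaining candidate fragment covers a rare word, or after `max_rounds` passes. Because the same fragment may be appended in several rounds, the sketch mirrors the "repeated inclusion" described above in spirit only.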