A corpus balancing method for language model construction

  • Authors:
  • Luis Villaseñor-Pineda; Manuel Montes-Y-Gómez; Manuel Alberto Pérez-Coutiño; Dominique Vaufreydaz

  • Affiliations:
  • Instituto Nacional de Astrofísica, Óptica y Electrónica, Mexico; Instituto Nacional de Astrofísica, Óptica y Electrónica, Mexico; Instituto Nacional de Astrofísica, Óptica y Electrónica, Mexico; Laboratoire CLIPS-IMAG, Université Joseph Fourier, France

  • Venue:
  • CICLing'03 Proceedings of the 4th international conference on Computational linguistics and intelligent text processing
  • Year:
  • 2003

Abstract

The language model is an important component of any speech recognition system. In this paper, we present a methodology for the lexical enrichment of corpora, aimed at the construction of statistical language models. The methodology has two steps: first, identifying the set of words that are poorly represented in a given training corpus; second, enriching that corpus by repeatedly including selected text fragments that contain these words. The first part of the paper describes the formal details of the methodology; the second part presents experiments and results that validate the method.
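
The two steps named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's formal criteria, so everything specific here is an assumption made for illustration: a word counts as poorly represented when it occurs fewer than `min_count` times, fragments are added in the order given, tokenization is a lowercase whitespace split, and the function names `poorly_represented` and `enrich` are hypothetical.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for real preprocessing.
    return text.lower().split()


def word_counts(sentences):
    # Count every token across all sentences of the corpus.
    return Counter(tok for s in sentences for tok in tokenize(s))


def poorly_represented(corpus, vocabulary, min_count):
    # Assumption: "poorly represented" means fewer than min_count occurrences.
    counts = word_counts(corpus)
    return {w for w in vocabulary if counts[w] < min_count}


def enrich(corpus, fragments, vocabulary, min_count, max_rounds=10):
    # Repeatedly append fragments that contain at least one poorly
    # represented word, until no such word remains, no fragment helps,
    # or max_rounds passes over the fragment pool have been made.
    enriched = list(corpus)
    counts = word_counts(enriched)
    for _ in range(max_rounds):
        poor = {w for w in vocabulary if counts[w] < min_count}
        if not poor:
            break
        added = False
        for frag in fragments:
            toks = tokenize(frag)
            if poor & set(toks):
                enriched.append(frag)
                counts.update(toks)
                added = True
                poor = {w for w in vocabulary if counts[w] < min_count}
                if not poor:
                    break
        if not added:
            break
    return enriched
```

Because the pool of fragments is scanned again on each round, the same fragment can be included more than once, which is one possible reading of "repeatedly including"; the paper's own selection rule may differ.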