Dynamic language modeling for European Portuguese

  • Authors:
  • Ciro Martins;António Teixeira;João Neto

  • Affiliations:
  • Department Electronics, Telecommunications & Informatics/IEETA - Aveiro University, Aveiro, Portugal and L2F - Spoken Language Systems Lab - INESC-ID/IST, Lisbon, Portugal;Department Electronics, Telecommunications & Informatics/IEETA - Aveiro University, Aveiro, Portugal;L2F - Spoken Language Systems Lab - INESC-ID/IST, Lisbon, Portugal

  • Venue:
  • Computer Speech and Language
  • Year:
  • 2010

Abstract

This paper reports on daily vocabulary and language model adaptation for a European Portuguese broadcast news transcription system. The proposed adaptation framework takes into account characteristics of European Portuguese, such as its high level of inflection and complex verbal system. It is a multi-pass speech recognition framework that uses contemporary written texts available daily on the Web. Morpho-syntactic knowledge (part-of-speech information) about an in-domain training corpus guides the daily selection of an optimal vocabulary. Using an information retrieval engine with the ASR hypotheses as query material, relevant documents are extracted from a large, dynamic dataset to generate a story-based language model. When applied to a daily closed-captioning system for live TV broadcasts, the framework proved effective, yielding relative reductions of 69% in out-of-vocabulary (OOV) word rate and 12.0% in word error rate (WER) compared with a baseline system using the same vocabulary size.
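The retrieval step described in the abstract, where a first-pass ASR hypothesis serves as the query for finding related written texts, might be sketched as follows. This is a minimal illustration only: the abstract does not say which information retrieval engine or ranking function the authors used, so plain TF-IDF with cosine similarity stands in for it here, and the function names (`tokenize`, `tfidf_vectors`, `cosine`, `retrieve`) and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokenization on lowercased text.
    # A real system would use a proper Portuguese tokenizer.
    return text.lower().split()


def tfidf_vectors(docs_tokens):
    # Build smoothed IDF weights and one sparse TF-IDF vector per document.
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in docs_tokens
    ]
    return vectors, idf


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(hypothesis, documents, top_k=2):
    # Rank documents against the ASR hypothesis and return the indices of
    # the top_k documents that share at least one term with it.
    docs_tokens = [tokenize(d) for d in documents]
    vectors, idf = tfidf_vectors(docs_tokens)
    # Query terms unseen in the collection are ignored.
    query = {
        t: count * idf[t]
        for t, count in Counter(tokenize(hypothesis)).items()
        if t in idf
    }
    scored = sorted(
        ((cosine(query, vec), i) for i, vec in enumerate(vectors)),
        reverse=True,
    )
    return [i for score, i in scored[:top_k] if score > 0.0]


# Toy stand-ins for the daily Web texts and a first-pass ASR hypothesis.
web_texts = [
    "governo aprova novo imposto sobre tabaco",
    "benfica vence jogo de futebol em lisboa",
    "parlamento discute imposto e despesa do estado",
]
asr_hypothesis = "governo anuncia imposto sobre tabaco"
selected = retrieve(asr_hypothesis, web_texts)
print(selected)
```

In a pipeline like the one the abstract outlines, the selected documents would then feed the estimation of a story-specific language model for a later recognition pass; that step is not shown here.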