Dynamic language modeling for European Portuguese

Authors:
Ciro Martins;António Teixeira;João Neto
Affiliations:
Department Electronics, Telecommunications & Informatics/IEETA - Aveiro University, Aveiro, Portugal and L2F - Spoken Language Systems Lab - INESC-ID/IST, Lisbon, Portugal;Department Electronics, Telecommunications & Informatics/IEETA - Aveiro University, Aveiro, Portugal;L2F - Spoken Language Systems Lab - INESC-ID/IST, Lisbon, Portugal
Venue:
Computer Speech and Language
Year:
2010

Abstract

This paper reports on daily vocabulary and language model adaptation for a European Portuguese broadcast news transcription system. The proposed adaptation framework takes into account characteristics of European Portuguese, such as its high level of inflection and complex verbal system. A multi-pass speech recognition framework is proposed that uses contemporary written texts available daily on the Web. It uses morpho-syntactic knowledge (part-of-speech information) about an in-domain training corpus for the daily selection of an optimal vocabulary. Using an information retrieval engine with the automatic speech recognition (ASR) hypotheses as query material, relevant documents are extracted from a large, dynamic dataset to generate a story-based language model. When applied to a daily closed-captioning system for live TV broadcasts, the framework proved effective, yielding relative reductions of 69% in out-of-vocabulary (OOV) word rate and 12.0% in word error rate (WER) compared with the baseline system using the same vocabulary size.
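The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the information retrieval engine or its ranking function, so plain TF-IDF with cosine similarity is used here as a stand-in, and all function names (`retrieve`, `relative_reduction`, and the helpers) are hypothetical. The sketch ranks a small document collection against a first-pass ASR hypothesis used as the query, and also shows how a relative reduction such as the reported WER figure is computed from a baseline and an adapted value.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real systems normalize more)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector per document, plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    # Smoothed IDF so that every known word has a positive weight.
    idf = {w: math.log((1 + n_docs) / (1 + c)) + 1.0 for w, c in doc_freq.items()}
    vectors = [
        {w: tf * idf[w] for w, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(hypothesis, docs, top_k=2):
    """Return indices of up to top_k documents most similar to the ASR hypothesis."""
    vectors, idf = tfidf_vectors(docs)
    # Words in the hypothesis that never occur in the collection are ignored.
    query = {
        w: tf * idf[w]
        for w, tf in Counter(tokenize(hypothesis)).items()
        if w in idf
    }
    scored = sorted(
        ((cosine(query, vec), i) for i, vec in enumerate(vectors)),
        reverse=True,
    )
    return [i for score, i in scored[:top_k] if score > 0.0]


def relative_reduction(baseline, adapted):
    """Relative reduction of an error rate, e.g. (WER_base - WER_new) / WER_base."""
    return (baseline - adapted) / baseline
```

In a pipeline like the one the abstract outlines, the retrieved documents would then be used to estimate a story-specific language model for a later recognition pass; that estimation step is outside the scope of this sketch.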