Temporal Issues and Recognition Errors on the Capitalization of Speech Transcriptions

Authors:
Fernando Batista;Nuno Mamede;Isabel Trancoso
Affiliations:
L2F - Spoken Language Systems Laboratory - INESC ID Lisboa, Lisboa, Portugal 1000-029 and ISCTE - Instituto de Ciências do Trabalho e da Empresa, Portugal;L2F - Spoken Language Systems Laboratory - INESC ID Lisboa, Lisboa, Portugal 1000-029 and IST - Instituto Superior Técnico, Portugal;L2F - Spoken Language Systems Laboratory - INESC ID Lisboa, Lisboa, Portugal 1000-029 and IST - Instituto Superior Técnico, Portugal
Venue:
TSD '08 Proceedings of the 11th international conference on Text, Speech and Dialogue
Year:
2008

Abstract

This paper investigates the capitalization task over Broadcast News speech transcriptions. Most of the capitalization information is provided by two large newspaper corpora, and the spoken language model is produced by retraining the newspaper language models with spoken data. Three corpus subsets from different time periods are used for evaluation, revealing the importance of having training data from nearby time periods. Results are reported for both manual and automatic transcriptions, which also shows the impact of recognition errors on the capitalization task. Our approach is based on maximum entropy models and uses an unlimited vocabulary. The language model produced with this approach can be sorted and then pruned to reduce computational resources, without much impact on the final results.
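
To make the idea in the abstract concrete, the following is a minimal sketch of a maximum entropy (multinomial logistic) capitalization tagger whose weights can be sorted and pruned. It is not the authors' implementation: the feature templates, the three-label scheme, the plain stochastic gradient training, and pruning by absolute weight are all assumptions chosen for illustration, and the toy training data is invented. Features are keyed by strings, so no fixed vocabulary is required.

```python
import math
from collections import defaultdict

# Assumed label scheme (hypothetical; the abstract does not specify one).
LABELS = ("lower", "first_cap", "all_caps")

def features(words, i):
    """Hypothetical feature templates: the word and its two word bigrams."""
    w = words[i]
    prev = words[i - 1] if i > 0 else "<s>"
    nxt = words[i + 1] if i + 1 < len(words) else "</s>"
    return [f"w={w}", f"prev_w={prev}|{w}", f"w_next={w}|{nxt}"]

def predict_proba(weights, feats):
    """Softmax over summed (feature, label) weights, i.e. a maxent model."""
    scores = [sum(weights.get((f, y), 0.0) for f in feats) for y in LABELS]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train(sentences, epochs=50, lr=0.5):
    """Fit weights by stochastic gradient ascent on the log-likelihood.

    sentences: list of sentences, each a list of (lowercased_word, label).
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for sent in sentences:
            words = [w for w, _ in sent]
            for i, (_, gold) in enumerate(sent):
                feats = features(words, i)
                probs = predict_proba(weights, feats)
                for y, p in zip(LABELS, probs):
                    grad = (1.0 if y == gold else 0.0) - p
                    for f in feats:
                        weights[(f, y)] += lr * grad
    return dict(weights)

def prune(weights, keep):
    """Sort weights by absolute value and keep the `keep` largest (assumed criterion)."""
    ranked = sorted(weights.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return dict(ranked[:keep])

def capitalize(weights, words):
    """Apply the most probable capitalization label to each lowercased word."""
    out = []
    for i, w in enumerate(words):
        probs = predict_proba(weights, features(words, i))
        label = LABELS[probs.index(max(probs))]
        if label == "first_cap":
            out.append(w.capitalize())
        elif label == "all_caps":
            out.append(w.upper())
        else:
            out.append(w)
    return out

# Invented toy data standing in for newspaper text with capitalization labels.
TRAIN = [
    [("the", "lower"), ("president", "lower"), ("visited", "lower"), ("lisboa", "first_cap")],
    [("inesc", "all_caps"), ("is", "lower"), ("in", "lower"), ("lisboa", "first_cap")],
    [("portugal", "first_cap"), ("is", "lower"), ("in", "lower"), ("europe", "first_cap")],
]

model = train(TRAIN)
small_model = prune(model, keep=len(model) // 2)
print(capitalize(model, ["inesc", "is", "in", "portugal"]))
print(len(model), "weights before pruning,", len(small_model), "after")
```

In this sketch, pruning simply discards the weights closest to zero, which contribute least to any score. Whether a pruned model preserves accuracy on real data has to be measured empirically, as the paper does.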