A comparison of front-ends for bitstream-based ASR over IP

Authors:
Carmen Peláez-Moreno; Ascensión Gallardo-Antolín; Diego F. Gómez-Cajas; Fernando Díaz-de-María
Affiliations:
Dpto. de Teoría de la Señal y Comunicaciones, EPS-Universidad Carlos III de Madrid, Avda. de la Universidad, Leganés, Madrid, Spain (all four authors)
Venue:
Signal Processing
Year:
2006

Citing 6
Cited 3

Speech analysis and synthesis methods developed at ECL in NTT-From LPC to LSP-

Speech Communication - Special issue: Speech research in Japan
Measurements and analysis of end-to-end Internet dynamics
Speech recognition using quantized LSP parameters and their transformations in digital communication

Speech Communication
Digital Speech; Coding for Low Bit Rate Communication Systems
LSP weighting functions based on spectral sensitivity and mel-frequency warping for speech recognition in digital communication

ICASSP '99: Proceedings of the 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing - Volume 01
Recognizing voice over IP: a robust front-end for speech recognition on the world wide web

IEEE Transactions on Multimedia

Fast communication: Cepstral domain interpretations of line spectral frequencies

Signal Processing
Review: Line spectral pairs

Signal Processing
Robust distributed speech recognition in noise and packet loss conditions

Digital Signal Processing

Abstract

Automatic speech recognition (ASR) is expected to play an important role in providing spoken interfaces for IP-based applications. However, because the speech signal travels over these networks, ASR systems face two new challenges: the degradation of speech quality caused by the compression needed to fit the channel capacity, and the inevitable occurrence of packet losses.

In this framework, bitstream-based approaches, which obtain the ASR feature vectors directly from the coded bitstream and thus avoid the speech decoding process, have been proposed to improve the robustness of ASR systems ([S.H. Choi, H.K. Kim, H.S. Lee, Speech recognition using quantized LSP parameters and their transformations in digital communications, Speech Commun. 30 (4) (2000) 223-233; A. Gallardo-Antolín, C. Peláez-Moreno, F. Díaz-de-María, Recognizing GSM digital speech, IEEE Trans. Speech Audio Process., to appear; H.K. Kim, R.V. Cox, R.C. Rose, Performance improvement of a bitstream-based front-end for wireless speech recognition in adverse environments, IEEE Trans. Speech Audio Process. 10 (8) (2002) 591-604; C. Peláez-Moreno, A. Gallardo-Antolín, F. Díaz-de-María, Recognizing voice over IP networks: a robust front-end for speech recognition on the WWW, IEEE Trans. Multimedia 3 (2) (2001) 209-218], among others). LSP (Line Spectral Pairs) are the preferred parameters for describing the speech spectral envelope in most modern speech coders. Nevertheless, LSP have proved unsuitable for ASR and must be transformed into cepstrum-type parameters. In this paper we comparatively evaluate the robustness of the most significant LSP-to-cepstrum transformations in a simulated VoIP (voice over IP) environment that includes two of the most popular codecs used in such networks (G.723.1 and G.729) and several network conditions. In particular, we compare the 'pseudocepstrum' [H.K. Kim, S.H. Choi, H.S. Lee, On approximating Line Spectral Frequencies to LPC cepstral coefficients, IEEE Trans. Speech Audio Process. 8 (2) (2000) 195-199], an approximate but straightforward transformation of LSP into LP cepstral coefficients, with a more computationally demanding but exact one. Our results show that the pseudocepstrum is preferable when network conditions are good or computational resources are low, while the exact procedure is recommended when network conditions become more adverse.
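
The two kinds of LSP-to-cepstrum transformation compared in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and the function names are hypothetical. It assumes an even LP order p, line spectral frequencies given in ascending order in radians, and the convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p. The "exact" path rebuilds A(z) from the LSP polynomials P(z) and Q(z) and then applies the standard LPC-to-cepstrum recursion. The approximate path uses the closed form that results from replacing A(z)^2 by P(z)Q(z), which is the idea behind the pseudocepstrum.

```python
import numpy as np

def lsf_to_lpc(lsf):
    """Rebuild [1, a1, ..., ap] from p line spectral frequencies (radians).

    Assumes p is even and the LSFs are sorted in ascending order.
    """
    lsf = np.asarray(lsf, dtype=float)
    p = len(lsf)
    P = np.array([1.0, 1.0])    # P(z) always has a root at z = -1
    Q = np.array([1.0, -1.0])   # Q(z) always has a root at z = +1
    for w in lsf[0::2]:         # 1st, 3rd, ... LSFs are roots of P(z)
        P = np.convolve(P, [1.0, -2.0 * np.cos(w), 1.0])
    for w in lsf[1::2]:         # 2nd, 4th, ... LSFs are roots of Q(z)
        Q = np.convolve(Q, [1.0, -2.0 * np.cos(w), 1.0])
    # A(z) = (P(z) + Q(z)) / 2; the z^-(p+1) terms cancel, so drop the last one.
    return (0.5 * (P + Q))[: p + 1]

def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum c_1..c_n of 1/A(z) via the standard recursion."""
    p = len(a) - 1
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k]
        c[n] = -acc
    return c[1:]

def pseudocepstrum(lsf, n_ceps):
    """Approximate cepstrum computed directly from the LSFs (no LPC step)."""
    lsf = np.asarray(lsf, dtype=float)
    n = np.arange(1, n_ceps + 1)
    cos_sum = np.cos(np.outer(n, lsf)).sum(axis=1)
    return (0.5 * (1.0 + (-1.0) ** n) + cos_sum) / n

# Toy second-order example (illustrative values only, not from the paper).
lsf_demo = [np.arccos(0.7), np.arccos(0.2)]
exact_demo = lpc_to_cepstrum(lsf_to_lpc(lsf_demo), 2)
approx_demo = pseudocepstrum(lsf_demo, 2)
```

In this sketch the approximate path needs only one cosine sum per coefficient, whereas the exact path requires polynomial reconstruction followed by a recursion, which reflects the cost difference mentioned in the abstract. The first coefficient is identical in both paths because the term neglected by the approximation starts at z^-2. In a bitstream-based front-end the LSFs would come from the codec's dequantized parameters rather than from hand-picked values.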