The paper presents the Position Specific Posterior Lattice (PSPL), a novel lossy representation of automatic speech recognition (ASR) lattices that lends itself naturally to efficient indexing and subsequent relevance ranking of spoken documents. The technique explicitly accounts for content uncertainty through "soft hits". Indexing position information makes it possible to approximate N-gram expected counts and, at the same time, to use more general proximity features in the relevance-score calculation. In effect, any state-of-the-art text-retrieval algorithm can easily be ported to the scenario of indexing ASR lattices for spoken documents instead of the 1-best recognition result. Experiments on a collection of lecture recordings (the MIT iCampus database) show that spoken document ranking performance improved by 17-26% relative over the commonly used baseline of indexing the 1-best ASR output.

The paper also addresses the problem of integrating speech and text content sources for document search, as well as its usefulness from an ad hoc retrieval (keyword search) point of view. In this context, the PSPL formulation is naturally extended to handle both the speech and the text content of a given document, and a new relevance-ranking framework is proposed for integrating the different sources of information available. Experimental results on the MIT iCampus corpus show a relative improvement of 302% in Mean Average Precision (MAP) when using speech content together with text-only metadata, as opposed to text-only metadata alone (which amounts to about 1% of the speech-content transcription, measured in number of words).
Further experiments show that even when the metadata is artificially augmented to contain more than 10% of the spoken document transcription, the speech content still yields significant MAP gains over relevance ranking based on the text metadata alone.
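To make the idea concrete, the PSPL-style scoring described above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the lattice, its posteriors, the document name, and the log(1 + count) combination are all assumptions chosen for clarity. A PSPL stores, for each document and each position bin, a posterior distribution over words ("soft hits"); the expected count of an N-gram is approximated by summing, over start positions, the product of the per-position word posteriors.

```python
from math import log

# Hypothetical PSPL index: pspl[doc][l] maps a word to its posterior
# probability of occurring at position bin l in that document.
# (Toy numbers; a real index would be built from ASR lattices.)
pspl = {
    "lecture1": {
        0: {"speech": 0.7, "speed": 0.3},
        1: {"recognition": 0.9, "wreck": 0.1},
        2: {"lattice": 0.6, "lettuce": 0.4},
    }
}

def expected_ngram_count(doc, ngram):
    """Approximate the expected count of an N-gram as the sum over
    start positions of the product of per-position posteriors."""
    positions = pspl[doc]
    total = 0.0
    for start in positions:
        p = 1.0
        for k, word in enumerate(ngram):
            p *= positions.get(start + k, {}).get(word, 0.0)
        total += p
    return total

def relevance(doc, query, weights=(1.0, 1.0)):
    """Combine expected counts of all query unigrams and bigrams via
    log(1 + count) -- one simple way to aggregate N-gram evidence."""
    score = 0.0
    for n, w in zip((1, 2), weights):
        for i in range(len(query) - n + 1):
            score += w * log(1.0 + expected_ngram_count(doc, tuple(query[i:i + n])))
    return score
```

For the query ["speech", "recognition"], the unigram expected counts are 0.7 and 0.9, and the bigram expected count is 0.7 × 0.9 = 0.63, so a document containing a confident "speech recognition" region scores higher than one where either word is uncertain. Indexing 1-best output corresponds to the special case where each position bin holds a single word with probability 1.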