Speech retrieval from unsegmented Finnish audio using statistical morpheme-like units for segmentation, recognition, and retrieval

Authors:
Ville T. Turunen; Mikko Kurimo
Affiliations:
Aalto University School of Science, Aalto, Finland; Aalto University School of Science, Aalto, Finland
Venue:
ACM Transactions on Speech and Language Processing (TSLP)
Year:
2008

Abstract

This article examines the use of statistically discovered morpheme-like units for Spoken Document Retrieval (SDR). The morpheme-like units (morphs) are used both for language modeling in speech recognition and as index terms. Traditional word-based methods suffer from out-of-vocabulary words: if a word is not in the recognizer vocabulary, every occurrence of that word in speech will be missing from the transcripts. The problem is especially severe for languages with a high number of distinct word forms, such as Finnish. With the morph language model, even previously unseen words can be recognized by identifying their component morphs. Similarly, in information retrieval queries, complex word forms, even unseen ones, can be matched to data after segmenting them into morphs. Retrieval performance can be further improved by expanding the transcripts with alternative recognition results from confusion networks. For this article, a novel retrieval evaluation corpus was constructed, consisting of unsegmented Finnish radio programs, 25 queries, and corresponding human relevance assessments. Previous results on using morphs and confusion networks for Finnish SDR are confirmed and extended to the unsegmented case. As in previous work, using morphs or base forms as index terms yields about equal performance, but combination methods, including a new one, are found to work better than either alone. Using alternative morph segmentations of the query words is found to further improve the results. Lexical similarity-based story segmentation was applied, and performance using morphs, base forms, and their combinations was compared for the first time.
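
The idea of matching unseen word forms through their component morphs can be illustrated with a minimal Python sketch. It is not the authors' method: the morph lexicon `MORPHS`, the greedy longest-match `segment` function, and the overlap-based `score` are all hypothetical stand-ins chosen for brevity, whereas the article uses statistically discovered units and a full retrieval model.

```python
from collections import Counter

# Hypothetical toy morph lexicon (an assumption for illustration only);
# a real system would learn its units statistically from a text corpus.
MORPHS = {"talo", "auto", "ssa", "sta", "i"}


def segment(word, morphs=MORPHS):
    """Greedy longest-match split of a word into known morphs.

    Characters that match no morph are emitted as single-character units.
    """
    units, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in morphs:
                units.append(word[i:j])
                i = j
                break
        else:
            # No known morph starts here: fall back to one character.
            units.append(word[i])
            i += 1
    return units


def index_terms(text):
    """Count the morph units of every whitespace-separated word in text."""
    return Counter(m for w in text.lower().split() for m in segment(w))


def score(query, doc_terms):
    """Simple overlap score: number of query morphs also found in the document."""
    q = index_terms(query)
    return sum(min(count, doc_terms[m]) for m, count in q.items())


transcript = "talosta"           # "from the house"
query = "talossa"                # "in the house", a form absent from the transcript
word_match = int(query in transcript.split())
morph_match = score(query, index_terms(transcript))
```

In this toy case the word-level comparison finds no match, while the morph-level comparison still connects the query to the transcript through the shared stem `talo`.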