Information retrieval test collection for searching spontaneous Czech speech

Authors:
Pavel Ircing;Pavel Pecina;Douglas W. Oard;Jianqiang Wang;Ryen W. White;Jan Hoidekr
Affiliations:
University of West Bohemia, Faculty of Applied Sciences, Department of Cybernetics, Plzeň, Czech Republic;Charles University, Institute of Formal and Applied Linguistics, Praha, Czech Republic;University of Maryland, College of Information Studies, UMIACS, College Park, MD;State University of New York at Buffalo, Department of Library and Information Studies, Buffalo, NY;Microsoft Research, Redmond, WA;University of West Bohemia, Faculty of Applied Sciences, Department of Cybernetics, Plzeň, Czech Republic
Venue:
TSD'07 Proceedings of the 10th international conference on Text, speech and dialogue
Year:
2007

Abstract

This paper describes the design of the first large-scale IR test collection built for the Czech language. Creating this collection was particularly challenging because it is based on a continuous text stream produced by automatic transcription of spontaneous speech and therefore lacks clearly defined document boundaries. All aspects of building the collection are presented, together with some general findings from initial experiments.