Arabic speech and text in TIDES OnTAP

Authors:
Jayadev Billa;Mohamed Noamany;Amit Srivastava;John Makhoul;Francis Kubala
Affiliations:
BBN Technologies, Cambridge, MA;BBN Technologies, Cambridge, MA;BBN Technologies, Cambridge, MA;BBN Technologies, Cambridge, MA;BBN Technologies, Cambridge, MA
Venue:
HLT '02 Proceedings of the second international conference on Human Language Technology Research
Year:
2002

Abstract

This paper describes the introduction of Arabic speech and text into the TIDES OnTAP system. This includes the development of the BBN Audio Indexing System for Arabic broadcast news, and the development and introduction of an Arabic event tracker and Arabic querying into the TIDES OnTAP system. Key issues addressed in this work revolve around the three major components of the audio indexing system (automatic speech recognition, speaker identification, and named entity identification) and around Arabic document tracking. The system deals with several challenges introduced by the Arabic language, including the absence of short vowels in written text and the presence of compound words formed by attaching certain conjunctions, prepositions, articles, and pronouns to the word stem as prefixes and suffixes. The absence of short vowels in the transcripts was addressed with a novel solution that leverages the strengths of hidden Markov models. Another challenge was the acquisition of appropriate language modeling data, given the absence of broadcast news data for that purpose. We present performance results for all three components of the Audio Indexing System.