Arabic speech and text in TIDES OnTAP

  • Authors:
  • Jayadev Billa; Mohamed Noamany; Amit Srivastava; John Makhoul; Francis Kubala

  • Affiliations:
  • BBN Technologies, Cambridge, MA; BBN Technologies, Cambridge, MA; BBN Technologies, Cambridge, MA; BBN Technologies, Cambridge, MA; BBN Technologies, Cambridge, MA

  • Venue:
  • HLT '02 Proceedings of the second international conference on Human Language Technology Research
  • Year:
  • 2002

Abstract

This paper describes the introduction of Arabic speech and text into the TIDES OnTAP system. The work includes the development of the BBN Audio Indexing System for Arabic broadcast news, the development and introduction of an Arabic event tracker, and the addition of Arabic querying to the TIDES OnTAP system. Key issues addressed in this work revolve around the three major components of the audio indexing system (automatic speech recognition, speaker identification, and named entity identification), together with Arabic document tracking. The system deals with several challenges introduced by the Arabic language, including the absence of short vowels in written text and the presence of compound words formed by concatenating certain conjunctions, prepositions, articles, and pronouns as prefixes and suffixes to the word stem. The absence of short vowels in the transcripts was addressed with a novel solution that leverages the strengths of hidden Markov models. Another challenge was the acquisition of appropriate language modeling data, given the absence of broadcast news data for that purpose. We present performance results for all three components of the Audio Indexing System.