A speech mashup framework for multimodal mobile services

Authors:
Giuseppe Di Fabbrizio;Thomas Okken;Jay G. Wilpon
Affiliations:
AT&T Labs - Research, Inc., Florham Park, NJ, USA;AT&T Labs - Research, Inc., Florham Park, NJ, USA;AT&T Labs - Research, Inc., Florham Park, NJ, USA
Venue:
Proceedings of the 2009 international conference on Multimodal interfaces
Year:
2009

Abstract

Amid today's proliferation of Web content and mobile phones with broadband data access, interacting with small-form-factor devices is still cumbersome. Spoken interaction could overcome the input limitations of mobile devices, but running an automatic speech recognizer within the limited computational capabilities of a mobile device becomes an impossible challenge when large recognition vocabularies must be updated frequently with dynamic content. One popular option is to move the speech processing resources into the network, concentrating the heavy computational load on server farms. Although successful services have exploited this approach, it is unclear how such a model can be generalized to a wide range of mobile applications and how it can be scaled for large deployments. To address these challenges, we introduce the AT&T speech mashup architecture, a novel approach to speech services that leverages web services and cloud computing to make it easier to combine web content and speech processing. We show that this new compositional method is suitable for integrating automatic speech recognition and text-to-speech synthesis resources into real multimodal mobile services. The generality of this method allows researchers and speech practitioners to explore a wide variety of mobile multimodal services with finer-grained control and richer multimedia interfaces. Moreover, we demonstrate that the speech mashup is scalable and optimized to minimize round trips in the mobile network, reducing latency for a better user experience.