Navigating the Information Superhighway Using Spoken Language Interfaces

  • Authors:
  • Victor W. Zue

  • Venue:
  • IEEE Expert: Intelligent Systems and Their Applications
  • Year:
  • 1995

Abstract

Computers are fast becoming a ubiquitous part of our lives, brought on by their rapid increase in performance and decrease in cost. With their increased availability comes the corresponding increase in our appetite for information. This trend is reflected in the astronomical growth in the number of Internet hosts, the number of home pages on the World Wide Web, and the corresponding network traffic. For example, the 1994 collision of the Shoemaker/Levy comets with Jupiter increased the demand for Jupiter images at one host by 40,000 over a one-week period. Vast amounts of useful information are being made widely available, and people are using it routinely for education, decision-making, finance, and entertainment.

The advent of the Information Age places increasing demands on the notion of universal access. For information to be truly accessible to all -- especially the technologically naive -- anytime, anywhere, one must seriously address the issue of user interface. An interface based on a user's own language is particularly appealing, because it is the most natural, flexible, and efficient means of communication among humans.

After many years of research, spoken input to computers is just beginning to pass the threshold of practicality. The last decade has witnessed dramatic improvements in speech recognition technology, to the extent that high-performance algorithms and systems are becoming available. In some cases, the transition from laboratory demonstration to commercial deployment has already begun. Speech input capabilities are emerging that can provide functions like voice dialing ("Call home"), call routing ("I would like to make a collect call"), simple data entry (entering a credit card number), and preparation of structured documents (performing a radiology report).

Speech recognition is a very challenging problem in its own right, with a well-defined set of applications. However, many tasks that lend themselves to spoken input -- such as making travel arrangements or selecting a movie -- are in fact exercises in interactive problem solving. The solution is often built up incrementally, with both the user and the computer playing active roles in the "conversation." Therefore, several language-based input and output technologies must be developed and integrated to reach this goal. Regarding the former, speech recognition must be combined with natural language processing so the computer can understand spoken commands (often in the context of previous parts of the dialogue). On the output side, some of the information provided by the computer -- and any of the computer's requests for clarification -- must be converted to natural sentences, perhaps delivered verbally.

In a typical conversational system, the spoken input is first processed through the speech recognition component. The natural language component, working in concert with the recognizer, produces a meaning representation. For information retrieval applications illustrated in this figure, the system can use the meaning representation to retrieve the appropriate information in the form of text, tables and graphics. If the information in the utterance is insufficient, the system may choose to query the user for clarification. Speech output can also be generated by processing the information through natural language generation and text-to-speech synthesis. Throughout the process, discourse information is maintained and fed back to the speech recognition and language understanding components.

This article illustrates the usefulness of an intuitive, speech-based interface using Galaxy, a system under development at MIT's Laboratory for Computer Science that enables universal information access using spoken dialogue. Galaxy differs from current spoken language systems in a number of ways. First, it is distributed and decentralized: Galaxy uses a client-server architecture to allow sharing of computationally expensive processes (such as large vocabulary speech recognition), as well as knowledge intensive processes. Second, it is multidomain, intended to provide access to a wide variety of information sources and services while insulating the user from the details of database location and format. Finally, it is extensible; users can add new knowledge domain servers to the system incrementally.
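
The conversational flow and the extensible, multidomain design described in the abstract can be illustrated with a small sketch. It is not Galaxy's implementation: the names Meaning, Hub, register, handle, and weather_server are hypothetical, speech recognition and language understanding are assumed to have already produced the meaning representation, and the domain servers are plain in-process functions rather than networked servers. The sketch shows only three ideas: routing a meaning representation to a registered domain server, asking the user for clarification when required information is missing, and carrying discourse information across turns.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class Meaning:
    """Hypothetical meaning representation: a domain plus slot values."""
    domain: str
    slots: Dict[str, str] = field(default_factory=dict)


# A domain server turns a complete meaning representation into an answer.
DomainServer = Callable[[Meaning], str]


class Hub:
    """Toy multidomain front end (an assumption, not Galaxy's design)."""

    def __init__(self) -> None:
        self.servers: Dict[str, DomainServer] = {}
        self.required: Dict[str, Tuple[str, ...]] = {}
        self.discourse: Dict[str, str] = {}  # slots remembered across turns

    def register(self, domain: str, server: DomainServer,
                 required_slots: Tuple[str, ...] = ()) -> None:
        """Add a new knowledge domain without touching existing ones."""
        self.servers[domain] = server
        self.required[domain] = tuple(required_slots)

    def handle(self, meaning: Meaning) -> str:
        """Answer one user turn, or ask for clarification."""
        if meaning.domain not in self.servers:
            return f"Sorry, I cannot help with {meaning.domain} yet."
        # Merge what was said earlier with what was said in this turn.
        slots = {**self.discourse, **meaning.slots}
        self.discourse = slots
        missing = [s for s in self.required[meaning.domain] if s not in slots]
        if missing:
            # The utterance was insufficient: query the user.
            return f"Could you tell me the {missing[0]}?"
        return self.servers[meaning.domain](Meaning(meaning.domain, slots))


def weather_server(m: Meaning) -> str:
    """Placeholder domain server; a real one would query a data source."""
    return f"Forecast for {m.slots['city']} on {m.slots['date']}: (lookup omitted)"


hub = Hub()
hub.register("weather", weather_server, required_slots=("city", "date"))
first_reply = hub.handle(Meaning("weather", {"city": "Boston"}))
second_reply = hub.handle(Meaning("weather", {"date": "tomorrow"}))
```

In this sketch the first turn yields a clarification request because the date is missing, and the second turn succeeds because the city is recalled from the stored discourse information. Adding another domain requires only another call to register, which is the sense of "extensible" the sketch is meant to convey.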