SpeechActs Overview
SpeechActs was a research prototype developed in the Speech Applications Group from 1993 to 1997 as a testbed for developing spoken natural language applications (see the page on applications). A primary goal in developing the system was to enable software developers without special expertise in speech or natural language to create effective conversational speech applications, that is, applications in which users can speak naturally, as if they were conversing with a personal assistant (see the page on user interface issues).
Another goal was for SpeechActs applications to work in conjunction with one another on a discourse level without having specific knowledge of the other applications running in the same suite. For example, if someone talks about "Tom Jones" in one application and then mentions "Tom" later in the conversation while in another application, that second application should know that the user means "Tom Jones" and not some other "Tom."
Given the rapidly changing technology, a third goal was to avoid tying developers to specific speech recognizers or synthesizers. We wanted them to be able to use these speech technologies as plug-in components. SpeechActs supported a handful of speaker-independent, continuous speech recognizers: Hark from BBN, Dagger from Texas Instruments, and Nuance Communications' recognizers. For speech synthesis, the framework used either TruVoice text-to-speech (previously from Centigram, now from Lernout & Hauspie) or TrueTalk from Entropic. The architecture of the system made it straightforward to add new recognizers and synthesizers to the existing set.
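The cross-application behavior in the "Tom Jones" example can be pictured with a small sketch. This is an illustration only, not the SpeechActs implementation: the class `SharedDiscourseContext` and its methods `mention` and `resolve` are hypothetical names, and the matching rule (most recent full name containing the spoken word) is an assumption made for the example.

```python
from typing import List, Optional


class SharedDiscourseContext:
    """Hypothetical store of people mentioned so far, shared by all applications."""

    def __init__(self) -> None:
        self._mentions: List[str] = []  # most recent mention is last

    def mention(self, full_name: str) -> None:
        # Re-mentioning a name moves it to the most recent position.
        if full_name in self._mentions:
            self._mentions.remove(full_name)
        self._mentions.append(full_name)

    def resolve(self, partial: str) -> Optional[str]:
        # Return the most recently mentioned full name that contains the
        # partial name as a whole word, or None if nothing matches.
        wanted = partial.lower()
        for full_name in reversed(self._mentions):
            if wanted in full_name.lower().split():
                return full_name
        return None


context = SharedDiscourseContext()
context.mention("Tom Jones")        # e.g. spoken while in one application
context.mention("Mary Smith")
resolved = context.resolve("Tom")   # e.g. spoken later in another application
print(resolved)
```

Because the context lives outside any single application, the second application needs no knowledge of the first; it only asks the shared context who "Tom" is.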
Framework Architecture
The SpeechActs framework consists of an audio server, the Swiftus natural language processor, a discourse manager, a text-to-speech manager, and a set of grammar building tools. These pieces work in conjunction with third-party speech components and application-developer-supplied components. The major components illustrated in this diagram are explained below.
User Input and Output: A telephone or microphone converts the user's speech to raw audio.
Audio Server: This presents the speech recognizer with raw audio from the user and also presents the user with raw audio from the Text-to-Speech Engine.
Unified Grammar: This recognizer-independent grammar, coupled with a lexicon, defines the utterances that can be recognized by the speech recognizer and by the Swiftus natural language processor.
Speech Recognizer: Given a stream of audio and a grammar, the recognizer decides when an utterance is complete and presents a list of recognized words to Swiftus.
Swiftus Natural Language Processor: Swiftus uses an augmented context-free grammar, derived from the Unified Grammar, to parse the word list and to present the Discourse Manager with feature-value pairs that encode the semantics of the utterance.
Discourse Manager: The feature-value pairs pass through Discourse Manager snooper functions, which scan for ones that require special action (e.g., a request to switch to a different application). If no snooper intervenes, the pairs are passed to the current application. The Discourse Manager also supplies a discourse stack and specialists.
Application: Each application acts on the pairs and synthesizes strings appropriate for speaking to the user. Applications can call upon Discourse Manager specialists to perform services such as conversion of relative dates (e.g., "the Wednesday after New Year's" becomes 3 January 1996).
Text-to-Speech Manager: The TTS Manager accepts strings from applications or from the Discourse Manager and presents them to the TTS Engine.
Text-to-Speech Engine: This converts strings to raw audio, which is then passed through the Audio Server to a telephone or speaker.
For more detailed information about the system architecture, refer to "SpeechActs: A Framework for Building Speech Applications" (9 PostScript pages) or the more recent IEEE Computer paper "SpeechActs: A Spoken Language Framework" (8 PDF pages).
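The Discourse Manager's role described above (snoopers first, then the current application, with specialists available as services) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the SpeechActs API: the names `DiscourseManager`, `weekday_after`, `calendar_app`, and `mail_app`, and the feature names `action`, `target`, and `date`, are all hypothetical.

```python
from datetime import date, timedelta
from typing import Callable, Dict, Optional

Pairs = Dict[str, str]  # feature-value pairs (feature names here are invented)


def weekday_after(anchor: date, weekday: int) -> date:
    """Specialist sketch: first given weekday (Monday=0) strictly after anchor."""
    days_ahead = (weekday - anchor.weekday() - 1) % 7 + 1
    return anchor + timedelta(days=days_ahead)


def calendar_app(pairs: Pairs) -> str:
    # The application asks a specialist to convert a relative date.
    if pairs.get("action") == "lookup" and pairs.get("date") == "wednesday-after-new-years":
        day = weekday_after(date(1996, 1, 1), 2)  # 2 = Wednesday
        return "Looking up " + day.isoformat() + "."
    return "Sorry, I did not understand."


def mail_app(pairs: Pairs) -> str:
    return "You have no new messages."


class DiscourseManager:
    def __init__(self, applications: Dict[str, Callable[[Pairs], str]], current: str) -> None:
        self.applications = applications
        self.current = current
        self.snoopers = [self._switch_snooper]

    def _switch_snooper(self, pairs: Pairs) -> Optional[str]:
        # Intercept a request to switch to a different application.
        target = pairs.get("target")
        if pairs.get("action") == "switch" and target in self.applications:
            self.current = target
            return "Switching to " + target + "."
        return None

    def handle(self, pairs: Pairs) -> str:
        # Snoopers see the pairs first; if none intervenes, the pairs
        # go to the current application.
        for snooper in self.snoopers:
            reply = snooper(pairs)
            if reply is not None:
                return reply
        return self.applications[self.current](pairs)


manager = DiscourseManager({"mail": mail_app, "calendar": calendar_app}, current="mail")
reply1 = manager.handle({"action": "switch", "target": "calendar"})
reply2 = manager.handle({"action": "lookup", "date": "wednesday-after-new-years"})
print(reply1)
print(reply2)
```

In this sketch the returned strings stand in for what an application would hand to the Text-to-Speech Manager; the date specialist reproduces the example above, since 1 January 1996 fell on a Monday.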