HTML's ability to describe pages and layouts was a major factor in the rise of the World Wide Web. But HTML has a fundamental flaw: It assumes that output goes to a graphical display on a computer. Five or 10 years ago, that was a natural and obvious assumption.
But nowadays people want to be able to access the Web when they're away from their desktops, using phones, pagers, handheld devices and even household appliances. Most of these devices have graphical displays, but those displays are small at best, bandwidth is limited, the devices aren't well suited to normal Web browsing, and they generally lack keyboards for input or control. In business, many areas of customer support have moved to Web-based systems, and there's a real need to make those systems accessible from any telephone, without the benefit of a computer client or visual display.
In other words, we want to be able to talk to our Web pages and have them talk back to us. This is called voice browsing, and it lets users retrieve information from the Web by means of speech synthesis, prerecorded audio and speech recognition. Voice capability can be added to conventional desktop browsers, and as mobile devices become smaller, voice interaction can provide a more practical alternative to tiny keypads and undersized displays.
The World Wide Web Consortium is working to expand access to the Web to allow people to interact via keypads, spoken commands, prerecorded speech, synthetic speech and music. In 1998, the W3C sponsored a voice browsing workshop. The next year, it formed a working group whose members included AT&T Corp., British Telecommunications PLC, Lucent Technologies Inc., Philips Electronics NV, IBM, Motorola Inc. and Nokia Corp. The group is working on interrelated XML-based languages and standards for developing speech applications. Called the W3C Speech Interface Framework, this platform includes the following:
- VoiceXML 2.0, for defining dialogues and specifying the exchange of data between the user and a speech application.
- VoiceXML 2.1, which adds to VoiceXML 2.0 a small set of features that vendors have already widely implemented.
- Speech Recognition Grammar Specification, for specifying the structure of user input to a speech application.
- Speech Synthesis Markup Language, for specifying just how synthesized speech is rendered to the user -- e.g., the type of voice used and specific pronunciations.
- Semantic Interpretation for Speech Recognition, which defines links between grammar rules and application semantics, so that spoken variations of the same element, such as "Coke" and "Coca-Cola," are treated as equivalent (see the grammar sketch after this list).
- CCXML (Call Control XML), for specifying call control functions.
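To make the grammar and semantic-interpretation pieces concrete, here is a minimal sketch of an SRGS grammar in its XML form with SISR tags. It isn't taken from the specifications, and the rule name and semantic value are illustrative, but it shows the idea: whether the caller says "Coke" or "Coca-Cola," the application receives the same value.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US"
         root="drink" tag-format="semantics/1.0">
  <rule id="drink" scope="public">
    <one-of>
      <!-- Two spoken forms, one semantic result -->
      <item>coke <tag>out = "coca-cola";</tag></item>
      <item>coca cola <tag>out = "coca-cola";</tag></item>
    </one-of>
  </rule>
</grammar>
```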
VoiceXML is the most visible part of this framework; the other elements are essentially infrastructure. VoiceXML leverages those other specifications to create dialogues that feature synthesized speech, digitized audio, recognition of spoken and DTMF key (i.e., touch-tone) input, recording of spoken input and telephony, and it hides many of the complexities of telephony platforms from the developer.
VoiceXML has features to control audio output and input, presentation logic, flow, event handling and basic telephony connections. Applications built with VoiceXML can include prerecorded audio material, just as HTML can incorporate existing images in a graphical page.
HTML is oriented toward screen layouts that present multiple objects at the same time. Speech, however, is much more linear -- you can hear only one thing at a time -- and so VoiceXML has to control the interaction between the user and the application. In almost all cases, the application and user take turns speaking: The application prompts the user, and then the user responds.
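A minimal VoiceXML sketch of that turn-taking might look like the following. The file names and wording are hypothetical, but the pattern is the one the language defines: the application speaks a prompt, then waits for the caller to fill in a field.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="order">
    <field name="drink">
      <!-- The application takes its turn: a prerecorded greeting, then a question -->
      <prompt>
        <audio src="welcome.wav">Welcome to the beverage line.</audio>
        Would you like Coke or root beer?
      </prompt>
      <!-- Then it listens for input that matches an SRGS grammar -->
      <grammar src="drinks.grxml" type="application/srgs+xml"/>
      <noinput>Sorry, I didn't hear you. <reprompt/></noinput>
    </field>
    <filled>
      <prompt>You chose <value expr="drink"/>. Goodbye.</prompt>
      <disconnect/>
    </filled>
  </form>
</vxml>
```

The audio element plays prerecorded material with synthesized text as a fallback, and noinput is one of the event handlers the language provides for the moments when the caller says nothing at all.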
Languages like VoiceXML and its predecessors have to support two kinds of markup: one that describes the text according to its structure or content, and another that controls aspects of how speech is to be produced, such as voice pitch and emphasis.
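A short, hypothetical Speech Synthesis Markup Language fragment illustrates the distinction: say-as describes what a piece of text is, while prosody and emphasis control how the synthesizer should produce it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <p>
    <!-- Markup that describes content: this text is a time of day -->
    Your flight departs at <say-as interpret-as="time">8:45 AM</say-as>.
  </p>
  <p>
    <!-- Markup that controls production: pitch, rate and emphasis -->
    <prosody pitch="high" rate="slow">Please</prosody>
    arrive <emphasis level="strong">two hours</emphasis> early.
  </p>
</speak>
```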
Kay is a Computerworld contributing writer. You can reach him at russkay@charter.net.