Over the years, we have experienced dramatic changes in how we interact with computers. We went from flipping switches to stacking punch cards to typing away on hard copy, and now we manipulate objects on a desktop using touch screens.
Each transition has given us greater ease of interaction and a new metaphor (switches, direct commands, desktops) for thinking about that interaction. But advances in speech recognition are completely changing the game.
Siri, Cortana, Alexa and Google Now have given us a taste of what it means to directly communicate our goals and needs to the machine. Recently, Facebook and Microsoft both announced that they are pivoting toward conversational interfaces, Amazon has expanded the support ecosystem for Alexa, and Siri's creator announced Viv as a new platform that will be the "intelligent interface for everything." We can assume that Apple and Google will expand the reach of Siri and Google Now, respectively.
Conversation as an interface is the best way for machines to interact with us using the human tool we already know exceptionally well -- language.
To date, most of the speech-driven systems that we use have been of two main flavors: 1) triggered task models and 2) call and response systems. The first, triggered task models, actually do things for you. The second, call and response systems, are just fun to interact with as entertainment.
As we look forward, there are five core functionalities and interaction dynamics that we can expect to see in conversational systems. Along with the two models just mentioned, we are going to see systems that support search, complex interactive tasks and interactive information access. Some already exist, some are on the horizon and some are here but we don’t know it yet.
Chat as Chat
Many of the chat systems out there are less focused on fulfilling orders and more focused on entertainment. Expanding upon the tradition of Eliza and Parry, these systems generate responses based on a statistical model of how relevant their responses will be to a user’s input text.
The modern take on this approach now includes the use of machine learning to build out more and more relevant responses. In the end, however, these systems neither know what you are saying nor what they are saying. They just know that there might be a good relationship between them.
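To make the idea concrete, here is a minimal sketch of a response picker driven purely by word overlap. The stored exchanges, the function names and the choice of cosine similarity over word counts are all assumptions made for illustration; real chat systems use far richer statistical models, but the principle of matching without understanding is the same.

```python
import math
import re
from collections import Counter

# Hypothetical canned exchanges: (prompt seen before, reply that went with it).
EXCHANGES = [
    ("how are you today", "I'm doing great, thanks for asking!"),
    ("what is your favorite music", "I love anything with a good beat."),
    ("tell me a joke", "Why did the robot go on vacation? To recharge."),
]

def bag_of_words(text):
    """Lowercase the text and count how often each word appears."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def reply(user_text):
    """Return the reply whose stored prompt shares the most words with the input."""
    query = bag_of_words(user_text)
    _, best_reply = max(
        EXCHANGES, key=lambda pair: cosine(query, bag_of_words(pair[0]))
    )
    return best_reply
```

Calling `reply("Tell me a funny joke")` returns the joke, because that stored prompt shares the most words with the input. Nothing in the code represents what a joke is, and the function will confidently return some reply even for input it has never seen anything like.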
Tay and its predecessor, Xiaoice, are great examples of this model. The latter has had a successful run in China with millions of happy users. Tay, on the other hand, was transformed in under a day into a misogynistic racist, egged on by 4chan users.
Taken down almost immediately, Tay is a clear example of how systems that don’t actually understand what they are saying can be extremely problematic. They are parrots, and parrots can be trained to say almost anything. And while a parrot may be fun to talk with, you don’t want it to schedule your trip to Barbados.
As we will see, the more that these interactive systems actually know about the world and about you, the more powerful and reliable they become.
Triggered Task Models
At their core, Siri and her counterparts are keyword response systems. They use speech recognition to first identify the words you say and then, depending on the identification of specific trigger terms, funnel the remaining words to one of a small set of programs that will handle the tasks associated with the triggers. You can ask them to “play” music, “turn on” lights, request an “Uber” pickup, or order a “Domino’s” pizza. Within the confines of the terms they know, these systems can provide incredible service.
However, in order to get your pizza from this interaction, the system already needs to know what kind of pizza you want. The conversation is just another way of pushing the “order” button. Putting together a more complex order or changing your preferences is beyond the capabilities of these systems as they stand. They know what you have said and thus what to do in response (“Give me my standard order.”).
For greater complexity, they also need to be programmed with all possible information options that may arise throughout the order and then use that to manage longer interactions (“Do you want a thin or thick crust?”).
When Siri doesn’t recognize any trigger terms in your query, she falls back on using the words themselves to drive a search for possible answers. Because we are so used to the dynamic of search, this kind of fallback seems fairly natural. We accept the idea, and rightly so, that when a system cannot figure out what we really want it to do, it will at least try to find us information that is relevant to the words, if not the meaning, of what we have just said.
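The trigger-and-fallback dynamic can be sketched in a few lines. This is an illustration under stated assumptions, not how Siri or any other assistant is actually built: the handler functions, the `TRIGGERS` table and the `web_search` stand-in are all hypothetical, and the input is assumed to be already-transcribed text without punctuation.

```python
def play_music(rest):
    """Hypothetical handler for the "play" trigger."""
    return f"Playing {rest or 'your usual playlist'}"

def turn_on(rest):
    """Hypothetical handler for the "turn on" trigger."""
    return f"Turning on {rest}"

def web_search(words):
    """Hypothetical stand-in for handing the raw words to a search engine."""
    return f"Searching the web for: {words}"

# Trigger phrase -> program that handles the remaining words.
TRIGGERS = {
    "play": play_music,
    "turn on": turn_on,
}

def handle(utterance):
    """Route an utterance to a handler by trigger term, else fall back to search."""
    text = utterance.lower().strip()
    padded = f" {text} "  # padding so triggers only match whole words
    # Check longer triggers first so multi-word phrases win over short ones.
    for trigger in sorted(TRIGGERS, key=len, reverse=True):
        marker = f" {trigger} "
        pos = padded.find(marker)
        if pos != -1:
            rest = padded[pos + len(marker):].strip()
            return TRIGGERS[trigger](rest)
    # No trigger recognized: use the words themselves as a search query.
    return web_search(text)
```

Here `handle("Please play some jazz")` is routed to the music handler with "some jazz" as the remaining words, while `handle("who wrote Moby Dick")` matches no trigger and is passed to search. Note that `handle` keeps nothing between calls; each utterance is treated as if it were the first.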
Of course, this approach only supports limited interaction. The best way to think of this is that once the search is over and the response is provided, these systems tend to forget what they just told you. The search engine itself may hold onto overall preferences, but these interactive systems do not.
In order to perform more complex tasks, these systems need to remember more of what you have said and what you want.
Complex Task Interactions
The promise of emerging systems such as Viv and the model that Microsoft and Facebook are trying to support is that they will be able to help you through far more complex tasks. Rather than helping you order the same pizza over and over again, these systems will help you put together an evening out, plan and arrange a vacation, or help you with financial planning.
The difference between these and the current state of the various mobile assistants is that they will have knowledge of tasks, the information needed to perform them and the ability to track the information that you have already passed on to them. They will have the knowledge of concierges, travel agents and financial planners. They will then know how to use this knowledge to manage the conversation.
These systems know what they need to know in order to help you (e.g. where you want to go on vacation, when you want to travel, how many people, how much you want to spend, how many kids) and then use that to support the conversation.
The models proposed so far to support this level of task complexity have tended towards highly structured interactions that feel more like stepping through a series of questions on a survey than a conversation. Fortunately, some of the smartest people in A.I. are working right now on this model. Their goal is a model in which the system’s knowledge gaps become the drivers for the interaction rather than where you are in the script.
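One way to picture the difference between a script and a gap-driven interaction is the sketch below. It assumes a hypothetical vacation-planning task whose required facts mirror the examples above; the slot names, the questions and the two helper functions are invented for illustration, and the hard part (extracting facts from what the user actually says) is assumed to happen elsewhere.

```python
# Hypothetical facts a vacation planner needs, with the question that asks for each.
REQUIRED = {
    "destination": "Where do you want to go?",
    "dates": "When do you want to travel?",
    "travelers": "How many people are going?",
    "budget": "How much do you want to spend?",
}

def update(known, new_facts):
    """Merge facts the user volunteered, in any order, ignoring unknown slots."""
    merged = dict(known)
    merged.update({k: v for k, v in new_facts.items() if k in REQUIRED})
    return merged

def next_question(known):
    """Ask about the first fact still missing; return None when nothing is missing."""
    for slot, question in REQUIRED.items():
        if slot not in known:
            return question
    return None
```

If the user opens with "I want to go to Barbados for about $3,000," the system records both facts at once and its next question is about dates, not destination. What it asks depends on what it still does not know, not on a fixed position in a questionnaire.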
As these models progress, we will see if they can break out of the narrow ranges of highly specialized tasks and scripted interactions into the realm of broader control of complex systems.
Interactive Information Access
One of the most natural ways in which people use language is sharing information. The dynamic of this kind of sharing is a back and forth where small bites of information are given in response to questions, comments or requests for clarification.
Conversations with our doctors, financial advisors or even our bosses during performance reviews are all about the unfolding of information that is important and impactful. There is no “task” such as ordering groceries that we are trying to accomplish. Instead we are trying to get to the information that is meaningful to us in a way that allows us to best understand it.
This kind of interaction defines the final type of conversational interface that is on the horizon. Conversational interaction with systems that have access to data about our world will allow us to understand the status of our jobs, our businesses, our health, our homes, our families, our devices and our neighborhoods.
The list is endless.
Going well beyond search, these systems combine data analytics, which determines the facts defined by the data, with natural language generation, which produces a more human interaction. And unlike most other conversational systems, these systems actually know what you are asking about because they know what they are talking about.
While these systems are more ambitious than narrow task-focused or search systems, the substrate for interactive information access already exists in the form of data-driven, advanced natural language generation systems. These systems already map data to meaning and language in order to generate their narratives. Using them for interactive information access comes down to having them wait for questions that they can answer rather than generating complete documents, which is how they are used today.
It is the difference between getting a report and having a conversation. The information is the same but the interaction is more natural.
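The contrast between a report and a conversation can be illustrated with a toy example. Everything in it is assumed for the sake of the sketch: the sales figures are made up, the function names are hypothetical, and the keyword matching in `answer` is a crude stand-in for real language understanding. The point is only that every sentence is computed from the data, whether it is delivered all at once or one question at a time.

```python
import re

# Hypothetical quarterly sales figures; a real system would read these from live data.
SALES = {"Ana": 120_000, "Ben": 95_000, "Chloe": 143_000}

def top_seller():
    """State who has the highest sales, derived from the data."""
    name = max(SALES, key=SALES.get)
    return f"{name} is leading the team with ${SALES[name]:,} in sales."

def team_total():
    """State the combined sales of the whole team."""
    return f"The team has sold ${sum(SALES.values()):,} in total."

def individual(name):
    """State one person's sales."""
    return f"{name} has sold ${SALES[name]:,}."

def full_report():
    """The 'report' mode: say everything at once."""
    return " ".join([top_seller(), team_total()] + [individual(n) for n in SALES])

def answer(question):
    """The 'conversation' mode: say only the fact that was asked about."""
    q = question.lower()
    words = set(re.findall(r"[a-z]+", q))
    for name in SALES:
        if name.lower() in words:
            return individual(name)
    if "best" in words or "top" in words or "doing well" in q:
        return top_seller()
    if "total" in words:
        return team_total()
    return "I don't have data on that."
```

Asking `answer("Who is doing well on my sales team?")` yields only the sentence about the top seller, while `full_report()` strings the same sentences together into a document. The underlying facts are identical in both cases.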
Different Systems for Different Tasks
Each of these approaches to conversational interfaces has different strengths and weaknesses, giving it a different role in the chat ecosystem. When considering what model might suit your needs, whether you’re building a model or instructing a team to build or buy one, you need to first consider the nature of the task you want to support.
If you want to support tasks like ordering, planning or arranging complex systems, you do not want to bring in a search or chat-as-chat system. Likewise, if you want to support access to data-driven information, you do not want to choose either of the task-focused models. You want to ensure you aren’t trying to modify a chat-as-chat system that has learned to be a misogynist to help you with your financial planning or explain who is doing well on your sales team.
You want to have a conversation with a system that knows what it is talking about.
This article is published as part of the IDG Contributor Network.