Q&A: Google's Alfred Spector on the hot seat

Google's VP of research talks about where search is headed, what difference it will make and what makes his company tick.

What kinds of things are coming out of Google research? Google Search by Voice is an example. Wouldn't it be nice to tell a handheld device what you want to know? It's an interesting problem in that one needs to recognize spoken language, correlate it with the most likely search queries, and then show the search results -- doing all this with an intuitive, speedy user interface. This work comes out of our basic research efforts in speech recognition. The Google Mobile App with voice search has been on the iPhone for a few months, and we just released Google Search by Voice on Android [Google's open-source software for mobile devices].

What longer-range projects are you working on? With all the data on the Web, shouldn't we be able to take that information and create a database of concepts -- or entities -- and the relationships between them? For example, consider the "is a" relationship: a dog is a pet; a son is a boy. In AI, it was often assumed that these relationships had to be taught to a system by experts. The question we have is, can we learn all these things from a huge amount of interaction with a very large corpus of information? If so, we could codify and structure significant aspects of knowledge, and the system could automatically glean many kinds of information. It's a very long-range effort.
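One classic way to mine "is a" relationships from raw text is to match simple lexical patterns of the form "X is a Y." The sketch below is a deliberately minimal illustration of that idea, not Google's method; the corpus and the regular expression are stand-ins, and a real system would need far more patterns plus statistical filtering over a vast corpus.

```python
import re

# Hypothetical mini-corpus standing in for "a very large corpus of information".
corpus = (
    "A dog is a pet. A cat is a pet. A son is a boy. "
    "Paris is a city. A beagle is a dog."
)

def extract_is_a(text):
    """Extract (entity, category) pairs from simple 'X is a/an Y' patterns."""
    pairs = set()
    for subj, obj in re.findall(r"(?:A|An|The)?\s*(\w+) is an? (\w+)", text):
        pairs.add((subj.lower(), obj.lower()))
    return pairs

relations = extract_is_a(corpus)
# relations now includes pairs such as ("dog", "pet") and ("beagle", "dog").
```

Chaining such pairs ("beagle is a dog," "dog is a pet") is what would let a system generalize from subcategories up to broader concepts.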

What would you do with that database of relationships? Let's imagine our search software is responding to a query on pets, and we find articles on dogs and cats that never use the word pets. This database of relationships would let Google know that an article is probably about pets because it contains multiple instances of subcategories of "pet." The database would enable much better search and better language translation because there'd be a better understanding of the meaning of the words.

We believe it may be possible to build up these huge sets of concepts and the relationships between them. You could gain two benefits: more-focused results and probably also results that wouldn't otherwise be found.
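The inference described above can be sketched in a few lines: if a document mentions several subcategories of the same concept, guess that the concept is a topic of the document. The "is a" table here is a tiny hand-built stand-in for the learned database the article envisions, and the threshold of two matches is an arbitrary illustrative choice.

```python
# Hypothetical hand-built "is a" table; the article envisions learning
# this automatically from a very large corpus.
IS_A = {
    "dog": "pet",
    "cat": "pet",
    "hamster": "pet",
}

def infer_topics(words, threshold=2):
    """A category counts as a topic if multiple of its subcategories appear."""
    counts = {}
    for w in words:
        cat = IS_A.get(w.lower())
        if cat:
            counts[cat] = counts.get(cat, 0) + 1
    return [c for c, n in counts.items() if n >= threshold]

doc = "my dog chases the neighbor's cat around the yard".split()
topics = infer_topics(doc)   # the document mentions two kinds of pet
```

A document mentioning only one subcategory would yield no topic, which is the conservative behavior a search engine would want before broadening results.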

What are some other advances coming in search technology? We increasingly are interested in returning not just Web pages, but multiple forms of information: images, books, blog entries, videos, tables, fax, etc. With this type of "universal search," for example, a picture might be the best thing to return. But it could be a table, an audio file, etc.

We are also interested in allowing new types of input. Perhaps, rather than a text string, a query could be made by referring to an image one has already found. This requires solving complex image-recognition problems.

Can search do a better job of tailoring results to individual users? It's clearly an interesting goal for a system to take into account our interests, what we already know and, perhaps, even how we individually learn. We experiment and think about those questions a lot. But it can be tricky. For example, if I'm doing a medical query, should it be biased [toward] what I might be worried about as a middle-aged male, or might I be concerned about my mom?

Do you have plans to go after that huge body of information on the Internet that is not currently searched? There is stuff on the Web, the so-called Deep Web, that is only "materialized" when a particular query is given by filling fields in a form. Since crawlers only follow HTML links, they cannot get to that "hidden" content. We have developed technologies to enable the Google crawler to get content behind forms and therefore expose it to our users. In general, this kind of Deep Web tends to be tabular in nature. It covers a very broad set of topics. It's a challenge, but we've made progress.
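One published approach to surfacing the Deep Web is to precompute result-page URLs by enumerating promising combinations of form-field values, so an ordinary crawler can then fetch them like any other link. The sketch below illustrates that enumeration for a GET form; the URL and field values are made up, and a real system must first decide which fields and values are worth submitting.

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical used-car search form with two select fields (made-up URL).
FORM_ACTION = "http://example.com/cars"
FIELDS = {
    "make": ["honda", "toyota"],
    "year": ["2007", "2008"],
}

def surface_urls(action, fields):
    """Enumerate crawlable GET URLs for every combination of field values."""
    names = sorted(fields)
    urls = []
    for combo in product(*(fields[n] for n in names)):
        urls.append(action + "?" + urlencode(dict(zip(names, combo))))
    return urls

urls = surface_urls(FORM_ACTION, FIELDS)   # four URLs, one per combination
```

Each generated URL "materializes" one slice of the hidden tabular data, which is why this kind of content tends to come back as tables.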

What's Google's computer infrastructure like? Google uses what is now termed "cloud computing." We have numerous clusters, each containing large numbers of computers. The clusters run a distributed computing infrastructure that uses Linux on each computer. All the computers are then tied together with high-performance networking and distributed computing software. For example, we have built and deployed a global file system called the Google File System that provides scalable, fault-tolerant storage; a record-oriented data storage system for tabular data called BigTable; and a computational programming model called MapReduce that allows our batch jobs to use the inherent parallelism in our clusters.
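The MapReduce model mentioned above can be illustrated with a toy, single-process word count: the programmer supplies a map function and a reduce function, and the framework handles grouping intermediate keys (and, at Google scale, distribution across clusters and fault tolerance). This is a teaching sketch of the programming model, not Google's implementation.

```python
from collections import defaultdict

def map_fn(document):
    """Map phase: emit (word, 1) for every word in a document."""
    for word in document.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce phase: sum all counts emitted for one word."""
    return word, sum(counts)

def run_mapreduce(documents, map_fn, reduce_fn):
    groups = defaultdict(list)          # the "shuffle" step: group by key
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

result = run_mapreduce(["the dog", "the cat the bird"], map_fn, reduce_fn)
# result counts each word across both documents
```

The appeal of the model is that map and reduce are independent per key, so the same two functions run unchanged whether the input is two strings or a cluster-sized corpus.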

[As for] the exact number of machines, locations and clusters we have, suffice it to say that we have so many individual elements in our fabric that an enormous amount of attention is paid to fault tolerance, because with so many elements operating, there are exceedingly frequent component failures.

Could other companies emulate that kind of architecture? First, there really are economies of scale in running systems that can support many services on a common fabric. Second, relating to the services model we espouse, there are great simplifications to releasing software as a Web-based service, because services don't have to be tested and deployed on a large number of different customer environments. Instead, software can be released to a small number of machines in a more controlled cloud and then accessed by browsers.

A third benefit is that since a software service is a logically centralized notion, the history of interactions of very many users can be aggregated and thus be the basis for various types of self-learning systems. Google uses this concept to learn to correct spelling mistakes, but businesses can use similar notions to better meet the needs of employees or customers by learning, for example, of common errors, unfulfilled product searches, etc.
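One way aggregated interactions can drive spelling correction is to look at query reformulations: if many users retype a query as a nearly identical string, the rewrite is probably a correction. The sketch below illustrates that intuition on a hypothetical log; the log pairs, the similarity measure (Python's difflib ratio), and the 0.8 threshold are all illustrative assumptions, not Google's algorithm.

```python
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical aggregated log of (query, immediate reformulation) pairs
# collected across many users.
LOG = [
    ("recieve", "receive"), ("recieve", "receive"), ("recieve", "recipes"),
    ("speling", "spelling"), ("speling", "spelling"),
]

def learn_corrections(log, min_similarity=0.8):
    """Map each query to its most frequent near-identical reformulation."""
    best_by_query = {}
    for (q, r), n in Counter(log).items():
        # Keep only rewrites that are almost the same string: likely typo fixes.
        if SequenceMatcher(None, q, r).ratio() >= min_similarity:
            current = best_by_query.get(q)
            if current is None or n > current[1]:
                best_by_query[q] = (r, n)
    return {q: r for q, (r, _) in best_by_query.items()}

corrections = learn_corrections(LOG)
```

Note that "recipes" is dropped despite following "recieve" in the log: it is too dissimilar to be a typo fix, which is exactly the filtering a real system would need to separate corrections from topic changes.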

Copyright © 2009 IDG Communications, Inc.
