What China's supercomputing push means for the U.S.

China is developing its own software and building its own infrastructure to create a tech industry, says a top computer scientist at the DOE's Argonne National Laboratory

Page 2 of 2

The new Chinese system will use 24 MW (megawatts) at peak when cooling is considered. What are your observations about its power use? That's an awful lot. The raw number is staggering when you think it's about $1 million per year per megawatt. That machine at peak would run $24 million a year in electricity. The goal for exascale is in the 20-30 MW range. In some sense, this shows that if we do nothing, we're stuck at this power rating.
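The arithmetic behind that figure can be sketched in a few lines. This is only an illustration: it assumes the machine draws its full 24 MW around the clock, and the function names are invented for this example. The only numbers taken from the interview are the 24 MW peak and the rule of thumb of about $1 million per megawatt per year.

```python
# Rule of thumb quoted in the interview: about $1 million per megawatt per year.
COST_PER_MW_YEAR_USD = 1_000_000


def annual_power_cost(megawatts: float) -> float:
    """Yearly electricity bill, assuming the system runs at this draw all year."""
    return megawatts * COST_PER_MW_YEAR_USD


def implied_price_per_kwh() -> float:
    """Electricity price per kWh that the rule of thumb works out to."""
    hours_per_year = 24 * 365              # 8,760 hours
    kwh_per_mw_year = 1000 * hours_per_year  # 1 MW = 1,000 kW
    return COST_PER_MW_YEAR_USD / kwh_per_mw_year


if __name__ == "__main__":
    print(f"24 MW for a year: ${annual_power_cost(24):,.0f}")
    print(f"Implied price: ${implied_price_per_kwh():.3f} per kWh")
```

Under these assumptions the rule of thumb corresponds to a little over 11 cents per kilowatt-hour, and the same function gives $20 million to $30 million a year for the 20-30 MW exascale target.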

Is there any agreement about how to lower power? There are several promising avenues. One is the integration of memory on the chip. Right now, memory accounts for a healthy fraction of that power, and having it external to the CPU wastes power. Pulling that memory onto the CPU, with 3D chip stacking or other techniques, will make a big dent.

The other promising technology: Right now, all of our system memory is RAM, and RAM is very inefficient in terms of power. Several companies are developing technologies that could use NVRAM (non-volatile RAM) instead. It might not be quite as fast as RAM, but the power difference is spectacular. With that in mind, you can imagine developing systems in the future where some fraction of the memory is actually NVRAM, a smaller fraction of overall memory is RAM, and we get a big power savings. But the thing that we haven't really tapped into at all is managing power as a resource from the software. We just don't have a way right now to automatically move power up or down to take advantage of processors being idle or not idle in a large HPC computation. So there are a lot of software changes that have to happen.

How will software power management work? Google just wrote a paper, The Tail at Scale. When you do a Google search, it is searching several different servers for little bits of information that are then all pulled together, and that result is then sent back to you. So let's say that there are 20 machines that have to be touched, and a little bit of data from each of the search pieces is assembled and sent back to you. If one of those machines, and this is the part about the tail, comes back with an answer in a slightly longer time, the end result of the query takes as long as the longest component. That's frustrating. We find that a little bit interesting, because [Google has] rediscovered what in high-performance computing we have known for a couple of decades, which is this concept of bulk synchronous computation, where you send out hundreds of thousands of tiny work objects to be done, one on each CPU, and if any one of those hundreds of thousands of chips runs slower, any one of them, then your result is as slow as the slowest one.
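A minimal sketch of that "as slow as the slowest one" rule follows. The function names and the probabilities are hypothetical, chosen only to illustrate the point, and the sketch assumes each machine is slow independently of the others.

```python
def step_time(worker_times):
    """A fan-out query or bulk synchronous step finishes when its slowest worker does."""
    return max(worker_times)


def prob_any_straggler(n_workers, p_slow):
    """Chance that at least one of n independent workers is slow.

    p_slow is an assumed per-worker probability of a slow response.
    """
    return 1.0 - (1.0 - p_slow) ** n_workers


if __name__ == "__main__":
    # 19 machines answer in 1.0 time units, one takes 1.5: the query takes 1.5.
    print(step_time([1.0] * 19 + [1.5]))

    # Assumed 1% chance that any single machine is slow on a given request.
    print(f"20 machines:      {prob_any_straggler(20, 0.01):.1%}")
    # Assumed 0.001% chance per CPU, but 100,000 CPUs in the step.
    print(f"100,000 workers:  {prob_any_straggler(100_000, 0.00001):.1%}")
```

With these made-up rates, a 20-machine query hits at least one slow machine about 18% of the time, and a 100,000-CPU step hits one about 63% of the time even though each CPU is slow only once in 100,000 tries. Rare delays become routine at scale.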

Let's say you paid $100 million for your machine, and you have all of those CPUs working hard on your problem. If one of them is slightly slower, then it's like degrading the value of your machine by 50% or more. That's how we do many of the computations right now. In terms of power management, the compiler, the code, and the runtime system have to cooperate in deciding when we can speed processors up and when we can slow them down. It can't be a self-deciding component.
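One way to put a number on that lost value is to measure how much of the machine's time goes to useful work while everyone waits for the slowest processor. The sketch below is an illustration with hypothetical timings, not a figure from the interview; `utilization` is a name made up for this example.

```python
def utilization(worker_times):
    """Fraction of total processor time spent working rather than waiting.

    Every worker is held until the slowest one finishes, so the machine is
    occupied for len(worker_times) * max(worker_times) processor-time units.
    """
    slowest = max(worker_times)
    return sum(worker_times) / (len(worker_times) * slowest)


if __name__ == "__main__":
    n = 100_000
    # All processors finish in 1.0 time units except one laggard.
    ten_percent_slower = [1.0] * (n - 1) + [1.1]
    twice_as_slow = [1.0] * (n - 1) + [2.0]

    print(f"one CPU 10% slower: {utilization(ten_percent_slower):.1%} useful")
    print(f"one CPU 2x slower:  {utilization(twice_as_slow):.1%} useful")
```

Under these assumed timings, a single processor that takes twice as long leaves the other 99,999 idle for half of the step, so roughly half of the machine's capacity is wasted.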

What you try to do is make sure all the processors run at exactly the same speed, and they always return the answer at the same speed, so you don't have any lagging, slow processor, or you try to cull [the laggards] out before they even run. Sometimes there are ways to determine that there are parts of a machine that aren't running as fast. But sometimes it's not so easy to do that.

With the size of memory that we have today, some part of your machine is likely to be correcting a single-bit error at any given moment. Single-bit errors can be detected and corrected automatically, but that still takes a few CPU cycles, which means that processor is still going to be late, just a fraction, to the computation because it had to clean up this fault. As we move to lower power, we also recognize that faults go up. The closer you are to operating at the jagged edge, the threshold of computing, the more noise there is in the system, and therefore the more faults there will be. This issue is quite a complex one.

What impact do you think China's new system will have, or should have on exascale development in the U.S.? My personal hope is it is a demonstrator of how hard work and investment in technology is important to China, and how that should be important to the U.S. as well.

It isn't just exascale. It's this notion that cutting-edge large science systems in computation drive a lot of research and a lot of industry. Our investment in this space is really key to remaining competitive and being the innovators of this space. One of the things that's interesting about China's announcement, in my opinion, is they geared up this company, Inspur, to sell these machines inside China. They are building the infrastructure to churn out these systems within China and the question is then, who is next? Will they be shipping any to India? Will they eventually have the expertise to ship these to Brazil and to other countries?

So in sum, is it correct to say that China is accomplishing multiple things here: They are getting their science together, fueling a new IT industry, and are potentially creating new exports? It's exactly that. They are designing their own chips. They have geared up a set of students and professors, industry, and semiconductor companies to build this infrastructure. What about the software? They are not going to download software from around the world. They are designing teams to build the software. Are they preparing to export this system? You bet. They aren't just building this in the university, they've included this company, and that company will then be able to make multiple versions of this.

This article, What China's supercomputing push means for the U.S., was originally published at Computerworld.com.

Patrick Thibodeau covers cloud computing and enterprise applications, outsourcing, government IT policies, data centers and IT workforce issues for Computerworld. Follow Patrick on Twitter at @DCgov or subscribe to Patrick's RSS feed. His e-mail address is pthibodeau@computerworld.com.

See more by Patrick Thibodeau on Computerworld.com.

Copyright © 2013 IDG Communications, Inc.
