Hard Cores

Re-engineering programs to work on multicore chips is already difficult but will get even harder as the number of processors continues to multiply.

Putting two or more processor cores on a single silicon chip has been one of the most important milestones in computing in recent years. It allows users to continue to reap the benefits of Moore's Law while sidestepping the extreme difficulty of manufacturing, powering and cooling single microprocessors beyond 4 GHz. Chip multiprocessors (CMPs) also offer the opportunity to significantly boost the performance of applications that can spread their work across the cores.

But the benefits of parallel processing don't come easily. Programmers have to behave differently, as do compilers, languages and operating systems. If application software is to reap the benefits of CMPs, new skills, techniques and tools for designing, coding and debugging will be needed. Fortunately, both hardware and software vendors are developing tools and methods to make the job easier.

"Multicore chips are going to be a challenge for software developers and compiler writers," says Ken Kennedy, a computer science professor at Rice University in Houston who specializes in software for parallel processing. "If you look at chip makers' road maps, they are doubling cores every couple of years, sort of on a Moore's Law basis, and I'm worried we are not going to be able to keep up."

Desktop applications that traditionally have been written for one processor will increasingly be written to exploit the concurrency available in CMPs. Meanwhile, server applications that have for years been able to use multiple processors will be able to distribute their workloads more flexibly and efficiently. Virtualization, another important trend in computing today, will be made easier by CMPs as well.

Keeping up with CMPs is the focus of intense activity at a number of companies, including Microsoft Corp. Researchers there who are developing CMP tools are focusing on two broad areas: how to find errors in code written for multiple processors, and how to make it easier to write reliable software in the first place.

"A lot of the techniques we have used with sequential code don't work as well, or at all, with parallel programs," says Jim Larus, manager of programming languages and tools at Microsoft Research. "In testing, you typically run your program with a lot of data, but with parallel programs, you could run your program 1,000 times with the same data and get the right answer, but on the 1,001st time, an error manifests itself."

This ugly trait results from "race" conditions in parallel code, in which one process is expected to finish in time to supply a result to another process -- and usually does. But because of some anomaly such as an operating system interrupt, occasionally it does not. Such bugs can be extremely hard to find because they are not readily reproducible.
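
As a minimal illustration -- not taken from Microsoft's tools -- the following C++ sketch shows how such a bug can hide: two threads increment a shared counter with no synchronization, and while most runs happen to produce the expected total, an unlucky interleaving occasionally loses updates.

    #include <cstdio>
    #include <thread>

    int counter = 0;  // shared data with no synchronization

    void work() {
        for (int i = 0; i < 100000; ++i)
            ++counter;  // read, add, write: three steps that can interleave
    }

    int main() {
        std::thread t1(work), t2(work);
        t1.join();
        t2.join();
        // Expected 200000; an occasional run prints less because one
        // thread's update overwrites the other's.
        std::printf("counter = %d\n", counter);
    }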

The tools Larus' group is developing allow more controlled testing so a programmer can, for example, vary the timing of two threads to check for race errors. The tools will eventually be offered commercially as part of Visual Studio, Larus says, "but we have a long way to go."

Microsoft Research is also trying the KISS -- or "keep it strictly sequential" -- model. KISS transforms a concurrent program into a sequential one that simulates the execution of the concurrent program. The sequential program can then be analyzed and partially debugged by a conventional tool that only needs to understand the semantics of sequential execution.
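
The details of KISS belong to Microsoft Research, but the underlying idea can be sketched: break each thread into explicit steps at its shared-memory accesses and replay one possible interleaving as an ordinary sequential program, which a conventional sequential checker can then examine. The example below is only an illustration of that idea, not the actual tool.

    #include <cassert>
    #include <functional>
    #include <vector>

    int balance = 100;   // shared account balance
    int read_a = 0, read_b = 0;

    int main() {
        // Two would-be threads, each split into steps at its shared-memory
        // accesses, scheduled here as one fixed sequential interleaving.
        std::vector<std::function<void()>> schedule = {
            [] { read_a = balance; },        // thread A reads the balance
            [] { read_b = balance; },        // thread B reads the balance
            [] { balance = read_a - 10; },   // thread A withdraws 10
            [] { balance = read_b - 10; },   // thread B withdraws 10 (lost update)
        };
        for (auto& step : schedule)
            step();                          // purely sequential execution

        // A sequential analysis (here just an assertion) exposes the bug:
        // two withdrawals of 10 should leave 80, but this interleaving leaves 90.
        assert(balance == 80);
    }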

Microsoft and others are also working on a new programming model called software transactional memory, or STM. It's a way to correctly synchronize memory operations without locking -- the traditional way to avoid timing errors -- so that problems such as deadlocking are avoided. STM treats a memory access as part of a transaction, and if a timing conflict occurs with some other operation, the transaction is simply rolled back and tried again later, similar to the way today's database systems work.

"The idea is that the programmer, instead of specifying at a very low level how to do this synchronization, basically says, 'All the code between this point in the program and this other point, I want to behave as if it were the only thing accessing data at this time. System, go make that happen,'" says Larus.

STM -- "a really hot research topic these days" -- may someday be implemented in a combination of hardware and software, says Larus. In the meantime, programmers will have to use fine-grained locking -- in which individual rows or elements of a table are locked, rather than the whole table -- to ensure correct synchronization in parallel programs. The more parallel threads there are, the more difficult that becomes.

Microsoft products won't require significant changes to scale from two processors (or processor cores) to four or eight processors, other than perhaps some performance tuning, according to Larus. "But when you start getting to bigger-scale machines, the question becomes, What are the bottlenecks?" he says. "If you have more processors, you have to have increasingly fine-grained locking."
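
A small sketch of the difference (the names are illustrative, not from any product): with coarse-grained locking every update serializes on one table-wide mutex, while fine-grained locking gives each row its own mutex, so threads touching different rows never wait on each other.

    #include <mutex>
    #include <thread>

    constexpr int kRows = 1024;

    struct Row {
        std::mutex lock;   // one mutex per row: the fine-grained part
        int value = 0;
    };

    Row table[kRows];
    std::mutex whole_table_lock;   // the coarse-grained alternative

    void add_fine(int row, int amount) {
        std::lock_guard<std::mutex> guard(table[row].lock);   // locks only this row
        table[row].value += amount;
    }

    void add_coarse(int row, int amount) {
        std::lock_guard<std::mutex> guard(whole_table_lock);  // serializes the entire table
        table[row].value += amount;
    }

    int main() {
        std::thread t1(add_fine, 0, 5), t2(add_fine, 1, 7);   // different rows: no contention
        t1.join();
        t2.join();
    }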

Mind the Memory

Aachen, Germany-based MainConcept AG develops software for encoding and decoding signals such as high-definition video, and it is an accomplished practitioner of such fine-grained locking. Video processing is a computational challenge; high-definition movies have to be processed in real time, and each frame takes up 1MB of memory, with each slice of the frame requiring extensive mathematical manipulation.

MainConcept has tuned its software to run on systems with dual-core chips, with the cores working on frame slices in parallel. The company has seen performance improve by a factor of 1.8, says MainConcept CEO Markus Monig, and by another factor of 1.8 through the use of two dual-core processors -- a combined gain of roughly 3.2 on four cores. Such near-linear speedups are close to the theoretical maximum of 2 for each doubling of cores; any gain above 1.5 per doubling is generally considered good.

The software uses and searches "huge areas of memory," Monig says. If the software is carefully constructed, it can use on-chip cache memory for much of its work, speeding processing. MainConcept relies on performance-tuning tools from Intel Corp. to match the software to the hardware architecture: Intel's VTune Performance Analyzer helps optimize the code, and its Thread Profiler and Thread Checker help balance the work of multiple threads and identify bottlenecks in multithreaded code.

But Monig worries that he won't be able to boost performance linearly as the number of processor cores increases. "We don't expect this for eight-core, 16-core and beyond," he says. "The faster and the more cores there are, the more the memory access is the bottleneck."

Code writers trying to exploit multiple processors or processor cores face three challenges, says James Reinders, director of business development for Intel's software development products. The first is scalability -- how to keep each additional processor busy. A threefold performance boost on a four-processor system is "darn good," he says; anything more is "exceptional."

The second challenge is "correctness" -- how to avoid race conditions, deadlocks and other bugs characteristic of multiprocessor applications. Intel's Thread Checker can find threads that share memory but do not synchronize, which, he says, "almost always [indicates] a bug."

The third challenge is "ease of programming," Reinders says. Modern compilers can help by finding and exploiting opportunities for parallel processing in source code. The programmer can help the compiler by including "a few little hints" in the code, he says.

These "hints" are available in a new standard called OpenMP, specifications for compiler directives, library routines and environment variables that can be used to specify parallelism in Fortran, C and C++ programs. "The alternative to using these extensions is to do threading by hand, and that takes some clarity of thought," Reinders says. "So OpenMP can be tremendously helpful."

Kennedy agrees. "My philosophy is the programmer should write the program in the style that's most natural, and the compiler should recognize the properties of the chip that have to be exploited to get reasonable performance," he says.

Tom Halfhill, an analyst for In-Stat's "Microprocessor Report" in San Jose, says some software developers are "tearing their hair out" over the new CMP systems. "Rewriting the software for multithreading is a lot of work, and it introduces new bugs, new complexities, and the software gets bigger, so there is some resistance to it."

He says Fortran and C++ don't contain parallel constructs natively, whereas Java does, so the move to CMP may boost Java's fortunes.

But the CMP train has left the station, whether software developers like it or not. Intel says 85% of its server processors and 70% of its PC processors will be dual-core by year's end. Halfhill predicts that in five years, microprocessor chips in servers will have eight to 16 cores, and desktop machines will have half that number. And, he says, each core will be able to process at least four software threads simultaneously, a technique Intel calls Hyper-Threading.

The angst today over optimizing software for CMPs is a little like that of 20 years ago when developers obsessed about the amount of memory and disk space available, says Halfhill. Now both resources are so cheap and plentiful that most applications just assume that they will get whatever they need.

"In five to 10 years, we'll get to the same place with processor cores," Halfhill predicts. "There will be so many that the operating system will just dedicate how many cores the application needs, and you won't worry if some cores are being wasted."

Up Next

Intel's Reinders says CMPs will give a boost to hardware virtualization -- by which a computer is made to run multiple operating systems -- with CMPs allowing for a more fine-grained partitioning of a machine. It is possible to carefully control and allocate processing resources by specifying, for example, that a certain application may use two cores and no more, while some higher-priority application gets four cores. "If you map virtualization onto individual cores, you can get more predictable response," he says.

CMPs offer performance advantages over systems with multiple, separate processors, because interprocessor and processor-memory communication is much faster when it's on a chip. Rice University's Kennedy predicts that will lead to hybrid systems consisting of clusters of computers running multicore processors. "Then you have two kinds of parallelism: cross-chip parallelism, perhaps with message passing and a shared memory, and on-chip parallelism," he says. Then, functions that require very high interprocessor bandwidth can be put on a CMP, and those that don't can be distributed across the cluster. Various types of transaction processing and database systems could make good use of such an architecture, Kennedy says.

While everyone agrees that more processors, more cores and more power can generally be put to good use in big enterprisewide systems, the future of CMPs on desktops and laptops -- where even single-core processors are idle much of the time -- is not quite so clear. Multithreaded game software can put the parallelism to good use, and so perhaps can a few specialized applications, such as speech recognition.

Single-processor-core PCs today can take advantage of multitasking, in which one thread, for example, deals with display while another does a long-running computation and another goes out to a server. But what to do with eight processor cores all running at 3.6 GHz?
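
That familiar single-core pattern is easy enough to express with threads -- the sketch below (placeholder function bodies, not a real application) shows the shape of it -- but three such threads come nowhere near occupying eight cores.

    #include <chrono>
    #include <thread>

    void refresh_display()   { std::this_thread::sleep_for(std::chrono::milliseconds(16)); }
    void long_computation()  { volatile double x = 0; for (int i = 0; i < 1000000; ++i) x += i; }
    void fetch_from_server() { std::this_thread::sleep_for(std::chrono::milliseconds(50)); }

    int main() {
        std::thread ui(refresh_display);        // keep the display responsive
        std::thread compute(long_computation);  // long-running calculation
        std::thread network(fetch_from_server); // talk to a server
        ui.join();
        compute.join();
        network.join();
    }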

Microsoft's Larus says he knows people are probably having trouble imagining how a single user might take advantage of that kind of system. "To be honest, so are we," he says. "This is a subject of very active discussion here."

The Many Facets of Multiprocessing

Intel's Pentium Processor Extreme Edition uses two processor cores, each with its own on-chip cache and each running at the same speed. Using Intel's Hyper-Threading Technology, each core functions as two logical processors, enabling four-thread functionality, in this example balanced between integer and floating-point arithmetic. The processor can run multiple applications simultaneously with background tasks such as real-time security and system maintenance. The chip also can use Intel's Virtualization Technology to run multiple operating systems and/or applications in independent partitions.


Source: Intel Corp.

Copyright © 2006 IDG Communications, Inc.
