What will YOU do with 100 cores?

There is an inflexion point approaching the computing industry. Consider the following statement and see if it could apply to your business.

"Any software business or embedded computing provider that derives competitive advantage from the execution speed of a single thread of code is at risk."

Why is this true and what does it mean?

Most software is written in a sequential programming language like C, and most secondary and tertiary computing courses teach this style of programming. It is rapidly becoming out of date, and many of the trillions of lines of software developed in the past may need to be re-written or will become obsolete.

Multi-core processors are about to change everything. Processors have taken a dramatic shift in architecture over the past few years from superscalar to multi-core. A multi-core machine is essentially a parallel processing architecture known as Multiple Instruction Multiple Data (MIMD). This architectural trend started with the Intel Core Duo and has continued to Opteron, PowerXCell, UltraSparc, Cortex, SuperH and so on. All of these processor platforms are shipping multi-core chips. This trend will continue over the next decade. At the IEEE ISSCC conference in Feb 2010, Intel announced a 48-core processor chip. By 2015 we will likely have over 100 cores on a “many-core” processing chip in our notebook computers.

What will you do with 100 cores? Not much, is my guess. If you open the Performance tab in the Task Manager of your Core i7-based computer you may notice that most of the 8 CPU threads are underutilised. This is because most of the software written for a PC executes as a single sequential thread on one of the processors. In a 100 core system, such software will have access to at most 1 per cent of the available computing power of the machine.

Yet as customer problem sets continue to grow, the computation needed to process them will also grow, and execution times will therefore get longer. Even though a 100 core computer will have roughly 25 times the processing capability of today’s quad-core machine, a single-threaded program will actually take longer to run on those larger problem sets unless it is re-written.

A large amount of software will need to be re-written for multi-threaded execution to exploit multi-core systems. This is a difficult undertaking because there is a shortage of people who know how to do it. At the Intel Developers Conference in 2009, Intel estimated that fewer than 2 per cent of the world’s programmers understand multi-threaded programming and most of these are developing computer games.

Multi-threaded programming remains a graduate-level course in many degree programs and there are few tools available to assist with the re-architecture of legacy software. It follows that even those organisations with the most adaptive of strategies may not be able to find or train the talent needed to re-architect their software in time. The CAD software industry is an example of one that could be redefined by the proliferation of multi-core computing.

Why we are where we are

The transition to multi-core has occurred due to the escalating power consumption of advanced processors hitting a ceiling of 130 Watts.

The chart below shows the power consumption of Intel processors since 1970. With the Itanium, Intel processors peaked at 130 Watts. The x86-64 compatible processor architectures have been optimised over many years to extract the maximum performance from a single thread of sequential code, using techniques such as multiple on-chip phase-locked loops, Reduced Instruction Set Computing (RISC) architectures, fine-grained pipelining, branch prediction and speculative execution.

When processors hit the 130W power ceiling, the only way to get more performance (and continue to deliver to Moore’s Law) is to place several cores on the chip. Furthermore, the cores have hit a peak clock rate of 4GHz (called the Clock Wall).

This is partly because large synchronous digital chips in 32nm CMOS are not much faster than those in 45nm and 65nm. One reason is that the clock requires increased overhead in deep sub-micron technology nodes.

Typically, about 10 per cent of the clock period is overhead to cover variance of device parameters and operating conditions. In technology nodes below 45nm, the increased variance of device parameters has required an increase in the amount of overhead added to the clock. So while typical device performance may improve, the overall clock period does not. Therefore, the performance available from any single core has essentially peaked – and herein is the problem for much of today’s software.
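The arithmetic behind this can be sketched with a very simple model. It assumes the clock period equals the logic delay divided by one minus the overhead fraction. Only the 10 per cent overhead and the 4GHz clock come from the figures above; the 225ps and 200ps logic delays and the 20 per cent overhead for the newer node are made-up illustrative values, not measurements.

```python
def clock_period_ps(logic_delay_ps, overhead_fraction):
    """Clock period when overhead_fraction of the period is reserved as margin.

    Assumed model: period = logic_delay + overhead_fraction * period.
    """
    return logic_delay_ps / (1.0 - overhead_fraction)


def frequency_ghz(period_ps):
    """Convert a clock period in picoseconds to a frequency in GHz."""
    return 1000.0 / period_ps


# Illustrative values only (hypothetical, not process data).
older_node = clock_period_ps(logic_delay_ps=225.0, overhead_fraction=0.10)
newer_node = clock_period_ps(logic_delay_ps=200.0, overhead_fraction=0.20)

if __name__ == "__main__":
    print(f"older node: {older_node:.1f} ps, {frequency_ghz(older_node):.2f} GHz")
    print(f"newer node: {newer_node:.1f} ps, {frequency_ghz(newer_node):.2f} GHz")
```

In this sketch the newer node has faster logic, yet both nodes end up with a clock period of about 250ps, because the extra margin absorbs the gain.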

There are considerable research efforts underway to develop tools that assist with mapping programs written in “C” to parallel processors. This is a very difficult problem. The Wikipedia page on Automatic Parallelization explains why this is a difficult and as-yet unsolved problem:

“The goal of automatic parallelization is to relieve programmers from the tedious and error-prone manual parallelization process. Though the quality of automatic parallelization has improved in the past several decades, fully automatic parallelization of sequential programs by compilers remains a grand challenge due to its need for complex program analysis and the unknown factors (such as input data range) during compilation.”

Researchers are now shifting attention to new languages that enable the description of the concurrency of the problem.

At NICTA, we are taking the approach of building domain-specific languages on top of a functional programming paradigm. Unlike imperative languages (e.g. C), a functional language requires the user to explicitly specify the relationship between computations.

This enables a compiler to generate a parallel implementation automatically, mapping concurrent threads onto multiple cores to achieve speed-up. Indeed, NICTA’s Scalable Vision Machines project is developing a domain-specific language for describing computer vision and real-time image processing algorithms, and aims to deploy their implementations automatically to the heterogeneous processing architectures typically found in smart IP camera systems.
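To see why explicit relationships between computations help, consider a minimal Python sketch. It is not NICTA’s language or compiler. It only shows that once the dependencies are written down, a tool can work out which steps could run at the same time. The pipeline and its step names are hypothetical; `graphlib` is part of the standard library from Python 3.9.

```python
from graphlib import TopologicalSorter


def parallel_stages(dependencies):
    """Group tasks into stages; tasks in the same stage do not depend on each other.

    dependencies maps each task name to the set of tasks it needs first.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task whose inputs are complete
        stages.append(ready)
        sorter.done(*ready)                 # mark them finished to unlock the next stage
    return stages


# Hypothetical image-processing pipeline: task -> tasks it depends on.
PIPELINE = {
    "load": set(),
    "blur": {"load"},
    "edges": {"load"},
    "histogram": {"load"},
    "combine": {"blur", "edges"},
}

if __name__ == "__main__":
    for number, stage in enumerate(parallel_stages(PIPELINE)):
        print(f"stage {number}: {stage}")
```

Here the three steps that depend only on `load` fall into the same stage, so a scheduler would be free to place them on three different cores.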

For example, we are mapping algorithms to a multi-core processor with a Single Program Multiple Data (SPMD) graphics co-processor. Some approaches being considered by researchers and developers are Intel Concurrent Collections, F#, CUDA, OpenCL and Haskell.

The cores will get simpler

When scaling to 100 cores, it makes less sense to use high performance superscalar cores (like the x86) as the individual processing units. To keep power consumption to a minimum in a many-core system, more computationally efficient cores will be needed. This will also be true for power-efficient Ultra High Performance Computing (UHPC) systems. The Silverthorne architecture (Atom) has a computational efficiency that is 3 times that of the Nehalem (Core i7 and Xeon) and is a better choice for a many-core system.

This trend will continue, and we could see chips with over 1000 cores before 2020. These chips will have substantially simpler cores than the ones used in processors today. Indeed, these systems will start to resemble what computer architects have for decades called “Massively Parallel Processors (MPP)”.

The age of spatial computing

In today’s systems, programs written in a high level language like C are sequenced into a single processor by a compiler. The single processor is called a Central Processing Unit (CPU). The focus on a “central” processing unit has led to several processes (for multiple users) being time-multiplexed into a single, complex processor using a multi-tasking operating system like Linux.

However, when we have 1000 cores on a chip, the concept of a central processing unit is no longer relevant. In an MPP system, we may start to see programs being mapped across many cores (spatial computing) rather than sequenced into a single processor. This type of spatial mapping will lead to a fundamental shift in the way that software is developed and mapped to MPP arrays. It will also redefine what we mean by an operating system.

Continuing this trend, these MPP systems may eventually start to resemble FPGAs with programmable cores and local interconnection networks. We could imagine that some day, reconfigurable computing and general purpose processing may in fact merge.

What should I do?

You can assume that the performance you get from your single threaded program today is the most you are ever likely to get. Map out the execution times for increasing sizes of customer data sets over the next 5 years.
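A first pass at mapping out those execution times could look like the Python sketch below. It assumes single-thread speed stays fixed and that run time scales as a power of the data size. The function name, the 60 second starting point and the 50 per cent annual data growth are hypothetical placeholders to be replaced with your own measurements.

```python
def project_runtimes(base_seconds, annual_data_growth, years, complexity_exponent=1.0):
    """Project single-threaded run time per year, assuming no per-core speed-up.

    annual_data_growth: fractional growth per year (0.5 means data grows 50% a year).
    complexity_exponent: 1.0 for linear-time work, 2.0 for quadratic, and so on.
    """
    projections = []
    for year in range(years + 1):
        data_scale = (1.0 + annual_data_growth) ** year
        seconds = base_seconds * data_scale ** complexity_exponent
        projections.append((year, seconds))
    return projections


if __name__ == "__main__":
    # Hypothetical inputs: a 60 s job today, data growing 50% a year, 5 year horizon.
    for year, seconds in project_runtimes(60.0, 0.5, 5):
        print(f"year {year}: {seconds:8.1f} s")
```

If your algorithm is worse than linear in the data size, raise `complexity_exponent` accordingly and the curve steepens quickly.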

How will your program perform in the future? Imagine how your customers will feel when they realise that your program and their data sets are squeezed onto a single core of their 100 core machine.

You should be thinking about how to partition your program across 2 cores. Then transfer the same executable onto an n core machine.

How will you partition up the data sets to get a speedup across n cores? What speed-up would you get?
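As a starting point, here is a minimal Python sketch of data partitioning. It assumes the work on each chunk is independent and that partial results can simply be added together. `sum_of_squares` is a hypothetical stand-in for your real computation, and worker processes are used rather than threads because CPU-bound Python threads do not run in parallel in standard CPython.

```python
from concurrent.futures import ProcessPoolExecutor


def partition(data, n_parts):
    """Split data into n_parts contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(data), n_parts)
    chunks, start = [], 0
    for index in range(n_parts):
        end = start + base + (1 if index < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """Stand-in for the real per-chunk work; it touches no shared state."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, n_workers, executor_cls=ProcessPoolExecutor):
    """Process each chunk in its own worker, then combine the partial results."""
    chunks = partition(data, n_workers)
    with executor_cls(max_workers=n_workers) as pool:
        return sum(pool.map(sum_of_squares, chunks))
```

The same code runs on 2 or n cores by changing `n_workers`. When you call it from a script, do so under an `if __name__ == "__main__":` guard so that worker processes can start cleanly on every platform. Real programs are rarely this easy to split, and any step that cannot be partitioned limits the overall speed-up.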

Can you predict the expected speed-up for n = 2...100 and will this be sufficient to maintain your competitive advantage?
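One common first-order way to make that prediction is Amdahl’s law, which this article does not name but which fits the question. It assumes a fixed fraction of the run time can be spread perfectly across cores while the rest stays sequential, and it ignores communication and synchronisation costs. The 90 per cent parallel fraction in this Python sketch is a hypothetical value; measure your own.

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Ideal speed-up on n_cores when parallel_fraction of the run time parallelises."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)


if __name__ == "__main__":
    # Hypothetical program in which 90% of the run time can be parallelised.
    for n in (2, 4, 8, 16, 32, 64, 100):
        print(f"{n:3d} cores: {amdahl_speedup(0.9, n):5.2f}x")
```

Under these assumptions a program that is 90 per cent parallel gains only about 9.2 times on 100 cores, and can never exceed 10 times however many cores are added. The sequential remainder is what decides whether your competitive advantage survives.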

If you understand the above and are comfortable with your expertise in this area then you could be one of the 2 per cent that Intel speak of. If not, you can wait for parallelising C compilers to emerge from research labs, or you can take matters into your own hands and undertake a course in multi-threaded programming.

As individuals, it is our responsibility to manage our personal value proposition. If you are trained in multi-threaded programming, you will have a skill that will be highly sought after over the next decade.

No matter how proficient you are at programming C or Java, you should consider adding multi-threaded programming to your set of skills.

This is NOT something you should try to learn on your own. It is MUCH harder to learn than sequential programming. If you are a Technical Manager, you SHOULD invest in retraining your key staff in multi-threaded programming. You might also monitor the research activities in concurrent programming languages (like those listed above). You can be sure that your competitors will.

Dr. Chris Nicol is the Chief Technology Officer of NICTA. Contact the author at chris.nicol@nicta.com.au for more information on multi-threaded programming.

NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program. In addition to federal funding NICTA is also funded and supported by the Australian Capital Territory, New South Wales, Queensland and Victorian Governments, The Australian National University, Griffith University, University of Melbourne, University of New South Wales, University of Queensland, Queensland University of Technology and The University of Sydney.


Copyright © 2010 IDG Communications, Inc.