Honors Program: Iowa State University

Project Name: Cracking the Corn Genome Code

Page 4 of 6

The Importance of Technology

How did the technology you used contribute to this project, and why was it important?

The goal of the project was to develop information technology solutions that enable and accelerate genomics research. In turn, it relied on sophisticated IT solutions at many levels, including hardware, software, and supercomputing technologies. The software runs on any multiprocessor platform that supports the Message Passing Interface (MPI). It was developed and optimized on two platforms: inexpensive commodity Linux clusters running the freely downloadable MPICH implementation of MPI from Argonne National Laboratory (http://www-unix.mcs.anl.gov/mpi/mpich/), and the IBM Blue Gene/L supercomputer, used for the maize genome assembly work and for other applications with very large data sets.

The commodity platform used initially was a 32-node, dual-processor, 1.13-GHz Intel Pentium III cluster connected by Myrinet. This interconnection network from Myricom Inc. enables multiple pairs of nodes to communicate with each other simultaneously, and this parallel communication is exploited by the computational methods and software. During the latter stages, an additional 42 dual-processor, 3-GHz Intel Xeon nodes were added to the cluster. All of the servers were manufactured by Sun Microsystems Inc. Many users of the software run it on similar commodity clusters because of their low cost and ubiquity. On the high end, much of the recent work on Blue Gene/L took place on the 1,024-node single-rack system procured by Iowa State researchers for this purpose. Larger systems at the IBM Rochester facility have been used courtesy of IBM. The software has been tested, and large-scale problems solved, on up to 8,192 nodes.

The IBM Blue Gene/L technology was important in setting this project apart. It allowed demonstration of the scalability of the developed computational methods and software. The use of a 1,024-node system was instrumental in reducing assembly time to less than a day, showcasing for the first time how powerful supercomputing technology can be used to advance computational genomics. The high-speed torus and tree interconnection networks within the Blue Gene/L system support high-speed parallel communication, which is vital to some phases of the software. Overall, the ability of the developed methods to utilize a large number of processors and to spread memory consumption uniformly across them perfectly matched the Blue Gene/L system, which provides a large number of processors for a given main memory footprint at relatively modest power requirements.
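The idea of spreading memory consumption uniformly across processors can be illustrated with a minimal sketch. This is a hypothetical illustration, not PaCE's actual distribution scheme: it simply assigns each DNA fragment to a processor by hashing its identifier, so that per-node memory stays roughly constant as the processor count grows.

```python
import hashlib

def owner(fragment_id: str, num_procs: int) -> int:
    """Assign a fragment to a processor by hashing its identifier.
    (Hypothetical scheme for illustration only.)"""
    digest = hashlib.sha256(fragment_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_procs

# Distribute 100,000 synthetic fragment IDs over 1,024 processors
# (the size of the Blue Gene/L rack mentioned above).
num_procs = 1024
counts = [0] * num_procs
for i in range(100_000):
    counts[owner(f"fragment-{i}", num_procs)] += 1

# With a good hash, every processor ends up holding close to the
# average of ~98 fragments, so no single node becomes a memory hot spot.
print(min(counts), max(counts))
```

Because the assignment depends only on the fragment identifier and the processor count, any node can compute the owner of any fragment without communication, which is one common way such distributions keep both memory and coordination costs low.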


What are the exceptional aspects of your project?

A great deal of research has been conducted in the past on assembling genomes and on other types of analysis involving DNA fragments. About a dozen software programs are already available, mostly developed in the context of specific genome sequencing projects. Some of the exceptional aspects of this project are:

1) Rethinking from scratch the computational methods for large-scale DNA sequence analysis, resulting in a radically different approach that saves memory and accelerates performance.

2) Developing a framework that unifies different applications of large-scale sequence analysis and is not restricted to assembly of uniformly sampled sequences.

3) Developing methods and software that scale to thousands of processors and reduce assembly time by two orders of magnitude.

How is it original?

The PaCE software is the only software of its kind that can scale to thousands of processors. In comparison, other programs are mostly sequential, and the few that make use of parallelism do so in a rudimentary fashion, manually partitioning the data and running separate programs on different processors. PaCE is stand-alone software that can be used on any platform with MPI, which is freely available. It remains the only software of its kind, and the most effective at addressing large-scale sequence analysis problems and at matching the data-analysis requirements of new and upcoming high-throughput sequencing strategies.

Is it the first, the only, the best, or the most effective application of its kind? All of the above.
