PathForward: US Dept. of Energy Awards $258M in Research Contracts To Develop Exascale Supercomputer Technology
by Ryan Smith on June 15, 2017 2:06 PM EST
Even though the major US national laboratories are only now starting to take delivery of the supercomputers they ordered a few years back, the long and complex development process for these projects means that the US Department of Energy (DOE) has already been focusing on the next round of supercomputers for the next decade. Under the Exascale Computing Project, the DOE expects to develop and order one (and in the end, likely several) exaFLOPS-capable supercomputers, 50 times more powerful than the generation of supercomputers being installed now.
The Exascale Computing Project is a long-term effort expected to take several years altogether, and the Department of Energy and its laboratories have already been working on it for nearly two years, slowly building towards ordering the final computer. To that end, today the project is taking its next step forward with the announcement that the DOE is awarding $258 million in research contracts to six of the US’s leading technology companies.
At a high level, the significance of this project is more than just supplying an exascale system: a major goal of the project is to figure out how to build such a system. Researchers have known for some time that traditional supercomputing paradigms won’t scale very well to exaFLOPS-level performance, as power efficiency, reliability, and interconnect performance would all struggle at those performance levels. As a result, to get the exascale systems the DOE ultimately would like to have – and to get those systems in a timely fashion to ensure US leadership in the field of supercomputing – it has taken a greater role in the research and development of the required technologies under the PathForward program.
Specifically, the department is awarding a total of $258 million in R&D contracts to help spur the development of the necessary technologies. These contracts will be going to a veritable who’s who of major US tech firms: AMD, Cray, Hewlett Packard Enterprise, IBM, Intel, and NVIDIA. All told, the participating companies will be working over a three-year contract period, with the respective firms kicking in their own money – to the tune of at least 40% of the project cost – to help develop the technologies needed to build an exascale computer for 2021.
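As a rough illustration of what that cost-share floor implies, the arithmetic can be sketched in a few lines of Python. This assumes the 40% figure refers to the total project cost (DOE funding plus company funding), so that the DOE's $258 million covers at most the remaining 60%; the function and variable names are purely illustrative.

```python
# Back-of-the-envelope cost-share arithmetic (illustrative sketch only).
# Assumption: "40% of the project cost" means 40% of the TOTAL cost,
# i.e. DOE money plus company money.

DOE_FUNDING_MUSD = 258.0    # DOE award total, in millions of USD (from the article)
MIN_COMPANY_SHARE = 0.40    # companies cover at least this fraction of the total


def minimum_totals(doe_funding, min_company_share):
    """Return (minimum total cost, minimum company contribution)."""
    # If the companies pay at least `min_company_share` of the total,
    # the DOE money is at most (1 - min_company_share) of the total.
    total = doe_funding / (1.0 - min_company_share)
    company = total - doe_funding
    return total, company


total, company = minimum_totals(DOE_FUNDING_MUSD, MIN_COMPANY_SHARE)
print(f"Minimum total project cost:   ${total:.0f}M")
print(f"Minimum company contribution: ${company:.0f}M")
```

Under that reading, the $258 million in federal money corresponds to a combined effort of at least $430 million, with at least $172 million of it coming from the six companies themselves.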
Overall, the DOE’s R&D program is intended to spur development in three areas: hardware, software, and application development. Hardware is of course the biggest issue: how do you build processors energy efficient enough to do 1 exaFLOPS of work in under 30 megawatts, especially at a time when Moore’s Law is slowing down? Even then, how do you actually connect those systems together in a meaningful manner?
The answer to that is to pull together the nation’s largest hardware firms – all of whom already have supercomputer experience – and help them to develop the next level of technology. Unsurprisingly then, the plan calls for everyone to play to their strengths: Cray and IBM working on system level challenges, while HPE develops their Memory-Driven Computing architecture that is based around byte-addressable non-volatile memory and new memory fabrics. Meanwhile Intel, AMD, and NVIDIA are all working on processor technology for the project, along with I/O technology in the case of the former two.
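To put the power target mentioned above in perspective, the short Python sketch below works out the sustained efficiency that 1 exaFLOPS in 30 megawatts implies, along with the performance level implied by the "50 times more powerful" comparison. It uses only the figures quoted in this article; the names are illustrative, and the 30 MW value is treated as the upper bound of the budget rather than an exact design point.

```python
# Efficiency implied by the exascale power target (illustrative sketch only).
# All inputs come from figures quoted in the article.

EXAFLOPS = 1e18            # 1 exaFLOPS, in floating point operations per second
POWER_BUDGET_W = 30e6      # 30 megawatts, treated here as the upper bound
SPEEDUP_VS_CURRENT = 50    # "50 times more powerful" than systems being installed now


def required_efficiency(flops, watts):
    """FLOPS per watt needed to hit `flops` within a `watts` power budget."""
    return flops / watts


gflops_per_watt = required_efficiency(EXAFLOPS, POWER_BUDGET_W) / 1e9
implied_current_pflops = EXAFLOPS / SPEEDUP_VS_CURRENT / 1e15

print(f"Required efficiency: at least {gflops_per_watt:.1f} GFLOPS per watt")
print(f"Implied current-generation system: about {implied_current_pflops:.0f} petaFLOPS")
```

In other words, the whole machine has to average a little over 33 GFLOPS per watt, and because the budget is "under" 30 MW, that figure is a floor rather than a target.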
The DOE is still years away from awarding a contract for a complete system – and such a contract will inherently hinge on the outcome of the aforementioned R&D efforts – but at a very high level it’s easy to imagine what such a system will look like, based on the companies involved. The new systems already being brought online, such as Summit, make heavy use of GPUs and other wide processors, and at a pure processing level this looks likely to be a major component of exascale systems as well. What is likely to be further off the beaten path for these systems are the storage/memory and interconnects, particularly how these can be used to make an exaFLOPS’ worth of processors work together efficiently.
Not significantly discussed in today’s DOE announcement, but still a big part of the project, will be the software to run on these systems. The issue here is much the same as with the system interconnects: actually getting applications and libraries to scale to as many threads as it would take to fill an exascale system. Some of this will be on the application development side, while other parts will come down to building supporting libraries that are up to the task.
Finally, not to be overlooked are the stakes for the Exascale Computing Project itself. For the companies involved, these research contracts are likely to lead to lucrative computer contracts down the line. Meanwhile for the US DOE and other parts of the US government and industry, it’s a matter of both technology leadership and good old-fashioned national pride. China has already usurped the Titan supercomputer, taking the top two spots in the latest Top500 list, and the country has its own plans to build an exascale computer for 2020 (meanwhile, the US Committee on Foreign Investment is looking to further restrict Chinese investment in related fields). So for the US there is a need to keep pace with (and ultimately surpass) any competing systems so that it maintains its leadership in supercomputer technology.
Source: US Department of Energy
DanNeely - Thursday, June 15, 2017
Do we have a timeline for when Summit and Sierra are supposed to go online, and potentially reclaim the top two spots on the Top500 list?
Yojimbo - Thursday, June 15, 2017
Last I heard, Summit is supposed to be assembled this year and be available for general access next year. I guess it might get benchmarked in time for the November 2017 Top500 list. Sierra is a National Nuclear Security Administration system and so there's less impetus for giving a public timeline.
Incidentally, Aurora, which was supposed to come online in 2019 and sit between Sierra and Summit in performance, is now being reviewed for changes, and seems like it will probably be delayed. It was to be built using Intel's Knights Hill Xeon Phi processors (the successor to the current generation Knights Landing Xeon Phi).
Kevin G - Saturday, June 17, 2017
Knights Hill is tied to Intel's 10 nm production. First commercial 10 nm parts are expected late this year but it takes time to ramp up to a level where Intel can produce a 600 mm^2 or greater die on a new node. With Intel's 10 nm node being delayed from original road maps, it would make sense that Aurora would also see similar delays. The idea of a review is telling though.
Yojimbo - Saturday, June 17, 2017
Yeah, Paul Messina, senior strategic advisor at the Argonne LCF, said, "The Aurora system contract is being reviewed for potential changes that would result in a subsequent system in a different time frame from the original Aurora system, but since that's just early negotiations I don't think we can be any more specific than that."
Rocket321 - Friday, June 16, 2017
I'm glad AMD still gets a seat at the table.
Ktracho - Friday, June 16, 2017
It could be interesting if, for example, AMD were chosen to contribute its CPU technology, but NVIDIA were chosen to contribute its GPU technology, and Intel its interconnection technology.
Yojimbo - Friday, June 16, 2017
These awards are technology research awards. Later, contractors will make proposals for systems that the national labs are looking to purchase. Companies, some or all of these as well as others not included in these research awards, will form various strategic partnerships to create system proposals. For instance, in the previous round of bidding, Intel and Cray worked together and IBM, NVIDIA, and Mellanox worked together. The labs will choose from the systems offered to them by the contractors.
For a system where Intel are the primary contractor they would use all their own stuff, of course. IBM would use their own processors and almost certainly NVIDIA GPUs and Mellanox interconnect. HPE are developing their own silicon photonics interconnect that they would probably want to use if they turn out to be successful with it. I am guessing that if Cray were to be the primary contractor and use Intel's OPA, they would probably use Intel's CPUs as well. Neither AMD nor NVIDIA will be primary contractors.
I guess your scenario is possible but I think it's extremely unlikely.
Kevin G - Saturday, June 17, 2017
HPE is likely using the interconnect expertise acquired via SGI which they purchased recently. SGI had the UV line which was fully coherent up to 256 sockets. Scaling to such levels with full coherency can provide a performance boost without the need for network overhead like a cluster. The real question is if Intel increased the limit to the number of sockets and cores x86 designs support. SGI was able to hit those limits previously. (The 64 TB physical memory limitation has at least been removed in SkyLake-EP.)
Integr8d - Friday, June 23, 2017
"...in a meaningful manner?"
"The last question was asked for the first time, half in jest, on May 21, 2061, at a time when humanity first stepped into the light."
I wonder if...