It’s been a year and a half since Amazon released its first-generation Graviton Arm-based processor, publicly available in AWS EC2 as the 'A1' instances. While the processor didn’t impress much in terms of performance, it was a signal and a first step of what was to come over the next few years.

This year, Amazon is doubling down on its silicon efforts, having announced the new Graviton2 processor last December and planning public availability on EC2 in the next few months. The latest generation implements Arm’s new Neoverse N1 CPU microarchitecture and mesh interconnect, a combined infrastructure-oriented platform that we detailed a little over a year ago. The platform is a massive jump over previous Arm-based server attempts, and Amazon is aiming for nothing less than a leading competitive position.

Amazon’s endeavours in designing a custom SoC for its cloud services started back in 2015, when the company acquired Israel-based Annapurna Labs. Annapurna had previously worked on networking-focused Arm SoCs, mostly used in products such as NAS devices. Under Amazon, the team was tasked with creating a custom Arm server-grade chip, and the new Graviton2 is the first serious attempt at disrupting the space.

So, what is the Graviton2? It’s a 64-core monolithic server chip design, using Arm’s new Neoverse N1 cores (microarchitectural derivatives of the mobile Cortex-A76) as well as Arm’s CMN-600 mesh interconnect. It’s a pretty straightforward design that is almost identical to Arm’s 64-core reference N1 platform that the company presented a year ago. Amazon did diverge a little bit: the Graviton2’s CPU cores are clocked slightly lower at 2.5GHz, and the mesh interconnect includes only 32MB of L3 cache instead of 64MB. The system is backed by 8-channel DDR4-3200 memory controllers, and the SoC supports 64 PCIe 4.0 lanes for I/O. It’s a relatively textbook implementation of the N1 platform, manufactured on TSMC’s 7nm process node.

The Graviton2’s potential is of course enabled by the new N1 cores. We’ve already seen the Cortex-A76 perform fantastically in last year’s mobile SoCs, and the N1 microarchitecture is expected to bring even better performance and server-grade features, all whilst retaining the power efficiency that’s made Arm so successful in the mobile space. The N1 cores remain very lean and efficient, at a projected ~1.4mm² for a 1MB L2 cache implementation such as on the Graviton2, and sporting excellent power efficiency at around ~1W per core at the 2.5GHz frequency at which Amazon’s new chip arrives.

Total power consumption of the SoC is something that Amazon wasn’t too willing to disclose in the context of our article – the company is still holding some aspects of the design close to its chest even though we were able to test the new chip in the cloud. Given the chip’s more conservative clock rate, Arm’s projected figure of around 105W for a 64-core 2.6GHz implementation, and Ampere’s recent disclosure of their 80-core 3GHz N1 server chip coming in at 210W, we estimate that the Graviton2 comes in anywhere between 80W as a low estimate and around 110W as a pessimistic projection.

Testing In The Cloud With EC2

Given that Amazon’s Graviton2 is a vertically integrated product specifically designed for Amazon’s needs, it makes sense that we test the new chip in its intended environment (besides the fact that it’s not available in any other way!). For the last couple of weeks, we’ve had preview access to Amazon Web Services’ (AWS) Elastic Compute Cloud (EC2) new Graviton2-based “m6g” instances.

For readers unfamiliar with cloud computing, this essentially means we’ve been deploying virtual machines in Amazon’s datacentres, a service for which Amazon has become famous and which now represents a major share of the company’s revenues, powering some of the biggest internet services on the market.

An important metric determining the capabilities of such instances is their type (essentially dictating what CPU architecture and microarchitecture powers the underlying hardware) and possible subtype; in Amazon’s case this refers to variations of platforms designed for specialised use-cases, such as better compute capabilities or higher memory capacity.

For today’s testing we had access to the “m6g” instances, which are designed for general purpose workloads. The “6” in the nomenclature designates Amazon’s 6th generation of EC2 hardware, with the Graviton2 currently being the only platform holding this designation.

Instance Throughput Is Defined in vCPUs

Beyond the instance type, the most important metric defining an instance’s capabilities is its vCPU count. “Virtual CPUs” are essentially the logical CPU cores available to the virtual machine. Amazon offers instances ranging from 1 vCPU up to 128, with the most common sizes across the popular platforms being 2, 4, 8, 16, 32, 48, 64, and 96.

The Graviton2 being a single-socket 64-core platform without SMT means that the maximum available vCPU instance size is 64.

However, what this also means is that we’re in a bit of an apples-and-oranges conundrum when comparing against platforms which do come with SMT. When talking about 64 vCPU instances (“16xlarge” in EC2 lingo), a Graviton2 instance gives us 64 physical cores, while an AMD or Intel system gives us only 32 physical cores with SMT. I’m sure there will be readers who consider such a comparison “unfair”, however it’s also the positioning that Amazon is out to make in terms of delivered throughput, and most importantly, the equivalent pricing between the different instance types.
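Whether a given instance exposes SMT siblings is easy to verify from inside the guest. The following is a minimal sketch, assuming a Linux guest with the usual sysfs topology layout, that counts logical CPUs against physical cores; on a Graviton2 instance the two figures should match, while on the SMT-enabled x86 instances the physical count comes in at half the vCPU count:

```c
/* A minimal sketch, assuming a Linux guest with the usual sysfs layout:
 * count logical CPUs vs. physical cores by walking the SMT sibling lists.
 * A core is counted once, via its lowest-numbered thread. */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    long logical = sysconf(_SC_NPROCESSORS_ONLN);
    long physical = 0;

    for (long cpu = 0; cpu < logical; cpu++) {
        char path[128];
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%ld/topology/thread_siblings_list",
                 cpu);
        FILE *f = fopen(path, "r");
        if (!f)
            continue;

        long first = -1;
        fscanf(f, "%ld", &first);   /* lists look like "0" (no SMT) or "0,32" */
        fclose(f);

        if (first == cpu)           /* this CPU is the core's primary thread */
            physical++;
    }

    printf("logical CPUs (vCPUs): %ld\n", logical);
    printf("physical cores:       %ld\n", physical);
    return 0;
}
```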

Today’s Competition

Today’s article will focus on two main competitors to the Graviton2: AMD EPYC 7571 (Zen1) powered m5a instances, and Intel Xeon Platinum 8259CL (Cascade Lake) powered m5n instances. At the time of writing, these are the most powerful instances available from the two x86 incumbents, and should provide the most interesting comparison data.

It should be noted that we would have loved to include AMD EPYC2 Rome based (c5a/c5ad) instances in this comparison; Amazon announced it had been working on such deployments last November, but alas the company wasn’t willing to give us preview access (one reason given was that the Rome C-type instances weren’t a good comparison to the Graviton2’s M-type instances, although this really doesn’t make much technical sense). As these instances get closer to preview availability, we’ll be working on a separate article to add that important piece of the competitive landscape puzzle.

Tested 16xlarge EC2 Instances

                     m6g                   m5a                    m5n
CPU Platform         Graviton2             EPYC 7571              Xeon Platinum 8259CL
vCPUs                64                    64                     64
Cores Per Socket     64                    32                     24 (16 instantiated)
SMT                  -                     2-way                  2-way
CPU Sockets          1                     1                      2
Frequencies          2.5GHz                2.5-2.9GHz             2.9-3.2GHz
Architecture         Arm v8.2              x86-64 + AVX2          x86-64 + AVX512
µarchitecture        Neoverse N1           Zen                    Cascade Lake
L1I Cache            64KB                  64KB                   32KB
L1D Cache            64KB                  32KB                   32KB
L2 Cache             1MB                   512KB                  1MB
L3 Cache             32MB shared           8MB shared             35.75MB shared
                                           per 4-core CCX         per socket
Memory Channels      8x DDR4-3200          8x DDR4-2666           6x DDR4-2933
                                           (2x per NUMA node)     per socket
NUMA Nodes           1                     4                      2
DRAM                 256GB                 256GB                  256GB
TDP                  Estimated 80-110W?    180W                   210W per socket
Price                $2.464 / hour         $2.752 / hour          $3.808 / hour

Comparing the Graviton2 m6g instances against the AMD m5a and Intel m5n instances, we’re seeing a few differences in the hardware capabilities that power the VMs. Again, the most notable difference is the fact that the Graviton2 comes with physical core counts matching the deployed vCPU number, whilst the competition counts SMT logical cores as vCPUs as well.

Another aspect of higher vCPU-count instances is the fact that you can receive a VM that spans several sockets. AMD’s m5a.16xlarge is still able to deploy the VM on a single socket thanks to the EPYC 7571’s 32 cores; however, Intel’s Xeon system here employs two sockets, as there’s currently no deployed Intel hardware in EC2 which can offer the required vCPU count in a single socket.

Both the EPYC 7571 and the Xeon Platinum 8259CL are parts which aren’t publicly available or even listed on either company’s SKU lists; these are custom parts made for large datacentre customers such as Amazon.

The AMD part is a 32-core Zen1-based single-socket solution (at least for the 16xlarge instances in our testing), clocking in at 2.5GHz across all cores and up to 2.9GHz in lightly threaded scenarios. The peculiarity of this system is that it’s somewhat limited by AMD’s quad-die MCM design, which results in four NUMA nodes (one per die with its 2-channel memory controller), a characteristic that’s been eliminated in the newer EPYC2 Zen2-based systems. We don’t have concrete confirmation, but based on the SKU number we suspect this is a 180W part.

Intel’s Xeon Platinum 8259CL is based on the newer Cascade Lake generation of CPU cores. This particular part is also specific to Amazon, and comes with 24 enabled cores per socket. To reach the 16xlarge 64 vCPU count, EC2 provides us a dual-socket system with 16 of the 24 cores instantiated on each socket. Again, we have no confirmation on the matter, but these parts should be rated at 210W per socket, or 420W in total. We do have to remind ourselves that we’re only ever using 66% of the system’s cores in our instance, although we do have access to the full memory bandwidth and caches of the system.

The cache configuration in particular is interesting here, as things differ quite a bit between platforms. The private caches of the actual CPUs themselves are relatively self-explanatory, and the Graviton2 provides the highest cache capacity of the trio, but is otherwise equal to the Xeon platform. If we divide the available cache on a per-thread basis, the Graviton2 leads the set at 1.5MB (1MB of private L2 plus 32MB of L3 across 64 threads), ahead of the EPYC’s 1.25MB and the Xeon’s 1.05MB. The Graviton2 and Xeon systems have the distinct advantage that their last level caches are shared across the whole socket, while AMD’s L3 is shared only amongst 4-core CCX modules.

The NUMA discrepancies between the systems aren’t that important in parallel processing workloads composed of multiple independent processes, but they will have an impact on multi-threaded as well as single-threaded performance, and the Graviton2’s unified memory architecture will have an important advantage in a few scenarios.
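These differing topologies are directly visible from inside the guest. As a minimal sketch, assuming a Linux guest with libnuma installed (link with -lnuma), the following prints how many NUMA nodes the VM sees – one for the Graviton2 m6g, four for the EPYC m5a, and two for the dual-socket Xeon m5n:

```c
/* A minimal sketch, assuming a Linux guest with libnuma available.
 * Compile with: gcc numa_nodes.c -lnuma
 * The Graviton2 m6g should report 1 node, the quad-die EPYC m5a 4,
 * and the dual-socket Xeon m5n 2. */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {   /* kernel without NUMA support */
        printf("NUMA not available\n");
        return 0;
    }
    printf("NUMA nodes: %d\n", numa_num_configured_nodes());
    printf("CPUs:       %d\n", numa_num_configured_cpus());
    return 0;
}
```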

Finally, there’s quite a difference in pricing between the instances. At $2.464 per hour, the Graviton2 system edges out the AMD system in price, and is massively cheaper than the $3.808 per hour cost of the Xeon based instance. When talking about pricing, though, we do have to remember that the actual value delivered will also depend wildly on the performance and throughput of the systems, which we’ll be covering in more detail later in the article.

We thank Amazon for providing us with preview access to the m6g Graviton2 instances. Aside from giving us access, neither Amazon nor any of the other mentioned companies had any influence on our testing methodology, and we paid for our EC2 instance testing time ourselves.

Comments

  • Wilco1 - Friday, March 13, 2020 - link

    Developing a chip based on a standard Arm core is much cheaper. Arm chip volumes are much higher than Intel's and AMD's; the costs are spread out over billions of chips.
  • ksec - Tuesday, March 10, 2020 - link

    ARM's licensing, comparatively speaking, is extremely cheap, even for their most expensive N1 core blueprint. The development costs are largely on ARM's side because of the platform model. So Amazon is really only paying the cost to fab with TSMC; I would be surprised if those chips cost more than $300, which is at least a few thousand less than Intel or even AMD.

    Amazon will have to pay for all the software costs though, making sure all their tools and software run on ARM. That is very expensive in engineering cost, but pays off in the long term.
  • extide - Friday, March 13, 2020 - link

    Actual production cost is going to be more like $50 or so. WAY less than $300.
  • ksec - Monday, March 30, 2020 - link

    The wafer cost alone would be $50+ assuming 100% yield. That is excluding licensing and additional R&D. At their volume I would not be surprised if it stacks up to $300.
  • FunBunny2 - Tuesday, March 10, 2020 - link

    "Vertical integration is powerful."

    I find it amusing that compute folks are reinventing the wheel from Henry Ford!! River Rouge.
  • mrvco - Tuesday, March 10, 2020 - link

    It would be interesting to see how the AWS instances compare to performance-competitive Azure instances on a value basis.
  • kliend - Tuesday, March 10, 2020 - link

    Anecdotally, Yes. Amazon is always trying to bring in users for little/no immediate profit.
  • skaurus - Tuesday, March 10, 2020 - link

    At scale, predictability is more important in infrastructure than cost. It may seem that if we have everything we need compiled for Arm, we can just switch over. But these things often look easier in theory than in practice. I'd be wary of moving an existing service to Arm instances, or even starting a new one, when I just want to iterate fast and be sure that the underlying level doesn't have any new surprises.
    It will be fine if I have time to experiment, or later, when the dust settles. Right now, I doubt that switching over to these instances once they are available is actually an easy or even smart decision.
  • FunBunny2 - Tuesday, March 10, 2020 - link

    "It may seem that if we have everything we need compiled for Arm, we can just switch over. But these things often look easier in theory than practice. "

    With language-compliant compilers, I don't buy that argument. It can certainly be true that RISC-ier processors yield larger binaries and slower performance, but real application failure has to be due to OS mismatches. C is the universal assembler.
  • mm0zct - Wednesday, March 11, 2020 - link

    Beware that in C, struct packing is ABI dependent; if you write out a struct to disk on x86_64 and try to read it back on AArch64, you might have a bad time unless you use the packed pragma and specified-width types (see the sketch after this comment). This is the sort of thing that might get you if you try to migrate between architectures.

    Also, many languages (including C) have hand-optimised math libraries with inline assembler, which might still be using plain-C fallbacks on other architectures. There was a good article discussing the migration to AArch64 at Cloudflare; they particularly encountered issues with Go not being optimised on AArch64 yet: https://blog.cloudflare.com/arm-takes-wing/
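    As an illustration of the struct-packing pitfall described above, here's a minimal sketch with a hypothetical record struct, assuming GCC/Clang attribute syntax:

    ```c
    /* Hypothetical illustration: the naive struct's size and padding depend
     * on the ABI's alignment rules and the platform's size of 'long', while
     * the packed, fixed-width variant has the same layout everywhere. */
    #include <stdint.h>
    #include <stdio.h>

    struct record_naive {
        char tag;
        long value;               /* size and alignment differ across ABIs */
    };

    struct __attribute__((packed)) record_portable {
        char    tag;
        int64_t value;            /* fixed width, no implicit padding */
    };

    int main(void) {
        printf("naive:    %zu bytes\n", sizeof(struct record_naive));
        printf("portable: %zu bytes\n", sizeof(struct record_portable)); /* always 9 */
        return 0;
    }
    ```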
