The Ampere Altra Max Review: Pushing it to 128 Cores per Socket

Author: Andrei Frumusanu

by Andrei Frumusanu on October 7, 2021 8:00 AM EST

It’s been a little over a year since Ampere started to deliver their first-generation Altra processors. The “Quicksilver” design with 80 Neoverse N1 cores was the first merchant Arm silicon on the market that really went “all-out” in terms of performance targets, aiming for the best of what AMD and Intel had to offer. It ended up in a very competitive standing against the newest EPYC CPUs and leapfrogged Intel’s offerings.

Since that first review, the competition has released two new platform generations: AMD’s newer EPYC Milan chips showcased a good generational boost, and Intel dramatically narrowed the performance gap with the new Ice Lake-SP Xeon parts.

Pushing it to 128 Cores

The new Altra Max is quite an exciting part, but it’s also a relatively straightforward design compared to the original Altra parts. While the original chip had been pushing 80 Neoverse-N1 cores, the new Altra Max is pushing 128 cores. There are also some smaller technical improvements between the two chip generations, but the core count is the main differentiation between the two designs.

Ampere continues to offer both Altra and Altra Max chips in their product line-up, with the Max parts in particular filling the high-core count SKU segment:

Ampere Altra SKU List
SKU	Cores	Frequency	TDP	PCIe	DDR4	Price
Altra Max "Mystique"
M128-30 (Tested)	128	3.0 GHz	250 W	128x G4	8 x 3200	$5800
M128-28	128	2.8 GHz	230 W	128x G4	8 x 3200	$5500
M128-26	128	2.6 GHz	190 W	128x G4	8 x 3200	$5400
M112-30	112	3.0 GHz	240 W	128x G4	8 x 3200	$5100
M96-30	96	3.0 GHz	220 W	128x G4	8 x 3200	$4550
M96-28	96	2.8 GHz	190 W	128x G4	8 x 3200	$4250
Altra "Quicksilver"
Q80-33 (Tested)	80	3.3 GHz	250 W	128x G4	8 x 3200	$4050
Q80-30	80	3.0 GHz	210 W	128x G4	8 x 3200	$3950
Q80-26	80	2.6 GHz	175 W	128x G4	8 x 3200	$3810
Q72-30	72	3.0 GHz	195 W	128x G4	8 x 3200	$3590
Q64-33	64	3.3 GHz	220 W	128x G4	8 x 3200	$3810
Q64-30	64	3.0 GHz	180 W	128x G4	8 x 3200	$3480
Q64-26	64	2.6 GHz	125 W	128x G4	8 x 3200	$3260
Q64-24	64	2.4 GHz	95 W	128x G4	8 x 3200	$3090
Q32-17	32	1.7 GHz	45 W	128x G4	8 x 3200	$800

The unit we’re testing today is the flagship Altra Max M128-30, with 128 cores, a 3.0GHz clock, and a maximum TDP of 250W (again, congratulations to Ampere on their straightforward and descriptive part naming).
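To illustrate how descriptive the naming is, here is a minimal Python sketch that decodes a part name using only the pattern visible in the SKU table: a family letter, the core count, then the clock in tenths of a GHz. The `Sku` class and `parse_sku` function are hypothetical helpers written for this example, not anything Ampere provides.

```python
import re
from dataclasses import dataclass

# Family letters as they appear in the SKU table (assumed mapping for this sketch).
FAMILIES = {"M": "Altra Max", "Q": "Altra"}


@dataclass(frozen=True)
class Sku:
    family: str
    cores: int
    ghz: float


def parse_sku(name: str) -> Sku:
    """Split a name such as 'M128-30' into family, core count, and clock."""
    match = re.fullmatch(r"([MQ])(\d+)-(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognised SKU name: {name!r}")
    letter, cores, tenths_ghz = match.groups()
    # The trailing digits encode the clock in tenths of a GHz (30 means 3.0 GHz).
    return Sku(FAMILIES[letter], int(cores), int(tenths_ghz) / 10)


for part in ("M128-30", "Q80-33", "Q32-17"):
    print(part, "->", parse_sku(part))
```

Every entry in the table follows this pattern, so the name alone is enough to recover the first three columns.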

Much like with the first-generation parts, platform-side features are identical throughout the product stack: every SKU features the maximum 128 lanes of PCIe 4.0 and 8-channel DDR4-3200.
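Because the memory configuration stays fixed while the core count grows, the theoretical bandwidth available per core shrinks. The sketch below works through that arithmetic, assuming the standard 64-bit (8-byte) data path per DDR4 channel. These are peak theoretical figures, not measured results, and the function names are made up for this example.

```python
CHANNELS = 8                 # 8-channel memory, per the SKU table
TRANSFERS_PER_SEC = 3200e6   # DDR4-3200 performs 3200 million transfers per second
BYTES_PER_TRANSFER = 8       # assumption: 64-bit data bus per channel, ECC bits excluded


def peak_bandwidth_gb_s(channels: int = CHANNELS) -> float:
    """Theoretical peak DRAM bandwidth per socket in GB/s."""
    return channels * TRANSFERS_PER_SEC * BYTES_PER_TRANSFER / 1e9


def per_core_gb_s(cores: int) -> float:
    """Peak bandwidth share per core if all cores pull from memory at once."""
    return peak_bandwidth_gb_s() / cores


print(f"peak per socket: {peak_bandwidth_gb_s():.1f} GB/s")
for cores in (80, 128):
    print(f"{cores} cores: {per_core_gb_s(cores):.2f} GB/s per core")
```

The peak works out to roughly 204.8 GB/s per socket, which each of the 128 cores of the Altra Max has to share with 48 more neighbours than on the 80-core part.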

Comparing the M128-30 to the Q80-33, the new Altra Max part is able to fit in 60% more cores, albeit at a roughly 10% lower frequency, within the same advertised TDP. It’s to be noted that TDP here doesn’t mean power consumption: in our initial review of the Q80-33 we noted that the chip in many workloads hovered at power levels much below the TDP. That possibly explains why Ampere was able to grow the core count this much even though the chip isn’t on a fundamentally different process node (it remains on TSMC N7), though it is on a better implementation.
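The percentages above follow directly from the SKU table, as this short Python sketch shows. It also computes an aggregate "core-GHz" figure, which is only a crude illustrative proxy introduced here: it ignores cache size, memory bandwidth, and actual sustained clocks, so it should not be read as a performance prediction.

```python
def pct_change(new: float, old: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


def core_ghz(cores: int, ghz: float) -> float:
    """Crude aggregate-throughput proxy: core count times nominal clock."""
    return cores * ghz


# Values taken from the SKU table: (cores, nominal GHz).
m128_30 = (128, 3.0)
q80_33 = (80, 3.3)

print(f"cores: {pct_change(m128_30[0], q80_33[0]):+.1f}%")
print(f"clock: {pct_change(m128_30[1], q80_33[1]):+.1f}%")
ratio = core_ghz(*m128_30) / core_ghz(*q80_33)
print(f"core-GHz ratio: {ratio:.2f}x")
```

The clock deficit is closer to 9% than 10%, and the combined core-GHz figure grows by about 45%, noticeably less than the 60% core count increase alone would suggest.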

The SKU list for the new Altra Max parts is interesting in that there are only parts from 96 cores upwards, with anything below that still being serviced by the original Altra SKUs. Given the maturity of the N7 node, Ampere likely has few chips yielding with fewer cores, and the higher clocks and larger cache of the Quicksilver chips are better suited to lower core count deployments anyhow.

In terms of pricing, Ampere is quite aggressive, vastly undercutting the MSRPs of both AMD’s and Intel’s flagship parts. As always, what large customers and hyperscalers pay is rarely in line with those prices, but it’s still a large win for Ampere in terms of visible pricing.
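One way to read the list prices within Ampere's own line-up is cost per core. The sketch below computes it from the SKU table; the `LIST_PRICES` dictionary and `price_per_core` helper are illustrative names, and the figures are list prices only, which, as noted, large customers rarely pay.

```python
# SKU name -> (cores, list price in USD), copied from the SKU table.
LIST_PRICES = {
    "M128-30": (128, 5800),
    "M128-28": (128, 5500),
    "M128-26": (128, 5400),
    "M112-30": (112, 5100),
    "M96-30": (96, 4550),
    "M96-28": (96, 4250),
    "Q80-33": (80, 4050),
    "Q80-30": (80, 3950),
    "Q64-33": (64, 3810),
    "Q32-17": (32, 800),
}


def price_per_core(sku: str) -> float:
    """List price divided by core count, in USD per core."""
    cores, usd = LIST_PRICES[sku]
    return usd / cores


# Print the SKUs from cheapest to most expensive per core.
for sku in sorted(LIST_PRICES, key=price_per_core):
    print(f"{sku:8s} ${price_per_core(sku):6.2f} per core")
```

By this measure the flagship M128-30 comes in at about $45 per core against roughly $51 for the Q80-33, so the Max parts are cheaper per core despite the higher sticker price.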

The Altra Max is extremely straightforward in terms of deployment: following some initial required firmware updates, it’s essentially a drop-in solution on the existing Altra platforms, which is exactly what we did for our review, re-using the original Mount Jade reference server from Wiwynn. The only practical note to make here is that at the time of writing, Ampere doesn’t have a dual-capable firmware stack that would enable swapping from Altra to Altra Max and vice versa. Our initial setup was a one-way upgrade, with interoperability firmware still in the works.

60 Comments

Jurgen B - Thursday, October 7, 2021 - link
Love your thorough article and testing. This is some serious firing power from the Ampere and makes some great competition for Intel and AMD. I really like the 256T runs on the AMD dual-socket EPYCs (they really are serving me well in floating point research computing), but it seems that the future holds some nice innovations in the field!
mode_13h - Thursday, October 7, 2021 - link
Lack of cache seems to be a serious liability, though. For many, it'll be a deal breaker.
Wilco1 - Friday, October 8, 2021 - link
Yet it still beats AMD's 7763 with its humongous 256MB L3 in all the multithreaded benchmarks. Sure, it would be even faster if it had a 64MB L3 cache, however it doesn't appear to be a serious liability. Doing more with far less silicon at a lower price (and power) is an interesting design point (and apparently one that cloud companies asked for).
Jurgen B - Friday, October 8, 2021 - link
Yes, Cache will play a role for many. However, people buying such servers likely have a very specific workload in mind. And thus they now have more choices which of the manufacturer options they prefer, and these choices are really good to see. Compared to 10 years ago, when AMD was much less competitive, it is wonderful to see the innovation.
schujj07 - Friday, October 8, 2021 - link
That isn't true at all. The SPEC java benchmarks have the Epyc ahead, SpecINT Base Rate-N Estimated they are almost equal (despite having half the cores), FP Base Rate-N Estimated the Epyc is ahead, compiling the Epyc is ahead. Anything that taxes the memory subsystem by not fitting into the small cache of the Altra performs worse on the Altra. Per core performance isn't even close.
mode_13h - Saturday, October 9, 2021 - link
Thanks for correcting the record, @schujj07.

The whole concept of adding 60% more cores while halving cache is mighty suspicious. In the most charitable view, this is intended to micro-target specific applications with low memory bandwidth requirements. From a more cynical perspective, it's merely an exercise in specsmanship and maybe trying to gin up a few specific benchmark numbers.
Wilco1 - Saturday, October 9, 2021 - link
If you're that cynical one could equally claim that adding *more* cache is mighty suspicious and gaming benchmark numbers. Obviously nobody would spend a few hundred million on a chip just to game benchmarks. The fact is there is a market for chips with lots of cores. Half the SPEC subtests show huge gains from 60% extra cores despite the lower frequency and halved L3. So clearly there are lots of applications that benefit from more cores and don't need a huge L3.
Wilco1 - Saturday, October 9, 2021 - link
The Altra Max wins the more useful critical-jOPS benchmark by over 30%. It also wins the LLVM compile test and SPECINT_rate by a few percent. The 7763 only wins SPECFP by 18% (not Altra's market) and max-jOPS by 13%.

So yes my point is spot on, the small cache does not look at all like a serious liability. Per-core performance isn't interesting when comparing a huge SMT core with a tiny non-SMT core - you can simply double the number of cores to make up for SMT and still use half the area...
mode_13h - Saturday, October 9, 2021 - link
> Per-core performance isn't interesting when comparing ...

Trying to change the subject? We didn't mention that. We were talking only about cache.

> The Altra Max wins the more useful critical-jOPS benchmark by over 30%.

That's really about QoS, which is a different story. Surely, relevant for some. I wonder if x86 CPUs would do better on that front with SMT disabled.

> the small cache does not look at all like a serious liability.

Of course it's a liability! It's just a very workload-dependent one. You need only note the cases where Max significantly underperforms, relative to its 80-core sibling, to see where the cache reduction is likely an issue.

The reason why there are so many different benchmarks is that you can't just seize on the aggregate numbers to tell the whole story.
mode_13h - Saturday, October 9, 2021 - link
Apologies, I now see where schujj07 mentioned per-core performance. I even searched for "per-core" but not "per core".
