The Tour of Italy with EPYC Milan: Interview with AMD's Forrest Norrod
by Dr. Ian Cutress on March 15, 2021 11:30 AM EST
As AMD initiates the official global launch of its 3rd Generation EPYC enterprise processor family, codename Milan, we spend some time with AMD’s Forrest Norrod to discuss the new processors, how the pandemic has affected adoption, what new features have influenced AMD’s positioning of its new EPYC, and what future challenges are fast approaching the enterprise processors.
AMD’s 3rd Generation EPYC, also known as Milan, offers up to 64 of the latest generation Zen 3 cores in a single socket, with 128 PCIe 4.0 lanes, eight channel DDR4-3200 memory, and a new raft of performance optimized variants combined with new security features. Leading AMD’s efforts in this space is Forrest Norrod, Senior Vice President and General Manager of the Datacenter and Embedded Solutions Business Group (formerly Enterprise, Embedded, and Semi-Custom Group, EESC). Forrest has been at AMD for over six years, leading the team since the inception of its first generation EPYC product in Naples.
Since that first generation launch, Forrest has overseen the group from a market share below 0.5% (commonly referred to as a rounding error), to above 10%. This market share growth has directly translated into revenue for AMD, and now with a substantial offering in the traditional x86 compute and enterprise market, it will be interesting to see how much of that market AMD can push into, some of which is discussed in this interview. AMD’s acquisition of Xilinx, set to close later this year, is also expected to enable new growth strategies for AMD’s EPYC in non-traditional markets for the company.
Dr. Ian Cutress, AnandTech
Forrest Norrod, AMD
In this interview, we discuss with Forrest the new Milan processors, how the pandemic has affected EPYC adoption, what new features have influenced AMD’s positioning of its new EPYC, custom processor designs, enabling novel solutions to drive exascale systems, and what future challenges are fast approaching the enterprise CPU arena.
Ian Cutress: Thank you for talking with me today! As this interview is going live, AMD is set to be launching its EPYC 3rd Generation processor family, Milan.
Forrest Norrod: That’s right, we’re continuing our tour of Italy. For me it’s a particularly important milestone, because when I joined AMD, about six and a half years ago just right after Lisa became CEO, the mission was to get AMD back to relevance with leadership in the data center. We laid out this tour of Italy, first with Naples, then with Rome, now Milan, and Genoa is next. But the first three steps we put in place way back then, six years ago, out through to Milan, it's incredibly impactful for myself as well as the whole team to deliver that third step, all of which was part of the original plan. It's great to see, and the team has done an incredible job.
IC: So when we get back to flying again, at a future date we will be over in Italy celebrating the launch of a next generation EPYC?
FN: I sure hope so. When we did Naples, AMD was just getting back into it. We were almost the scrappy startup, again, in the data center - we had limited resources, and we didn't have any funds for anything superfluous. But with Rome we were in a little bit better situation, and we actually did an event in Rome - that year we did a European launch event in Rome, about a month after the worldwide event. We had every intention of doing a blowout event in Milan for this generation, and I'm sorry to say that didn't happen! But hopefully by the time Genoa comes around next year, we'll be back in Italy.
IC: The launch of the new 3rd Generation EPYC, Milan, sees the platform move from Zen 2 cores to Zen 3 cores, with updates in performance, cache, and Infinity Fabric. How is AMD positioning Milan in the market with respect to both AMD’s own Rome, and competitive offerings from other companies?
FN: For us it’s the next step. If you think about our strategy, with Naples and the first generation it was about getting back into the market, demonstrating that we could produce an Enterprise-class and Cloud-class product. With Rome, it was really about innovating around chiplets, being the first to 7 nm, and really taking unquestioned throughput performance leadership. It also came with quite good core performance leadership and per core performance as well. With Milan, it’s all about making the core even stronger, taking the unquestioned per core performance leadership, and increasing the security features as well to really bring that next level of cloud native security to the processor. So we view this as sort of the culmination of entry leadership, throughput performance leadership, and performance leadership with Milan, all while building on security.
IC: We last spoke at AMD’s launch of Rome, with the goal to achieve 10% market share. In the middle of last year, AMD reached that goal. Where do we go with Milan – are there market share goals, or specific wins, that would measure the product a success?
FN: Unfortunately, we’re in our quiet period right now, so I can’t say. But we’re also trying not to set up specific market share goals in the short term. I will say that in the long term, and as we’ve said before, we absolutely intend to surpass the historical high watermark share target. [The] share that AMD used to have back in the Opteron days was about 26% to 27% unit share and 33% revenue share. We certainly think we’re on a drive to get back there, but I don’t want to give you any intermediate checkpoints. I will say that we think that Milan is going to continue driving the business, and really it’s incrementally more competitive than the second generation - a big increment. We think the customer excitement and traction we’re going to get with it is large.
IC: Also in our last discussion, I asked the question about partner solutions and AMD’s ability to execute with partners to meet demand. The performance of Milan is higher, and as a result I suspect you have even higher demand than before – what has DESG learned from past launches that you’ve rolled into the preparation of this launch to meet those partner requests for system co-design?
FN: One of the great things about Milan is it is socket compatible, and of course, software compatible with Rome. So for the customers that built Rome optimized systems that fully exploit all of the features, such as the PCIe Gen 4, or the memory capability, Milan is pretty much a drop-in replacement. They can drop Milan into their existing platforms and solutions and get an immediate performance bump. By [enabling this], we get the strong foundation of everything that we did on Rome, and we can immediately take advantage or our customers can immediately take advantage of Milan.
So then we thought about how we can continue expanding the ecosystem. We [are focusing] Milan on areas where Rome was good, but perhaps not unquestioned leadership in some of the per core workloads. So that’s where you see a lot of the new solutions.
I would say the other thing that’s really opened up quite a bit is we’re getting wider adoption of our security features. Google recently introduced, about six or seven months ago, their use of secure encrypted virtualization for confidential computing VMs. You’ll see others doing that here shortly. You saw VMware add support for SEV to their [private cloud] as well as [public] cloud offering. You’re going to continue to see a lot of solutions roll out that really take advantage of the new security features.
IC: I’d be remiss for not asking about supply and demand. AMD is currently facing a period of high demand, coupled with shortages in a couple of key areas, such as substrates and packaging technology. I’m not necessarily asking you to comment specifically on that situation, but how does this high demand period change how AMD implements the rollout of Milan compared to a typical enterprise launch?
FN: I think that the whole industry is obviously seeing an unprecedented level of demand. I read, as I was coming in this morning, yet another article about the demand in semiconductors causing supply issues across the industry. We think we’re in pretty good shape in terms of the absolute amount of supply that we’re bringing to the table and certainly, we’re prioritizing our enterprise products, as well as products for some of our very large customers such as the [hyperscalers]. So I don’t think we view the launch as being, in any way, supply gated. It is a question of prioritization.
IC: CEO Dr Lisa Su at the beginning of the year highlighted AMD’s Enterprise/Datacenter division, and Milan, as one of two key focal points for 2021, alongside the commercial business. This is such that Milan is to be a key driver of the company revenue, market share, and brand identity moving forward. How does that sort of focal lens adjust how you approach the product family, the launch, and the messaging?
FN: On the server side, it’s always been about datacenter customers. You do have the distinction there between the Cloud and the traditional Enterprise, and I will say that I think generally we found the Cloud [customers] to be a little bit faster to adopt new technology. Neither group is willing to go [head first] into embracing a new technology without thoroughly vetting it, or without really making sure that it’s not going to disrupt their datacenter operations. But the Enterprise end customers, I’d say, are a little more conservative.
That’s why we took this very deliberate multigenerational strategy to constantly build the strength of the product portfolio. It is also super important to do what we said we would do. I’m sure you recall when we launched Naples, I did put that three year generation roadmap out there publicly, and I said we’re going to be shipping Rome in 2019, and we’re going to be shipping Milan by the end of 2020. Which we did. We’re launching Milan this week as your viewers see this, but actually we began full production shipments in 2020. Hanging a three generation roadmap out there, you know, sort of paints a target on your back for the competition to shoot at. But we thought it was more important to establish a benchmark for execution that our customers could look at and confirm what AMD said about delivering a good product and delivering it on time. I think having that credibility is hugely important for customers to make the investments, to embrace our technology, and to adopt it.
A lot of it has been about building the great product that is more and more tuned for the end customer. So throughput with Rome (EPYC 2nd Gen) was hugely important in HPC and is hugely important to Cloud. It was a great product for a lot of legacy enterprise applications as well, but they tend to be more per thread performance sensitive. Milan was always intended to be the part that is, you know, just exceptional. Absolute leadership for those Enterprise applications. It’s a convergence, then, of AMD having the right product for Enterprise, without giving up on the throughput performance leadership for Cloud. Now we’ve demonstrated three generations - we said what we were going to do, and we did it. Our customers can trust us, and we will be there for you.
IC: We’re now a year into this pandemic, has that put a dampener on any sort of customer expectations? Has there been any reluctance to deploy new systems or adopt new platforms?
FN: I don’t think it’s put a dampener on the intent or the desire. In fact, I think from that regard, it’s been a bit of a tailwind for us, because there are a lot more people working from home, and everybody now needs a PC. It is no longer one PC per household, or in my case, it is now four or five per person. My kids each have two or three devices! So it’s a tailwind from that perspective in Enterprise as well as Cloud.
I will say that I think we did see an impact on Enterprise qualifications. That’s in Q2 and Q3 of last year, just because people shut down. We had to pivot hard to set up more remote testing sites because customers that were testing gear or had intentions to qualify gear on site no longer had people coming in to do that qualification for a long period of time. It definitely stalled us, I’d say for a couple of quarters, in terms of Enterprise customer qualification in particular. But I think that we are well past that at this point.
IC: AMD’s offering for Milan contains products focused at general performance, the 7003 series, as well as a number of specialized elements in performance per core, the F series. What has the feedback been on the F series, as it started with Rome, and how should we look at customer deployments with this new segmentation offering?
FN: Well the F series is definitely designed for customers that really need that per core performance or per thread performance. There are a lot of applications, particularly legacy applications, where that’s critically important either for overall application performance reasons or for license cost reasons. The F series is really targeted at things like EDA tools - the tools we use to design these highly complicated multi-core devices don’t generally scale well with cores! Those common simulation tools are really dominated by per core performance. So we see the F series as being perfect for that, and we’ve had tremendous uptick in EDA tool adoption.
There are also a large number of legacy enterprise applications that are licensed on a per core basis. If you want to optimize your TCO (total cost of ownership), it’s all about how to use as few cores as possible. The software costs typically dominate the cost of the hardware, and so we’ve got customers that are running eight core EPYC in fully loaded systems, fully tricked out with all the memory you can get, and all the I/O. But they’re running an eight core processor, a high frequency processor, and it’s because they’re running a database where they’re being charged on a per core basis. So we really see the per core, the F Series, as perfect for those sorts of applications, those license cost dominated applications, or when the peculiarities of the application are such that they don’t scale well with the increasing number of cores.
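To make the per core licensing argument concrete, here is a minimal sketch of the arithmetic. Every figure in it (hardware prices, license fee, three-year term) is a made-up placeholder rather than a real AMD or software vendor price, and it assumes the workload genuinely fits on the smaller number of faster cores.

```python
# Illustrative only: all prices below are hypothetical placeholders.

def total_cost(cores, hardware_cost, license_per_core_per_year, years=3):
    """Hardware plus per-core software licensing over the given term."""
    return hardware_cost + cores * license_per_core_per_year * years

def software_share(cores, hardware_cost, license_per_core_per_year, years=3):
    """Fraction of the total cost that goes to software licenses."""
    software = cores * license_per_core_per_year * years
    return software / total_cost(cores, hardware_cost, license_per_core_per_year, years)

# Hypothetical 8-core high-frequency system, fully loaded with memory and I/O.
few_fast_cores = total_cost(cores=8, hardware_cost=15_000, license_per_core_per_year=10_000)

# Hypothetical 32-core system running the same per-core licensed database.
many_cores = total_cost(cores=32, hardware_cost=12_000, license_per_core_per_year=10_000)

print(f"8 cores:  ${few_fast_cores:,}")
print(f"32 cores: ${many_cores:,}")
print(f"Software share of the 8-core system: {software_share(8, 15_000, 10_000):.0%}")
```

Under these assumed numbers the license fees dwarf the hardware cost, which is why paying more for fewer, faster cores can still lower the overall bill.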
IC: On Milan, the new top processor sits at 280 W. For the launch of Rome, we saw the top processor only at 225 W or 240 W, with a special HPC model for 280 W. With every customer now able to unlock that higher thermal product, it drives performance higher, but at the expense of moving away from peak efficiency. Do you find that customers are looking at anything other than performance? Efficiency is usually a good priority to have, but is it no longer a key concern?
FN: I would say that for many customers it’s all about performance, performance, performance. We originally did not anticipate changing the standard TDP ranges for the mainstream parts, for this generation of Milan, and we thought we were going to keep the same values. In fact part of our socket compatibility strategy around Rome was to keep the socket, the TDP, and to keep the specifications around the power that needs to be supplied and on the different voltage rails. We didn’t think that would change from when we originally defined them for Rome, but we got strong feedback from many, many customers. They wanted the flexibility to be able to go higher in power, and many of their end customers were saying that performance is the dominant thing.
That holds until you trip over a certain threshold where the cost of supplying that power or supplying the cooling really starts to go asymptotic. After working with the OEMs, we found that 280 watts was sort of that point for most air cooled systems. We also have configurable TDP parts, and the vast majority of parts can operate in a range of 225 to 280 watts, so the customer can make the choice.
IC: One of the metrics to consider with AMD’s processor designs is where the power is going. Aside from the 7nm chiplets from TSMC, the central IO die is from GlobalFoundries, and with the increase in Infinity Fabric performance the IO die is now consuming almost 40% of the total power of the processor. Are we coming up against an IO power wall - what can AMD do on that front if it’s taking power away from the cores?
FN: I would tend to agree with that. I do tell the team that every watt we spend on our IO, or anything but the core, that is power that’s not going to the metric that the customers most care about - executing that code. IO power has been a huge focus for us on this generation, and [it will be] going forward.
So the I/O die (on Milan) is actually tweaked from Rome. Originally we were planning on it being identical. It is very close, but most of the changes [in the new design] were actually around power to improve the efficiency of that I/O die, because we are running more through it in Milan. I think looking forward, we are going to continue to drive more aggressive power sensitive design techniques into the next generation of the uncore.
The I/O die in current EPYC systems is synonymous with what we call the uncore - essentially everything that is not part of the core. In general we’re trying to drive far more innovation around power management and power efficiency in the uncore. You’re going to see us continue to drive the process node very hard on both the cores as well as the uncore. We’re going to continue to drive innovation around the interconnect. So Infinity Fabric as a protocol has got a lot of legs, but you’ll see us continue to do things to make that more and more power efficient, and lower the picojoule per bit of switch traffic.
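The picojoule per bit metric mentioned here converts directly into interconnect power once a traffic rate is fixed. The sketch below shows that conversion; the energy and bandwidth values are hypothetical and are not AMD figures for Infinity Fabric.

```python
# Illustrative only: the pJ/bit and GB/s values are hypothetical, not AMD data.

def link_power_watts(picojoules_per_bit, gigabytes_per_second):
    """Power drawn by a link moving data at the given rate and energy cost."""
    bits_per_second = gigabytes_per_second * 1e9 * 8
    joules_per_bit = picojoules_per_bit * 1e-12
    return joules_per_bit * bits_per_second

# Same assumed traffic, two assumed energy costs per bit.
baseline = link_power_watts(picojoules_per_bit=2.0, gigabytes_per_second=100)
improved = link_power_watts(picojoules_per_bit=1.5, gigabytes_per_second=100)

print(f"baseline: {baseline:.2f} W, improved: {improved:.2f} W")
```

At a fixed traffic rate, any reduction in energy per bit is power that can be returned to the cores.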
IC: With Rome we saw a number of AMD’s hyperscaler customers get specialized versions of those processors, with custom core counts / power limits / binned frequencies. Is this going to be the normal going forward, and are you seeing more demand for these customized versions?
FN: Yes, yes, yes and yes. I think everyone that is operating at scale is always acutely interested in tweaking every little knob to extract all of the performance and all of the efficiency. If you look into it you’ll see in many cases the platform that’s being used by a given hyperscale customer might - for example say Tencent, I’m just picking one at random - they don’t use all of the I/O lanes. That particular configuration that they’ve optimized for is for running their instances as efficiently as possible, and they don’t use all of that I/O, and so the customization of the part, in at least one regard, is turning that I/O off, fusing it off, so that it draws zero power and diverting that power over to the cores. You get a higher base frequency for example in that particular case. I think we’re going to continue to [offer customized solutions] anywhere where there’s a bunch of scale where it’s warranted, where it makes a difference for the end customer. We’re going to continue to explore things like that.
IC: Intel promotes that half of its total enterprise CPU sales are of the ‘customized’ variety for the big customers. Would you care to comment on where AMD sits with those proportions?
FN: If I use their definition of semi-custom, it’s probably similar. Yeah, I’d have to think about it, but it’s probably similar.
IC: I guess the next question is if there is a minimum order quantity to get a customized part?
FN: You mean, for you?
IC: Sure, I’ll have a special one! Or perhaps two, let’s make it dual socket!
FN: [chuckles] Well, you’d probably need a couple of spares just in case!
Actually we don’t have a hard and fast rule [about minimum order quantity]. It’s a conversation with the customer. To be candid, it depends on what we think the opportunity is in the long-term. If we think it’s a hyperscaler that’s going to kick the tires, and it’s going to be relatively modest volume in one generation but there’s a great long term prospect, then we’ll be much more accommodating. It’s all about what we see as a long term opportunity.
IC: What is your opinion on locking a given processor to a specific customer’s platform design?
FN: There’s a desire there for security, and to try to improve security and try to improve securing the provenance of a system that’s running in somebody’s data center, that they can be sure that this is exactly what they intended to buy, and that the system provider has signed off on it. [The system] is the one that was built as intended, and nobody’s adjusted it since the time it was built and tested, and you can be assured that’s what you’ve got. So that’s the intent [of locking], and it is something that we do support. We don’t actually charge [our customers] for that by the way - I mean from our point of view, it’s not like we’re making more money by doing that. We’re trying to meet the requests from our OEM customers and some of their end customers to enable absolute supply chain security.
IC: Over the last year we have seen AMD’s customers make advances with technology such as Confidential Computing. What new security enhancements are in the arsenal for Milan?
FN: There are a couple. If you look at what we’ve already implemented in previous generations of EPYC, it was about providing cryptographic isolation, and a crypto engine that could encrypt all of the contents of a virtual machine, or really even just a process. This means that anybody without that key, even the systems administrator, couldn’t look into that virtual machine. In Milan we made a further enhancement of that with secure nested paging that makes it difficult even if the hypervisor is compromised - if somebody deliberately compromised the hypervisor and had a backdoor [to the system], secure nested paging still protects the contents of the state of that encrypted virtual machine.
The other one is related to these return operated, return oriented programming techniques that have led to some of these vulnerabilities. We do have this thing called Shadow Stack that helps provide additional security to make sure that these very subtle effects that some hackers have shown they can extract information from [aren’t possible]. We’re trying to further obscure those and make it difficult to compromise.
IC: The new Milan processors now have a feature for memory interleaving with either 8 channels of DDR4, 6 channels of DDR4, or 4 channels of DDR4. Are we getting to a stage where customers want reduced memory-channel configurations because DDR is taking up too much physical space, or cost?
FN: That’s a great question! So we have some customers that have a particular optimization point and they want a particular amount of memory. They don’t want to compromise any performance for getting that amount of memory, or take up the physical space. In Rome or in Naples, with eight channels of memory you could get full performance - you could get a pretty good well optimized and balanced system with only four channels of memory, obviously your theoretical bandwidth is cut in half, but it was well optimized. If you had six channels of memory, you get this somewhat unbalanced condition where latency and throughput [would depend on a number of factors], so that’s what we’ve really tried to address with the six channel to give that additional flexibility to right size the amount of memory for your workload without giving up any of the performance.
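For reference, the theoretical peak bandwidth at each supported channel count follows from the DDR4-3200 transfer rate and the 64-bit (8-byte) data bus of a DDR4 channel. The short sketch below is only that arithmetic; it does not model the latency or interleaving behavior discussed above.

```python
# Theoretical peak DDR4-3200 bandwidth per socket by populated channel count.

TRANSFER_RATE_MT_S = 3200   # DDR4-3200: mega-transfers per second
BYTES_PER_TRANSFER = 8      # 64-bit data bus per DDR4 channel

def peak_bandwidth_gb_s(channels):
    """Peak bandwidth in GB/s (decimal) for the given number of channels."""
    return channels * TRANSFER_RATE_MT_S * BYTES_PER_TRANSFER / 1000

for channels in (8, 6, 4):
    print(f"{channels} channels: {peak_bandwidth_gb_s(channels):.1f} GB/s")
```

This gives 204.8 GB/s for eight channels, 153.6 GB/s for six, and 102.4 GB/s for four, so the four channel case is exactly half of the eight channel peak.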
IC: While AMD increases the performance on its processor product line, the bandwidth out to DRAM remains constant. Is there an ideal intercept point where higher bandwidth memory makes sense for a customer?
FN: I think you’re absolutely right, and really at the top of the stack, depending on the workload, that can be the performance limiter. If you’re comparing top of the stack parts in certain workloads, you’re not going to see as much of a performance gain from generation to generation, just because you are memory bandwidth limited at the end of the day.
That’s going to continue as we keep increasing the performance of cores, and keep increasing the number of cores. But you should expect us to continue to increase the amount of bandwidth and memory support. DDR5 is coming, which has quite a bit of headroom over DDR4. We see more and more interest in using high bandwidth memory, for an on-package solution. I think you will see SKUs in the future from a variety of companies incorporating HBM, especially for AI. That will initially be fairly specialized, to be candid, because HBM is extremely expensive. So for most, it will be standard DDR memory, even DDR5 memory, which means that HBM is going to be confined initially to applications that are incredibly memory latency sensitive, and then you know, it’ll be interesting to see how it plays out over time.
You can see a bifurcation coming in the roadmap, where there are parts that have different memory hierarchies. Maybe with storage class memory as the main store with an HBM - on die, or a smaller memory almost like an L4 cache, or maybe a software managed resource that the application can take advantage of. But anyway, I think you’ll see innovation in the memory system in the next few years.
IC: On the topic of innovation, at the end of last year, AMD launched its CDNA architecture and accelerators. With respect to Milan, is there anything here that helps increase performance of those accelerators?
FN: There are some fabric enhancements in Milan that are sort of subtle. They increase the bandwidth between the cores and the accelerators particularly in a fully loaded system. The other thing is that you’ll see there are a few systems that have been built with Milan that allow you to overclock the PCIe links. We support, in some systems, turning up the frequency a little bit faster.
I would be remiss if I didn’t say that we also doubled the INT8 performance of the part. So for customers that still have yet to embrace GPU accelerators or FPGA accelerators, they still want to keep within the standard CPU programming paradigm, and so particularly for inference, we see a number of customers really just running their inference workloads on CPUs. That doubling of the INT8 performance really helps quite a bit.
IC: So just to confirm, that’s overclocking of the PCIe link, not of the core?
FN: Yes, oh yes, yes, yes. Exactly. Overclocking the PCIe. Overclocking is probably the wrong way to put it! There’s a thing called ESM, Extended Speed Mode, it’s a standard and we support it.
IC: AMD recently released the full ROCm 4.0 stack as a complete exascale solution for machine learning and HPC. How does ROCm develop over 2021, and how does that evolve between EPYC and CDNA?
FN: Great question. We’ve publicly talked about ROCm 4.0 a number of times, and we’re going to talk about it at the launch as well. We’re incredibly proud to be part of the effort to build (what we think will be) the first exascale system in the world, which will be deployed at Oak Ridge National Labs later this year. It’s called Frontier, and it really uses a next generation CDNA architecture, Instinct parts, which is something we haven’t announced yet. It also uses a Milan generation CPU, and the reason I say that is the CPU in that system is actually something called Trento - it’s a sibling of Milan if you will. It’s slightly different - it has a physically different piece of silicon in the I/O die, so it’s slightly different from Milan. But the key aspect there is something we think is hugely important going forward - it’s a coherent system. The CPU and the GPU will share a coherent virtual address space. The important thing [with a coherent virtual address space] is that you no longer have to spend a lot of the programming time managing fully separate pools of memory. Being able to have a coherent memory pool [shared] between the CPUs and GPUs greatly accelerates certain workloads. We think this is hugely important going forward, and we are super proud that the first instantiation is going to be with the biggest machine in the world.
IC: Well that’s a great little tidbit on Trento, thanks for saying that. So would Frontier be the ultimate visualization of that Infinity Fabric ‘All-to-All’ topology?
FN: We will continue to evolve [Infinity Fabric] over time, but Frontier is a great, great milestone. In Frontier you get this fully coherent Infinity Fabric connecting the CPU to the GPU, and the GPUs to one another. So I think it’s a great proof point for the scalability of Infinity Fabric and what it can do.
Thank you to Forrest Norrod and his team for their time.
Comments
Aleph0 - Thursday, March 18, 2021 - link
NGL I skimmed the article to see where they were going after Milan. I actually expected them to continue north into Switzerland so Geneva Bern Zurich etc, not to swerve southwest to Genoa...
CrystalCowboy - Tuesday, March 16, 2021 - link
It might be fun to port a few PS5 games to Frontier.
Rudde - Tuesday, March 16, 2021 - link
The IO-die power consumption is curious. Norrod claims that AMD has increased its efficiency, while Anandtech testing shows an ~30W increase in power consumption. Are the IO performance improvements really worth the extra power?
ballsystemlord - Wednesday, March 17, 2021 - link
Spelling and grammar errors:
"That will initially be fairly specialized to be to be candid,..."
Excess "to be":
"That will initially be fairly specialized to be candid,..."
davidefreeman - Saturday, March 20, 2021 - link
So Forrest just confirmed that there are SKU's under development with HBM. That sounds expensive, since not only do they need the HBM itself, but also the underlying silicon interposer. I think that would require a redesign of the package, unless they're able to do something like EMIB.
The additional power is going to be difficult to handle, so clocks would have to be lower. If they're redesigning the package with an interposer, it might be an interesting experiment to put the Epyc dies on the interposer, saving the power cost of driving bits over the organic substrate.
With some high-end systems adopting water cooling, maybe at some point they offer the highest performance SKU's with 64GB HBM on 4 stacks, full I/O, and high clock rates, at a 400w peak TDP for water-cooled systems only.
Unfortunately, since he mentioned HBM, not eDRAM, I don't think we'll see an L4 cache on the I/O die.
Makste - Monday, March 22, 2021 - link
"You can see a bifurcation coming in the roadmap, where there are parts that have different memory hierarchies. Maybe with storage class memory as the main store with an HBM - on die, or a smaller memory almost like an L4 cache, or maybe a software managed resource that the application can take advantage of. But anyway, I think you’ll see innovation in the memory system in the next few years."
Very interesting lay out of points.
lc0 - Thursday, May 6, 2021 - link
Forrest answered this question, about "PCIe". But somewhere here, they should say that this is on CCIX links. The CCIX supportive PHY will operate in PCIe 4.0 modes, and also support Extended Speed Modes (ESM) for Extended Data Rate (EDR) support. So endpoint devices / accelerators using CCIX, attached to Milan can take advantage of ESM if their PHYs support it.
This is not part of PCI Express Base capability. This is a CCIX feature.
> FN: Yes, oh yes, yes, yes. Exactly. Overclocking the PCIe. Overclocking is probably the wrong way to put it! There’s a thing called ESM, Extended Speed Mode, it’s a standard and we support it.