Imagination Announces PowerVR Series7XT Plus Family - Rogue Gets Improved Compute
by Ryan Smith on January 6, 2016 10:00 AM EST- Posted in
- GPUs
- Mobile
- Imagination Technologies
- PowerVR
- CES 2016
A regular sight at CES most years is a new PowerVR graphics announcement from the crew over at Imagination, and this year is no exception. Shortly before CES last year we were introduced to the company’s PowerVR Series7XT family, a significant iteration on their base Rogue architecture that added full support for the Android Extension Pack to their GPUs, along with targeted improvements to energy efficiency, overall graphics performance, and compute performance. Imagination also used Series7XT to lay the groundwork for larger designs containing more GPU clusters, giving the architecture the ability to scale up to a rather sizable 16 clusters.
After modernizing Rogue’s graphics capabilities with Series7XT, for their follow-up Imagination is taking a slightly different path. This year they are turning their efforts towards compute, while also working on energy and memory efficiency on the side. To that end the company is using CES 2016 to announce the next iteration of the Rogue architecture, PowerVR Series7XT Plus.
With Series7XT Plus, Imagination is focusing first and foremost on improving Rogue’s compute performance and compute capabilities. To accomplish this they are making two important changes to the Rogue architecture, the first of which is upgrading Rogue’s integer ALUs to more efficiently handle smaller integer formats.
Though Imagination hasn’t drawn out the integer ALUs in previous generations’ architecture diagrams, the architecture has always contained INT32 ALUs. What has changed for Series7XT Plus, then, is how those ALUs handle the smaller INT16 and INT8 formats. Previously those formats would be run through the integer ALUs as INT32s, which, though practical, meant that there were few performance gains from using smaller integers since they weren’t really processed as smaller numbers. Series7XT Plus significantly changes this: the integer ALUs can now combine operations into a single operation based on their width. One ALU can now process 1 INT32, 2 INT16s, or 4 INT8s.
Imagination’s press release doesn’t offer a ton of detail on how they are doing this; however, I suspect that they have gone with the traditional (and easiest) method, which is to simply bundle like-operations. An example of this would be bundling 4 INT8 adds into what is essentially one large INT32 addition operation, an action that requires minimal additional work from the ALU. If this is the case then the actual performance gains from using and combining smaller operations will depend on how often these operations are identical and can be bundled, though since we’re talking about parallel computing, that should be the case quite often.
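Imagination hasn’t detailed the datapath, but the intuition for why bundling like-operations is nearly free can be seen in its classic software equivalent, sometimes called SWAR (SIMD within a register). The sketch below is purely illustrative and assumes nothing about the actual Series7XT Plus ALU design; it simply shows four independent INT8 additions carried out with one INT32-wide add plus a little masking:

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative only: bundle four INT8 additions into one INT32-wide add
 * (the classic SWAR trick). The real Series7XT Plus datapath is not public;
 * this just shows why packing like-operations into a wider lane is cheap. */
static uint32_t add4_int8(uint32_t a, uint32_t b)
{
    /* Mask off the top bit of every byte so carries can't cross lanes,
     * add the low 7 bits of each lane, then patch the top bits back in. */
    uint32_t low  = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    uint32_t high = (a ^ b) & 0x80808080u;
    return low ^ high;
}

int main(void)
{
    uint32_t a = 0xFA030201u;   /* byte lanes, high to low: 250, 3, 2, 1 */
    uint32_t b = 0x0A060504u;   /* byte lanes, high to low:  10, 6, 5, 4 */
    printf("%08X\n", add4_int8(a, b));  /* prints 04090705: each lane added independently */
    return 0;
}
```

On real hardware the packing would be done inside the ALU itself, so the masking above disappears; the point is only that four narrow additions map naturally onto one 32-bit-wide operation.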
From an architecture perspective this is an interesting and unexpected departure from Imagination’s usual design. One of the traditional differences between PowerVR and competitor ARM’s Mali designs is that Imagination went with dedicated FP16 and FP32 ALUs, whereas ARM would combine operations to fill out a 128-bit SIMD. The dedicated ALU approach has traditionally allowed for greater power efficiency (your ALUs are simpler), but it also means you can end up with ALUs going unused. So for Imagination to go this route for integers is surprising, though I suspect the fact that integer ALUs are simpler to begin with has something to do with it.
As for why Imagination would care about integer performance, this brings us back to compute workloads. Rather like graphics, not all compute workloads require full INT32/FP32 precision, with computer vision being the textbook example of such a workload. Consequently, by improving their handling of lower precision integers, Imagination can boost their performance in these workloads. For a very low precision workload making heavy use of INT8s, the performance gains can be up to 4x as compared to using INT32s on Series7XT. Pragmatically speaking I’m not sure how much computer vision work phone SoCs will actually be subjected to – it’s still a field looking for its killer apps – but at the same time from a hardware standpoint I expect that this was one of the easier changes that Imagination could make, so there’s little reason for Imagination not to do this. Though it should also be noted that Rogue has far fewer integer ALUs than FP ALUs - there is just 1 integer pipeline per USC as opposed to 16 floating point pipelines - so even though smaller integers are now faster, in most cases floating point should be faster still.
Update: Imagination has sent over a newer USC diagram, confirming that there are two integer ALUs per pipeline (with 16 pipelines) rather than just a total of two ALUs per USC.
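To give a concrete sense of the kind of workload that benefits, below is a minimal, purely illustrative C version of a block-matching primitive – a sum of absolute differences of the sort that motion estimation and other vision algorithms are built on. The function name and block size are my own example rather than anything from Imagination; the relevant property is that all of the pixel math is 8-bit, which is exactly where a 4:1 INT8 rate pays off:

```c
#include <stdint.h>
#include <stdlib.h>

/* Toy 8x8 sum-of-absolute-differences (SAD) between a reference block and a
 * candidate block. Every input is an 8-bit pixel and the accumulator easily
 * fits in 32 bits, so the entire kernel is narrow-integer work. */
uint32_t sad_8x8(const uint8_t *ref, const uint8_t *cur, size_t stride)
{
    uint32_t sad = 0;
    for (size_t y = 0; y < 8; y++) {
        for (size_t x = 0; x < 8; x++) {
            int d = (int)ref[y * stride + x] - (int)cur[y * stride + x];
            sad += (uint32_t)abs(d);
        }
    }
    return sad;
}
```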
Moving on, along with augmenting their integer ALUs, Imagination is also bringing OpenCL 2.0 support to their GPUs for the first time with Series7XT Plus. Previous PowerVR parts were only OpenCL 1.2 capable, so for Imagination 2.0 support is a big step up, and one that required numerous small changes to various areas of the Rogue architecture to accommodate 2.0’s newer features.
We’ve already covered OpenCL 2.0 in depth before, so I won’t go too deep here, but for Imagination the jump to OpenCL 2.0 will bring them several benefits. The biggest change here is that OpenCL 2.0 adds support for shared virtual memory (and pointers) between CPU and GPU, which is the cornerstone of heterogeneous computing. Imagination of course also develops the MIPS architecture, so they now have a very straightforward path towards offering customers a complete heterogeneous computing environment if they need one. Otherwise from a performance perspective, OpenCL 2.0’s dynamic parallelism support should improve compute performance in certain scenarios by allowing compute kernels to directly launch other compute kernels. This ultimately makes Imagination just the second mobile GPU vendor to announce support for OpenCL 2.0, behind Qualcomm and the Adreno 500 series.
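For those wondering what shared virtual memory looks like in practice, below is a minimal host-side sketch under a generic OpenCL 2.0 runtime. This is not Imagination-specific code, and it assumes the context, queue, and kernel have already been created elsewhere (error handling omitted for brevity):

```c
#include <CL/cl.h>
#include <stddef.h>

/* Minimal sketch of OpenCL 2.0 coarse-grained shared virtual memory: one
 * allocation that both the CPU and GPU address through the same pointer. */
void run_with_svm(cl_context ctx, cl_command_queue queue, cl_kernel kernel)
{
    const size_t n = 1024;

    /* SVM buffer visible to host and device alike - no cl_mem object. */
    float *data = (float *)clSVMAlloc(ctx, CL_MEM_READ_WRITE, n * sizeof(float), 0);

    /* Coarse-grained SVM still requires map/unmap around host access. */
    clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, data, n * sizeof(float), 0, NULL, NULL);
    for (size_t i = 0; i < n; i++)
        data[i] = (float)i;
    clEnqueueSVMUnmap(queue, data, 0, NULL, NULL);

    /* The kernel receives the raw pointer directly - no explicit copies. */
    clSetKernelArgSVMPointer(kernel, 0, data);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
    clFinish(queue);

    clSVMFree(ctx, data);
}
```

The kernel side simply declares its argument as an ordinary global pointer, and with the optional fine-grained SVM feature even the map/unmap calls go away.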
Aside from compute improvements, for Series7XT Plus Imagination has also made some smaller general improvements to Rogue to further improve power efficiency. Of particular note here is the Image Processing Data Master, a new command processor specifically for 2D workloads. By routing 2D operations through this simpler command processor, Imagination can save power by not firing up the more complex pixel/vertex data masters, making this another example of how mobile GPUs have slowly been adding more dedicated hardware where the power savings outweigh the die size cost. Meanwhile Imagination’s press release also notes that they have made some memory system changes, including doubling the memory burst size to match newer fabrics and components (presumably this is an optimization for LPDDR4), and tweaking the caches and their respective sizes to reduce off-chip memory bandwidth needs by 10% or so.
Overall these efficiency changes don’t appear to be as extensive as what we saw with Series7XT – and Imagination isn’t treating them as nearly as big of a deal – so the jump from Series7XT to Series7XT Plus shouldn’t be as great as what came before. Series7XT Plus in that regard is definitely a more incremental upgrade of Rogue, with Imagination focusing on improving a few specific use cases over the last year.
| PowerVR GPU Comparison | Series7XT Plus | Series7XT | Series6XT |
|---|---|---|---|
| Clusters | 2 - 16 | 2 - 16 | 2 - 8 |
| FP32 FLOPS/Clock | 128 - 1024 | 128 - 1024 | 128 - 512 |
| FP16 Ratio | 2:1 | 2:1 | 2:1 |
| INT32 OPS/Clock | 128 - 1024 | 128 - 1024 | 128 - 512? |
| INT8 Ratio | 4:1 | 1:1 | 1:1 |
| Pixels/Clock (ROPs) | 4 - 32 | 4 - 32 | 4 - 16 |
| Texels/Clock | 4 - 32 | 4 - 32 | 4 - 16 |
| OpenGL ES | 3.2 | 3.2 | 3.1 |
| Android Extension Pack / Tessellation | Yes | Yes | Optional |
| OpenCL | 2.0 | Base: 1.2 EB, Optional: 1.2 FP | 1.2 EB |
| Architecture | Rogue | Rogue | Rogue |
Finally, along with announcing the overarching Series7XT Plus family and its architecture, Imagination is also announcing two initial GPU designs for this family: GT7200 Plus and GT7400 Plus. As alluded to by their names, these are Series7XT Plus versions of the existing two-cluster GT7200 and four-cluster GT7400 designs. That Imagination is only announcing smartphone designs is a bit odd – both of these designs are smaller than the GT7600 used in number-one customer Apple’s A9 smartphone SoC – though as Apple is the only customer using such a large design in a phone, for Imagination’s other customers these designs are likely more appropriate.
In any case, while Imagination does not formally announce when to expect their IP to show up in retail products, if history is any indicator, we should be seeing Series7XT Plus designs by the end of this year and leading into 2017.
Source: Imagination
35 Comments
iwod - Thursday, January 7, 2016 - link
We have reached the point where no more features are required for AAA games, unlike the DirectX 7 - 10 era. Mobile OpenGL ES cherry-picks features we have on desktop for performance / watt / quality reasons.

GPUs today are limited by 3 things: memory bandwidth, process node and driver quality.
Memory bandwidth is the easiest one: you can put a 512-bit memory controller on and call it a day with GDDR5, and you still have the option of GDDR5X. It is merely a cost issue. Now that we have HBM and HBM2 coming, it is unlikely memory bandwidth will be an issue for a few more years down the road.
The only reason why AMD Polaris has a jump in performance / watt is purely because of the jump from 28nm to 14nm FinFET. You can mix and match and optimize GCN only so much.
Then there is the biggest and hardest problem in town, drivers. Drivers for GPUs have gotten so god damn complex that even the GPU makers sometimes fail to understand how they got to where they are. Optimizations and shortcuts are placed everywhere trying to squeeze every ounce of performance out of special paths or specific engines. It is the sole reason why we have gone to Metal, Mantle, Vulkan and even next-gen DirectX.
There is no reason why a top-end smartphone GPU architecture like the Series7XT Plus here can't work in a laptop, given the same memory bandwidth allowance and die space. GPUs are hugely parallel units; you would need some specific design work for 32, 64 or even 128 clusters, and they would perform on par with those from AMD and Nvidia.
tuxRoller - Thursday, January 7, 2016 - link
Do you have a link to the verilog files for the Polaris gpu? Obviously you have access to it since you're able to state, definitively, where all their efficiency gains are coming from.

djgandy - Thursday, January 7, 2016 - link
You can't just slap GDDR5 on a mobile chip and suddenly have desktop grade graphics. The entire chip architecture is designed around a low bandwidth low power use case. With desktop GPU memory bandwidth there would have to be an architecture overhaul.

BurntMyBacon - Thursday, January 7, 2016 - link
@djgandy: "You can't just slap GDDR5 on a mobile chip and suddenly have desktop grade graphics. The entire chip architecture is designed around a low bandwidth low power use case."Not only that, but it is more challenging to keep the GPU busy as the chip gets larger. Significant die space has to go into making sure the correct signals arrive at the correct time on chips approaching 600mm2. Just distributing clocks across the chip is a challenging undertaking. All this extra circuitry takes power and reduces efficiency compared to a smaller chip (everything else equal), but that drop in efficiency is necessary if performance is a priority. Some architectural considerations are made based on the effective routing delay. For instance, it makes more sense to put more logic in between registers, increasing the IPCs, but reducing the clock rate, when you have larger routing delays to cover up (often found in larger chips). If all routing delays are small, it may be better to reduce logic between registers, lowering IPCs, but allowing clock rates to ramp up or voltages to be dropped at the same clock rate.
lucam - Saturday, January 9, 2016 - link
I agree. Desktop-class GPUs are quite complex and require a significant investment, with the possibility of low profit or total failure. That's why IMG has focused their products on the mobile space.

I also think that the PowerVR architecture becomes meaningless when you apply GDDR memory or HBM; they just lose the advantage of bandwidth efficiency (due to the tile-based approach) compared to other competitors that use high bandwidth memory.
name99 - Saturday, January 30, 2016 - link
Of course high bandwidth lower power RAM like HMC and HBM changes the equation... No-one is talking about doing this in a GDDR world.
BurntMyBacon - Thursday, January 7, 2016 - link
@iwod: "GPU's today are limited by 3 things, Memory Bandwidth, Processing Node and Drivers Quality."I'm going to add in size, power, and thermal constraints. Even if you are on the last gen processing node, you can achieve better performance if your application allows for a chip that is four times the size, 10 times the thermal envelop, and 100 times the power draw. I'm going to assume you considered this and didn't type it out.
@iwod: "We have reached the point where no more features are required for AAA games, unlike the Direct X 7 - 10 era. The mobile Open GL ES cherry pick features we have on desktop for performance / watt / quality reason."
OpenGL ES is a subset of OpenGL. For simplicity, we will assume OpenGL is equal, but different to DirectX. DirectX is the baseline for most PC and XBox games. Your comment suggests that either nobody who makes AAA games uses the features that are not in common with OpenGL ES, or that use of these features is not beneficial. The first is quantifiably false. I'm not a game dev, but I'm skeptical that the difference in detail and graphical immersion between phones/tablets and console/PCs is entirely due to framerate capabilities.
@iwod: "The only reason why AMD Polaris has an jump in performance / watt is purely because of the jump in 28nm to 14nm FinFET."
Clearly there is nothing to be gained from architectural tweaks. It all comes down to the process node. After all, past history shows that all 28nm processors had the same performance per watt. No improvement at all in the last three generations of GPUs. Certainly no difference between vendors to suggest that there are architectural improvements to be had that might improve efficiency. Naturally, phones and tablets are just as inefficient as laptop GPUs on the same node, which are themselves no better than their desktop GPU counterparts. WAIT WHAT?!?
babadivad - Wednesday, January 6, 2016 - link
You forgot the S/

domboy - Thursday, January 7, 2016 - link
A resurrection of the PowerVR Kyro series perhaps??

Alexvrb - Friday, January 8, 2016 - link
As a complete GPU, taking on Nvidia and AMD directly? I kind of doubt it. I wouldn't mind seeing them try, however. I remember my Kyro cards fondly.

But it would be neat to see them build a Raytracing accelerator, if it was cheap enough and drew very little power. Unfortunately then you've got the chicken and egg problem... engine support vs userbase.