HWBOT Community Forums
_mat_

FM1 and the unexplainable GPUPI performance


A recent discussion in the Team Cup 2018 thread unearthed a rather peculiar performance boost in GPUPI with Llano CPUs. The boost happens with all BIOS versions below AGESA 1.1.0.3 and shows nearly twice the performance in GPUPI while other benchmarks are not significantly affected. Thanks to @mickulty I was able to look into this issue to help the moderation of this Team Cup stage.

My first step was to reproduce the performance boost. I tried Windows 7 SP0 and SP1 and both showed the boost on a GIGABYTE GA-A75-UDH4 with BIOS version F4. Flashing to F5 or F8a removed the performance advantage again. This is reproducible every time without a single exception or variation.

The next point on my todo list was to check whether GPUPI actually "does the work". I validated that by using GPUPI's intermediate result dumping feature, which creates a dataset that is normally used to drive virtual devices to test the implementation without actually calculating anything. Side note: These virtual devices are needed to test GPUPI's thread scheduler and its scaling. The intermediate results were 100% valid and showed that the benchmark is calculating 100M correctly without any shortcuts.

Next up was OpenCL. Maybe the IGP of the APU helps with the work? Although that is theoretically impossible because Llano's integrated GPU does not support double precision calculations, this was a good opportunity to try the new native path of GPUPI 4 that's currently in its Alpha version. It is based on OpenMP, a threading model that only targets CPUs. The resulting score is even better without using OpenCL:

BIOS F4:

[Screenshot: GPUPI4-native-path-llano]

BIOS F5:

[Screenshot: GPUPI4-native-path-llano-BIOS-F5]

With the native path the calculation is completely transparent in my disassembler, so it is easy to statically analyze the involved instructions. I was able to narrow it down to the 64-bit integer modular exponentiation. To make it even easier to work on test cases and optimizations, I have a small toolset ready to create micro benchmarks from small parts of the code. I used these to produce the following test cases:
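For readers unfamiliar with the term, here is a minimal sketch of modular exponentiation by square-and-multiply. It is not GPUPI's actual code, and the function name `modpow64` is made up for illustration. It only shows where the reduction step sits: every `%` below corresponds to a remainder of a 64-bit division in a native implementation.

```python
def modpow64(base, exponent, modulus):
    """Compute (base ** exponent) % modulus by right-to-left square-and-multiply.

    Illustrative only: a native 64-bit version would need a 128-bit
    intermediate product, which Python's big integers hide.
    """
    if modulus < 1:
        raise ValueError("modulus must be positive")
    if modulus == 1:
        return 0
    result = 1
    base %= modulus
    while exponent > 0:
        if exponent & 1:
            # Multiply step, followed by a reduction (a division remainder).
            result = (result * base) % modulus
        # Square step, again followed by a reduction.
        base = (base * base) % modulus
        exponent >>= 1
    return result


if __name__ == "__main__":
    # Cross-check against Python's built-in three-argument pow().
    print(modpow64(16, 1000, 57) == pow(16, 1000, 57))
```

Each loop iteration performs up to two reductions, so the cost of the division dominates as soon as the divide is slow.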

BIOS F4:

[Screenshot: llano-modpow-results-f4]

BIOS F5 and F8a:

[Screenshot: llano-modpow-results-f8a]

What you see here are two micro benchmarks for the modular exponentiation as it is used in GPUPI. The left window (test-modpow-pibatches-dynamicdiv.exe) runs multiple modpows with different bases, moduli and exponents and shows more than twice the performance per batch on the F4 BIOS (~3 seconds vs. 8.x seconds). The right window (test-modpow-pibatches-staticmoddiv.exe) calculates only the third modpow from the left window over and over. Although that should be the same calculation, this time there is no difference between F4 and F5/F8a: both take ~1.4 seconds.

That's where it starts to get interesting for us! Why is it so much faster to calculate only one batch over and over (8.8 VS 1.4 seconds) and where is the performance boost now? The devil is in the disassembly:

[Screenshot: ida-mod-difference]

What you see here are the inner loops of the modular exponentiations. On the left is the slow multi-batch version and on the right the faster third modpow. You need to know that the modulo is calculated as the remainder of a division. When you search for a div instruction in the faster code on the right, you won't find any. Because we declared the batch with a static variable (more or less), the compiler was able to optimize away the always horribly slow 64-bit div and substituted two multiplications, a right bit shift and a subtraction, which is way faster. So now we know that these instructions are not the problem: they perform equally on both BIOS versions. That leaves us with the solution: the performance of the 64-bit integer div instruction.
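The substitution described above can be sketched as follows. This is a simplified illustration of the general technique (dividing by a known constant via a precomputed "magic" multiplier), not the exact instruction sequence from the disassembly. The helper names `magic_for` and `divmod_const` are made up, and real compilers add fix-ups so that everything fits into 64-bit registers.

```python
N = 64  # operand width in bits (assumption: unsigned 64-bit values)


def magic_for(d):
    """Precompute (multiplier, shift) for dividing N-bit values by the constant d."""
    if d < 1 or d >= (1 << N):
        raise ValueError("divisor must satisfy 1 <= d < 2**N")
    l = (d - 1).bit_length()      # ceil(log2(d))
    shift = N + l
    m = (1 << shift) // d + 1     # computed once, ahead of time
    return m, shift


def divmod_const(n, d, m, shift):
    """Quotient and remainder of n / d without a division instruction."""
    q = (n * m) >> shift          # multiplication 1 + right shift
    r = n - q * d                 # multiplication 2 + subtraction
    return q, r


if __name__ == "__main__":
    d = 57
    m, shift = magic_for(d)
    n = 2**64 - 1
    print(divmod_const(n, d, m, shift) == divmod(n, d))
```

The one division in `magic_for` happens only once per divisor, which is why this only works when the modulus is known ahead of time, as in the static test case.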

Finally I was able to write the micro benchmarks that exactly show the problem in numbers:

BIOS F4:

[Screenshot: Llano-F4-mul-mod-div-results]

BIOS F5/F8a:

[Screenshot: Llano-F8a-mul-mod-div-results]

From left to right:

  • 64 bit integer multiplication:
    F4 ....... 0.84s
    F5/F8a ... 0.84s
  • 64 bit integer modulo:
    F4 ....... 13.7s
    F5/F8a ... 33.86s
  • 64 bit integer division:
    F4 ....... 13.69s
    F5/F8a ... 33.86s
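As a quick check of the arithmetic, the slowdown factors follow directly from the timings listed above. The snippet only divides the reported numbers; the dictionary layout and the function name `slowdown` are chosen for illustration.

```python
# Timings in seconds as reported above: (BIOS F4, BIOS F5/F8a).
timings = {
    "multiplication": (0.84, 0.84),
    "modulo": (13.7, 33.86),
    "division": (13.69, 33.86),
}


def slowdown(f4_seconds, newer_seconds):
    """How many times slower the newer BIOS is compared to F4."""
    return newer_seconds / f4_seconds


if __name__ == "__main__":
    for name, (f4, newer) in timings.items():
        print(f"{name}: {slowdown(f4, newer):.2f}x")
```

Multiplication is unchanged, while modulo and division both come out at roughly 2.5 times slower on F5/F8a.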

TL;DR:

  • The performance difference is reproducible at any given time
  • GPUPI does the work
  • The 64-bit integer division instructions that calculate the modulos inside the modular exponentiation of the GPUPI core are responsible for the performance difference
  • Starting with AGESA 1.1.0.3 on FM1, presumably all APUs calculate 64-bit integer divisions about 2.5 times slower than they could. 😑

Fantastic testing mate, well done and thank you for digging into this and thanks Mickulty for providing the tools to do so :)


There must be some other voodoo at play because I saw zero difference with pre 1.1.0.3 AGESA BIOS on GA-A75M-UD2H and GA-A75-D3H. 🙁


Voodoo is only at play when there is not enough disclosure. There is always an explanation, start digging. :)


Well, I probably need some kind of guide for newbies, because I can never get older CPUs recognised, even ones that should be supported by OpenCL and AMD's APP, not to mention the difficulty of locating AMD's earlier SDK versions. Most of the time I'm just glad when GPUPI recognises the CPU and runs. You've got a neat benchmark, but it's frustrating just to get it working.


It was never meant to run on CPUs in the first place (hence the name), but I'm glad it is used with both.

Yeah, OpenCL is a turn-off to say the least. But that will change soon with GPUPI 4.0. As you can see in the screens, the native path is already stable and faster than OpenCL on all platforms. The reason it's taking longer than anticipated is that an early release without good SSE and AVX support would end up in the same dilemma as GPUPI 3.3. And I really want that AVX support. 😎

5 hours ago, _mat_ said:

TL;DR:

  • The performance difference is reproducible at any given time
  • GPUPI does the work
  • The 64-bit integer division instructions that calculate the modulos inside the modular exponentiation of the GPUPI core are responsible for the performance difference
  • Starting with AGESA 1.1.0.3 on FM1, presumably all APUs calculate 64-bit integer divisions about 2.5 times slower than they could. 😑

Thanks for all the time you spent investigating this! 👍 It would be funny to submit a bug report to AMD just to see their reaction.

3 hours ago, cbjaust said:

There must be some other voodoo at play because I saw zero difference with pre 1.1.0.3 AGESA BIOS on GA-A75M-UD2H and GA-A75-D3H. 🙁

Are you using Win 7 64-bit SP1, AMD SDK 2.91 and GPUPI 3.2 non-legacy with HPET off?


Great work identifying the problem!!

And now some wild guessing, since I read that a lot of division operations happen. I'm no programmer, though, so I don't actually understand everything you wrote on the first page:

http://www.planet3dnow.de/cgi-bin/newspub/viewnews.cgi?id=1334532731

Maybe their patch helps to restore performance?

https://www.passmark.com/forum/performancetest/3705-amd-llano-a-series-benchmark-and-cpu-bug?t=3656

 

edit:

This would also explain why the AGESA version doesn't matter, as this needs to be done by the BIOS manufacturers.


Nice find! This looks exactly like the problem that we have encountered. The only thing that doesn't fit is the PassMark numbers. They are even worse than those in my micro benchmarks for div and modulo.

Well, it might be possible to write a fix that sets the MSR mentioned in the PassMark forum to enable the division unit again. If this were a new CPU generation, I would do it. But for seven-year-old CPUs this is just overkill.

Btw the article also explains why an old BIOS might not work. Windows could be responsible for disabling the division unit no matter what the BIOS says. So I guess that's why Windows 8 and Windows 10 show no performance advantage with pre AGESA 1.1.0.3 BIOS versions. This might be true for Windows 7 as well when optional updates are installed. I didn't install everything on my test drive, just SP1 and nothing optional.

3 hours ago, _mat_ said:

Btw the article also explains why an old BIOS might not work. Windows could be responsible for disabling the division unit no matter what the BIOS says. So I guess that's why Windows 8 and Windows 10 show no performance advantage with pre AGESA 1.1.0.3 BIOS versions. This might be true for Windows 7 as well when optional updates are installed. I didn't install everything on my test drive, just SP1 and nothing optional.

Yes, I remember there was something similar with those unlocked Haswell mobile CPUs. You have to prevent Windows from loading the Intel firmware on boot.

I don't think that you need to implement the fix in GPUPI. All that's needed is the patch exe from the PassMark guys.

To quote their readme:

Quote

Release 1.1
WIN32 release 3 Apr 2012
WIN64 release 3 Apr 2012
- Allow errata 665 work around patch to be removed. i.e If MSRC001_1029[31] is 
  set to 1 (e.g. by BIOS), allow MSRC001_1029[31] to be set to 0.
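
The notation MSRC001_1029[31] in the readme refers to bit 31 of the model-specific register at address C001_1029h. As a minimal sketch of what setting and clearing that bit means, the following operates on a plain integer standing in for the register value. It does not touch hardware: reading or writing an actual MSR requires kernel-level access, which is what the patch tool takes care of. The function names and the sample value are made up for illustration.

```python
WORKAROUND_BIT = 31            # bit index quoted in the readme: MSRC001_1029[31]
MASK64 = (1 << 64) - 1         # MSRs are 64 bits wide


def workaround_enabled(msr_value):
    """True if bit 31 is set in the given register value."""
    return bool((msr_value >> WORKAROUND_BIT) & 1)


def set_workaround(msr_value, enabled):
    """Return a copy of msr_value with bit 31 set or cleared; other bits untouched."""
    mask = 1 << WORKAROUND_BIT
    if enabled:
        return (msr_value | mask) & MASK64
    return msr_value & ~mask & MASK64


if __name__ == "__main__":
    value = 0x0000_0000_8000_0001          # hypothetical register content
    print(workaround_enabled(value))
    print(hex(set_workaround(value, False)))
```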

This reminds me of The Stilt's Bulldozer Conditioner...

All we need is someone to test if it really works.
I think I have a board and a CPU somewhere, both from scrap and not known to work. Maybe it's time to test this bundle now.

6 hours ago, unityofsaints said:

Thanks for all the time you spent investigating this! 👍 It would be funny to submit a bug report with AMD just to see their reaction.

Are you using Win 7 64-bit SP1, AMD SDK 2.91 and GPUPI 3.2 non-legacy with HPET off?

Not sure if it's HPET or some Windows Update making the performance "normal", but a fresh 2008 R2 install and the F2 BIOS on the GA-A75-D3H did the trick. I was using a pretty much up-to-date Windows 7 x64 install before and had HPET on. So yeah. Interesting that you stumbled onto this mad performance, and great work by _mat_ verifying his software.

38 minutes ago, Strunkenbold said:

Yes, I remember there was something similar with those unlocked Haswell mobile CPUs. You have to prevent Windows from loading the Intel firmware on boot.

I don't think that you need to implement the fix in GPUPI. All that's needed is the patch exe from the PassMark guys.

To quote their readme:

This reminds me of The Stilt's Bulldozer Conditioner...

All we need is someone to test if it really works.
I think I have a board and a CPU somewhere, both from scrap and not known to work. Maybe it's time to test this bundle now.

The patch, when applied, immediately sets the performance back to "normal" :( Rerunning the patch and selecting "no" for the workaround brings back the mad performance. :)

 

3 hours ago, cbjaust said:

The patch, when applied, immediately sets the performance back to "normal" :( Rerunning the patch and selecting "no" for the workaround brings back the mad performance. :)

 

Thanks for testing this. Maybe _mat_ can confirm.
I think this gets us to a comfortable position where we can safely allow this "tweak". It's just like what we did with The Stilt's work optimizing SuperPi performance on Bulldozer. AFAIK he was also messing around with CPU registers.
