HWBOT Community Forums


Posted (edited)

A recent discussion in the Team Cup 2018 thread unearthed a rather peculiar performance boost in GPUPI with Llano CPUs. The boost happens with all BIOS versions below AGESA 1.1.0.3 and shows nearly twice the performance in GPUPI while other benchmarks are not significantly affected. Thanks to @mickulty I was able to look into this issue to help the moderation of this Team Cup stage.

My first step was to reproduce the performance boost. I tried Windows 7 SP0 and SP1 and both showed the boost on a GIGABYTE GA-A75-UDH4 with BIOS version F4. Flashing to F5 or F8a removed the performance advantage again. This is reproducible every time without a single exception or variation.

The next point on my todo list was to check if GPUPI "does the work". I validated that with GPUPI's intermediate result dumping feature, which creates a dataset that is normally used by virtual devices to test the implementation without actually calculating anything. Side note: These virtual devices are needed to test GPUPI's thread scheduler and its scaling. The intermediate results were 100% valid and showed that the benchmark is calculating 100M correctly without any shortcuts.

Next up was OpenCL. Maybe the IGP of the APU helps with the work? Although that is theoretically impossible because Llano's integrated GPU does not support double precision calculations, this was a good opportunity to try the new native path of GPUPI 4, which is currently in its alpha version. It is based on OpenMP, a threading model that only targets CPUs. The resulting score is even better without using OpenCL:

BIOS F4:

[Screenshot: GPUPI4-native-path-llano.png]

BIOS F5:

[Screenshot: GPUPI4-native-path-llano-BIOS-F5.png]

With the native path the calculation is completely transparent in my disassembler, so it is easy to statically analyze the involved instructions. I was able to narrow it down to the 64-bit integer modular exponentiation. To make it even easier to work on test cases and optimizations, I have a small toolset ready to create micro benchmarks from small parts of the code. I used these to build the following test cases:

BIOS F4:

[Screenshot: llano-modpow-results-f4.png]

BIOS F5 and F8a:

[Screenshot: llano-modpow-results-f8a.png]

What you see here are two micro benchmarks for the modular exponentiation as it is used in GPUPI. The left window (test-modpow-pibatches-dynamicdiv.exe) does multiple modpows with different base, modulus and exponent and shows more than twice the performance per batch for the F4 BIOS (~3 seconds vs. 8.x seconds). The right window (test-modpow-pibatches-staticmoddiv.exe) calculates only the third modpow from the left window over and over. Although that should be the same calculation, this time there is no difference between F4 and F5/F8a: both take ~1.4 seconds.
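For readers who have not met the operation before, here is a minimal sketch of modular exponentiation using the textbook square-and-multiply method. It is written in Python for brevity and is not GPUPI's actual code; the function name `modpow` is only borrowed from the benchmark file names. Every `%` in the loop is a spot where compiled code needs a remainder, which is a division when the modulus is only known at run time.

```python
def modpow(base, exponent, modulus):
    """Compute (base ** exponent) % modulus by square-and-multiply."""
    if modulus == 1:
        return 0
    result = 1
    base %= modulus
    while exponent > 0:
        if exponent & 1:
            # Multiply step: one remainder operation
            result = (result * base) % modulus
        # Square step: another remainder operation
        base = (base * base) % modulus
        exponent >>= 1
    return result


print(modpow(4, 13, 497))  # 445
```

With two remainder operations per exponent bit in the worst case, the cost of the division instruction dominates the loop very quickly.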

That's where it starts to get interesting for us! Why is it so much faster to calculate only one batch over and over (8.8 VS 1.4 seconds) and where is the performance boost now? The devil is in the disassembly:

[Screenshot: ida-mod-difference.png]

What you see here are the inner loops of the modular exponentiations. On the left is the slow multi version and on the right the faster third modpow. You need to know that the modulo is calculated as the remainder of a division. When you search for a div instruction in the faster code on the right, you won't find any. Because we declared the batch with a static variable (more or less), the compiler was able to optimize away the always horribly slow 64-bit div and filled in two multiplications, a bit shift right and a subtraction instead, which is way faster. So now we know that these instructions are not the problem: they perform equally on both BIOS versions. And that leaves us with the solution: the performance of the 64-bit integer div instruction.
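The substitution described above can be sketched roughly like this. It is a Python illustration of the general multiply-and-shift trick for a divisor known in advance, not the instruction sequence from the screenshot; the helper names and the example divisor are made up, and a real compiler additionally squeezes the constant into 64 bits, which this sketch does not bother with.

```python
BITS = 64  # width of the dividend, as in the 64-bit case discussed here


def magic_for(divisor, bits=BITS):
    """Precompute a multiplier and shift for a fixed divisor.

    This is the "compile time" part: it may use a real division,
    because it runs only once per divisor.
    """
    if divisor < 1:
        raise ValueError("divisor must be positive")
    shift = bits + (divisor - 1).bit_length()   # 2**(shift - bits) >= divisor
    magic = -(-(1 << shift) // divisor)         # ceil(2**shift / divisor)
    return magic, shift


def mod_without_div(n, divisor, magic, shift):
    """Remainder of n / divisor for 0 <= n < 2**BITS without dividing."""
    quotient = (n * magic) >> shift   # multiplication + bit shift right
    return n - quotient * divisor     # multiplication + subtraction


magic, shift = magic_for(1000003)     # hypothetical fixed modulus
print(mod_without_div(123456789012345678, 1000003, magic, shift)
      == 123456789012345678 % 1000003)  # True
```

The run-time path contains exactly two multiplications, one shift and one subtraction, which matches the instruction mix named in the post.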

Finally I was able to write the micro benchmarks that exactly show the problem in numbers:

BIOS F4:

[Screenshot: Llano-F4-mul-mod-div-results.png]

BIOS F5/F8a:

[Screenshot: Llano-F8a-mul-mod-div-results.png]

From left to right:

  • 64 bit integer multiplication
    F4 ....... 0.84s
    F5/F8a ... 0.84s
  • 64 bit integer modulo
    F4 ....... 13.7s
    F5/F8a ... 33.86s
  • 64 bit integer division
    F4 ....... 13.69s
    F5/F8a ... 33.86s
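To put the timings above into ratios, here is a small Python calculation. It uses only the numbers from the list; the variable and function names are just for illustration.

```python
# (F4 seconds, F5/F8a seconds) as listed above
timings = {
    "multiplication": (0.84, 0.84),
    "modulo": (13.70, 33.86),
    "division": (13.69, 33.86),
}


def slowdown(f4_seconds, newer_seconds):
    """How many times longer the newer BIOS takes compared to F4."""
    return newer_seconds / f4_seconds


for name, (f4, newer) in timings.items():
    print(f"{name}: {slowdown(f4, newer):.2f}x")
# multiplication: 1.00x
# modulo: 2.47x
# division: 2.47x
```

So multiplication is unaffected, while modulo and division both take roughly 2.5 times as long on F5/F8a.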

TL;DR:

  • The performance difference is reproducible at any given time
  • GPUPI does the work
  • The 64-bit integer division instructions that calculate the modulos inside the modular exponentiation of the GPUPI core are responsible for the performance difference
  • Starting with AGESA 1.1.0.3 on FM1, presumably all APUs calculate 64-bit integer divisions about 2.5 times slower than they could.
Edited by _mat_
Posted (edited)

There must be some other voodoo at play because I saw zero difference with pre 1.1.0.3 AGESA BIOS on GA-A75M-UD2H and GA-A75-D3H.

Edited by cbjaust
Posted

Well, I probably need some kind of guide for newbies, because with older CPUs that should be supported by OpenCL and AMD's APP I can never get them recognised, not to mention the difficulty of locating AMD's earlier SDK versions. Most of the time I'm just glad when GPUPI recognises the CPU and runs. You've got a neat benchmark, but it's frustrating just to get it working.

Posted

It was never meant to run on CPUs in the first place (hence the name), but I'm glad it is used with both.

Yeah, OpenCL is a turn-off to say the least. But that will change soon with GPUPI 4.0. As you can see in the screenshots, the native path is already stable and faster than OpenCL on all platforms. It's taking longer than anticipated because an early release without good SSE and AVX support would end up in the same dilemma as GPUPI 3.3. And I really want that AVX support.

Posted
5 hours ago, _mat_ said:

TL;DR:

  • The performance difference is reproducible at any given time
  • GPUPI does the work
  • The 64-bit integer division instructions that calculate the modulos inside the modular exponentiation of the GPUPI core are responsible for the performance difference
  • Starting with AGESA 1.1.0.3 on FM1, presumably all APUs calculate 64-bit integer divisions about 2.5 times slower than they could.

Thanks for all the time you spent investigating this! It would be funny to submit a bug report to AMD just to see their reaction.

3 hours ago, cbjaust said:

There must be some other voodoo at play because I saw zero difference with pre 1.1.0.3 AGESA BIOS on GA-A75M-UD2H and GA-A75-D3H.

Are you using Win 7 64-bit SP1, AMD SDK 2.91 and GPUPI 3.2 non-legacy with HPET off?

Posted (edited)

Great work identifying the problem!!

And now some wild guessing, since I read that a lot of division operations happen. I'm no programmer, though, so I don't really understand what you wrote on the first page:

http://www.planet3dnow.de/cgi-bin/newspub/viewnews.cgi?id=1334532731

Maybe their patch helps to restore performance?

https://www.passmark.com/forum/performancetest/3705-amd-llano-a-series-benchmark-and-cpu-bug?t=3656

 

edit:

This would also explain why AGESA doesn't matter, as this needs to be done by the BIOS manufacturers.

Edited by Strunkenbold
Posted

Nice find! This looks exactly like the problem that we have encountered. The only thing that doesn't fit is the PassMark numbers. They are even worse than those in my micro benchmarks for div and modulo.

Well, it might be possible to write a fix that sets the MSR mentioned in the PassMark forum to enable the division unit again. If this were a new CPU generation, I would do it. But for seven-year-old CPUs this is just overkill.

Btw the article also explains why an old BIOS might not work. Windows could be responsible for disabling the division unit no matter what the BIOS says. So I guess that's why Windows 8 and Windows 10 show no performance advantage with pre AGESA 1.1.0.3 BIOS versions. This might be true for Windows 7 as well when optional updates are installed. I didn't install everything on my test drive, just SP1 and nothing optional.

Posted
3 hours ago, _mat_ said:

Btw the article also explains why an old BIOS might not work. Windows could be responsible for disabling the division unit no matter what the BIOS says. So I guess that's why Windows 8 and Windows 10 show no performance advantage with pre AGESA 1.1.0.3 BIOS versions. This might be true for Windows 7 as well when optional updates are installed. I didn't install everything on my test drive, just SP1 and nothing optional.

Yes, I remember there was something similar with those unlocked Haswell mobile CPUs. You have to prevent Windows from loading the Intel firmware on boot.

I don't think you need to implement the fix in GPUPI. All that's needed is the patch exe from the PassMark guys.

To quote their readme:

Quote

Release 1.1
WIN32 release 3 Apr 2012
WIN64 release 3 Apr 2012
- Allow errata 665 work around patch to be removed. i.e If MSRC001_1029[31] is 
  set to 1 (e.g. by BIOS), allow MSRC001_1029[31] to be set to 0.
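In terms of plain bit operations, the readme entry boils down to toggling bit 31 of the 64-bit MSR value. A minimal Python sketch, which only works on an ordinary integer and does not touch any hardware (reading or writing a real MSR needs kernel-mode access, which is what the patch exe takes care of). The function names are made up for illustration; per the readme, 1 means the workaround is applied.

```python
MASK64 = (1 << 64) - 1
WORKAROUND_BIT = 31  # MSRC001_1029[31], as named in the readme


def workaround_applied(msr_value):
    """True if bit 31 is set, i.e. the erratum 665 workaround is active."""
    return bool((msr_value >> WORKAROUND_BIT) & 1)


def set_workaround(msr_value, applied):
    """Return the MSR value with bit 31 set or cleared, other bits untouched."""
    bit = 1 << WORKAROUND_BIT
    if applied:
        return (msr_value | bit) & MASK64
    return msr_value & ~bit & MASK64


value = set_workaround(0x1234, True)
print(hex(value), workaround_applied(value))   # 0x80001234 True
value = set_workaround(value, False)
print(hex(value), workaround_applied(value))   # 0x1234 False
```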

This reminds me of The Stilt's Bulldozer Conditioner...

All we need is someone to test if it really works.
I think I have a board and a CPU somewhere, both from scrap and not known to work. Maybe it's time to test this bundle now.

Posted
6 hours ago, unityofsaints said:

Thanks for all the time you spent investigating this! It would be funny to submit a bug report to AMD just to see their reaction.

Are you using Win 7 64-bit SP1, AMD SDK 2.91 and GPUPI 3.2 non-legacy with HPET off?

Not sure if it's HPET or some Windows Update making the performance "Normal", but a fresh 2008 R2 install and the F2 BIOS on the GA-A75-D3H did the trick. I was using a pretty much up-to-date Windows 7 x64 install before and had HPET on. So yeah. Interesting that you stumbled onto this mad performance, and great work by _mat_ verifying his software.

Posted
38 minutes ago, Strunkenbold said:

Yes, I remember there was something similar with those unlocked Haswell mobile CPUs. You have to prevent Windows from loading the Intel firmware on boot.

I don't think you need to implement the fix in GPUPI. All that's needed is the patch exe from the PassMark guys.

To quote their readme:

This reminds me of The Stilt's Bulldozer Conditioner...

All we need is someone to test if it really works.
I think I have a board and a CPU somewhere, both from scrap and not known to work. Maybe it's time to test this bundle now.

The patch when implemented immediately sets the performance back to "normal" :( Rerunning the patch and selecting no to the workaround gets back the mad performance. :)

 

Posted
3 hours ago, cbjaust said:

The patch when implemented immediately sets the performance back to "normal" :( Rerunning the patch and selecting no to the workaround gets back the mad performance. :)

 

Thanks for testing this. Maybe _mat_ can confirm.
I think this is getting us to a comfortable position where we can safely allow this "tweak". It's just like what we did with The Stilt's work optimizing SuperPi performance on Bulldozer; afaik he was also messing around with CPU registers.
