_mat_ Posted September 11, 2018 (edited)

A recent discussion in the Team Cup 2018 thread unearthed a rather peculiar performance boost in GPUPI with Llano CPUs. The boost happens with all BIOS versions below AGESA 1.1.0.3 and shows nearly twice the performance in GPUPI, while other benchmarks are not significantly affected. Thanks to @mickulty I was able to look into this issue to help the moderation of this Team Cup stage.

My first step was to reproduce the performance boost. I tried Windows 7 SP0 and SP1, and both showed the boost on a GIGABYTE GA-A75-UDH4 with BIOS version F4. Flashing to F5 or F8a removed the performance advantage again. This is reproducible every time, without a single exception or variation.

The next point on my todo list was to check whether GPUPI "does the work". I validated that with GPUPI's intermediate result dumping feature, which creates a dataset that is normally used by virtual devices to test the implementation without actually calculating anything. Side note: these virtual devices are needed to test GPUPI's thread scheduler and its scaling. The intermediate results were 100% valid and showed that the benchmark calculates 100M correctly without any shortcuts.

Next up was OpenCL. Maybe the IGP of the APU helps with the work? Although that is theoretically impossible, because Llano's integrated GPU does not support double precision calculations, this was a good opportunity to try the new native path of GPUPI 4, which is currently in its alpha version. It is based on OpenMP, a threading model that is only compatible with CPUs. The resulting score is even better without using OpenCL:

[Screenshots: BIOS F4, BIOS F5]

With the native path the calculation is completely transparent in my disassembler, so it is easy to statically analyze the involved instructions. I was able to narrow it down to the 64 bit integer modular exponentiation.
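For readers who have not met the term, here is a minimal sketch of modular exponentiation using the usual square-and-multiply scheme. It is not GPUPI's actual code: the function name `modpow` and the sample values are made up for illustration, and Python integers are arbitrary precision, so the 64 bit overflow handling a native implementation needs is not shown. The point is only that every multiply and every squaring is followed by a modulo, and a 64 bit modulo with a divisor that is only known at runtime typically compiles to a `div` instruction on x86-64.

```python
def modpow(base, exponent, modulus):
    """Right-to-left binary (square-and-multiply) modular exponentiation."""
    if modulus <= 0:
        raise ValueError("modulus must be positive")
    if exponent < 0:
        raise ValueError("exponent must be non-negative")
    if modulus == 1:
        return 0
    result = 1
    base %= modulus
    while exponent > 0:
        if exponent & 1:
            # One modulo (a division in native code) per multiply.
            result = (result * base) % modulus
        # One modulo (a division in native code) per squaring.
        base = (base * base) % modulus
        exponent >>= 1
    return result


# Hypothetical sample values, checked against Python's built-in pow().
print(modpow(16, 1000, 1001) == pow(16, 1000, 1001))
```

With a 64 bit exponent the loop runs up to 64 times, so a single modpow can issue more than a hundred divisions, which is why the speed of that one instruction can dominate the whole benchmark.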
To make it even easier to work on test cases and optimizations, I have a small toolset ready to create micro benchmarks from small parts of the code. I used these to show the following test cases:

[Screenshots: BIOS F4, BIOS F5 and F8a]

What you see here are two micro benchmarks for the modular exponentiation as it is used in GPUPI. The left window (test-modpow-pibatches-dynamicdiv.exe) does multiple modpows with different base, modulo and exponent and shows more than twice the performance per batch for the F4 BIOS (~3 seconds vs. 8.x seconds). The right window (test-modpow-pibatches-staticmoddiv.exe) calculates only the third modpow from the left window over and over. Although that should be the same calculation, this time there is no difference between F4 and F5/F8a: both are ~1.4 seconds.

That's where it starts to get interesting for us! Why is it so much faster to calculate only one batch over and over (8.8 vs. 1.4 seconds), and where is the performance boost now? The devil is in the disassembly:

[Screenshot: disassembly of the two inner loops]

What you see here are the inner loops of the modular exponentiations. On the left is the slow multi version and on the right the faster third modpow. You need to know that the modulo is calculated using the remainder of a division. When you search for a div instruction in the faster code on the right, you won't find any. Because we declared the batch with a static variable (more or less), the compiler was able to optimize away the always horribly slow 64 bit div and filled in two multiplications, a bit shift right and a subtraction instead, which is way faster.

So now we know that these instructions are not the problem; they perform equally on both BIOS versions. And that leaves us with the solution: the performance of the 64 bit integer div instruction. Finally I was able to write the micro benchmarks that show the problem exactly in numbers:

[Screenshots: BIOS F4, BIOS F5/F8a]

From left to right:

64 bit integer multiplication
F4 ........... 0.84s
F5/F8a ... 0.84s

64 bit integer modulo
F4 ........... 13.7s
F5/F8a ... 33.86s

64 bit integer division
F4 .......... 13.69s
F5/F8a ... 33.86s

TL;DR:
- The performance difference is reproducible at any given time
- GPUPI does the work
- The 64 bit integer division instructions that calculate the modulos inside the modular exponentiation of the GPUPI core are responsible for the performance difference
- Starting with AGESA 1.1.0.3 on FM1, presumably all APUs calculate 64 bit integer divisions about 2.5 times slower than they could

Edited February 16, 2023 by _mat_
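The div-free remainder described in the post above (two multiplications, a right shift and a subtraction) can be sketched as follows. This is an illustration of the general multiply-by-reciprocal technique compilers use when the divisor is known at compile time, not the code the compiler actually emitted for GPUPI. The helper names `magic_for` and `mod_without_div` and the sample divisor are hypothetical, and because Python integers are arbitrary precision, the extra fix-up steps a real 64 bit implementation needs when the magic constant does not fit in 64 bits are not shown.

```python
def magic_for(divisor, bits=64):
    """Precompute (magic, shift) so that n // divisor == (n * magic) >> shift
    for every 0 <= n < 2**bits."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    # shift = bits + ceil(log2(divisor))
    shift = bits + (divisor - 1).bit_length()
    # magic = ceil(2**shift / divisor), done once, ahead of time
    magic = ((1 << shift) + divisor - 1) // divisor
    return magic, shift


def mod_without_div(n, divisor, magic, shift):
    """Remainder of n / divisor without dividing in the hot path."""
    quotient = (n * magic) >> shift   # multiplication 1 and the right shift
    return n - quotient * divisor     # multiplication 2 and the subtraction


# Hypothetical sample divisor; the result is compared against the % operator.
magic, shift = magic_for(1_000_003)
n = 2**64 - 1
print(mod_without_div(n, 1_000_003, magic, shift) == n % 1_000_003)
```

The expensive division happens only once, when the constant is prepared. That matches the observation that the static variant runs equally fast on every BIOS version: its inner loop never touches the divider.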
mickulty Posted September 11, 2018
Awesome work mat, brilliant how fast you nailed this as well!
Bilko (Crew) Posted September 11, 2018
Fantastic testing mate, well done, and thank you for digging into this. Thanks also to Mickulty for providing the tools to do so.
cbjaust Posted September 11, 2018 (edited)
There must be some other voodoo at play, because I saw zero difference with pre-1.1.0.3 AGESA BIOS versions on the GA-A75M-UD2H and GA-A75-D3H.
Edited September 11, 2018 by cbjaust
_mat_ (Author) Posted September 11, 2018
It is possible the OS plays into this as well. Did you try Windows 7?
_mat_ (Author) Posted September 11, 2018
Voodoo is only at play when there is not enough disclosure. There is always an explanation; start digging.
cbjaust Posted September 11, 2018
Well, I probably need some kind of guide for newbies, because I can never get older CPUs recognised, even ones that should be supported by OpenCL and AMD's APP, not to mention the difficulty of locating AMD's earlier SDK versions. Most of the time I'm just glad GPUPI recognises the CPU and runs. You've got a neat benchmark, but it's frustrating just to get it working.
_mat_ (Author) Posted September 11, 2018
It was never meant to run on CPUs in the first place (hence the name), but I'm glad it is used with both. Yeah, OpenCL is a turn-off, to say the least. But that will change soon with GPUPI 4.0. As you can see in the screens, the native path is already stable and faster than OpenCL on all platforms. The reason it's taking longer than anticipated is that an early release without good SSE and AVX support would end up in the same dilemma as GPUPI 3.3. And I really want that AVX support.
unityofsaints Posted September 11, 2018
5 hours ago, _mat_ said: "TL;DR: The performance difference is reproducible at any given time. GPUPI does the work. The 64 bit integer division instructions that calculate the modulos inside the modular exponentiation of the GPUPI core are responsible for the performance difference. Starting with AGESA 1.1.0.3 on FM1, presumably all APUs calculate 64 bit integer divisions about 2.5 times slower than they could."
Thanks for all the time you spent investigating this! It would be funny to submit a bug report to AMD just to see their reaction.
3 hours ago, cbjaust said: "There must be some other voodoo at play because I saw zero difference with pre 1.1.0.3 AGESA BIOS on GA-A75M-UD2H and GA-A75-D3H."
Are you using Win 7 64-bit SP1, AMD SDK 2.91 and GPUPI 3.2 non-legacy with HPET off?
Strunkenbold (Crew) Posted September 11, 2018 (edited)
Great work identifying the problem! And now some wild guessing, since I read that a lot of division operations happen (but I'm no programmer, so I don't actually understand at all what you wrote on the first page): http://www.planet3dnow.de/cgi-bin/newspub/viewnews.cgi?id=1334532731
Maybe their patch helps to restore performance? https://www.passmark.com/forum/performancetest/3705-amd-llano-a-series-benchmark-and-cpu-bug?t=3656
Edit: this would also explain why AGESA doesn't matter, as this needs to be done by the BIOS manufacturers.
Edited September 11, 2018 by Strunkenbold
Leeghoofd (Crew) Posted September 11, 2018
OK, I made a poll on whether to allow these subs, yes or no, for the ongoing Team Cup. The community decides :p allow-old-agesa-for-gpupi-subs-in-tc2018
I have no problem accepting these, as Matt confirmed nothing shady is going on...
_mat_ (Author) Posted September 11, 2018
Nice find! This looks exactly like the problem that we have encountered. The only thing that doesn't fit is the PassMark numbers. They are even worse than those in my micro benchmarks for div and modulo. Well, it might be possible to write a fix that sets the MSR mentioned in the PassMark forum to enable the division unit again. If this were a new CPU generation, I would do it. But for seven-year-old CPUs this is just overkill.
Btw, the article also explains why an old BIOS might not work. Windows could be responsible for disabling the division unit no matter what the BIOS says. So I guess that's why Windows 8 and Windows 10 show no performance advantage with pre-AGESA 1.1.0.3 BIOS versions. This might be true for Windows 7 as well when optional updates are installed. I didn't install everything on my test drive, just SP1 and nothing optional.
bigblock990 Posted September 11, 2018
Amazing work mat! So great to have a passionate developer active here.
Strunkenbold (Crew) Posted September 11, 2018
3 hours ago, _mat_ said: "Btw the article also explains why an old BIOS might not work. Windows could be responsible for disabling the division unit no matter what the BIOS says. So I guess that's why Windows 8 and Windows 10 show no performance advantage with pre AGESA 1.1.0.3 BIOS versions. This might be true for Windows 7 as well when optional updates are installed. I didn't install everything on my test drive, just SP1 and nothing optional."
Yes, I remember there was something similar with those unlocked Haswell mobile CPUs. You have to prevent Windows from loading the Intel firmware on boot. I don't think that you need to implement the fix in GPUPI. All that is needed is the patch exe from the PassMark guys. To quote their readme:
"Release 1.1
WIN32 release 3 Apr 2012
WIN64 release 3 Apr 2012
- Allow errata 665 work around patch to be removed. i.e If MSRC001_1029[31] is set to 1 (e.g. by BIOS), allow MSRC001_1029[31] to be set to 0."
This reminds me of The Stilt's Bulldozer Conditioner... All we need is someone to test whether it really works. I think I have a board and a CPU somewhere, both from scrap and not known to work. Maybe it's time to test this bundle now.
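To make the readme's wording concrete: MSRC001_1029[31] means bit 31 of the 64 bit model-specific register C001_1029h, and the patch toggles that one bit. The sketch below only shows the bit arithmetic on a plain integer. It does not read or write any MSR (that needs a kernel-mode driver, which is what the PassMark tool provides), and the helper names and the sample register value are made up for illustration.

```python
MSR_ADDRESS = 0xC0011029   # register named in the quoted readme
WORKAROUND_BIT = 31        # MSRC001_1029[31], the errata 665 workaround bit
MASK_64 = 0xFFFF_FFFF_FFFF_FFFF


def workaround_enabled(value):
    """True if bit 31 is set in a 64 bit register value."""
    return bool((value >> WORKAROUND_BIT) & 1)


def clear_workaround(value):
    """Return the value with bit 31 cleared and all other bits untouched."""
    return value & ~(1 << WORKAROUND_BIT) & MASK_64


def set_workaround(value):
    """Return the value with bit 31 set and all other bits untouched."""
    return (value | (1 << WORKAROUND_BIT)) & MASK_64


# Hypothetical register content, NOT a value read from real hardware.
sample = 0x0000_0000_8000_0010
print(hex(MSR_ADDRESS), workaround_enabled(sample),
      workaround_enabled(clear_workaround(sample)))
```

Whether clearing the bit is safe on a given system is a separate question; the workaround exists because of a CPU erratum.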
cbjaust Posted September 11, 2018
6 hours ago, unityofsaints said: "Thanks for all the time you spent investigating this! It would be funny to submit a bug report with AMD just to see their reaction. Are you using Win 7 64-bit SP1, AMD SDK 2.91 and GPUpi 3.2 non-legacy with HPET off?"
Not sure if it's HPET or some Windows Update making the performance 'normal', but a fresh 2008 R2 install and the F2 BIOS on the GA-A75-D3H did the trick. I was using a pretty much up-to-date Windows 7 x64 install before and had HPET on. So yeah. Interesting that you stumbled onto this mad performance, and great work by _mat_ verifying his software.
cbjaust Posted September 11, 2018
38 minutes ago, Strunkenbold said: "Yes I remember there was something similar with those unlocked Haswell mobile CPUs. You have to prevent that Windows loads the intel firmware on boot. I dont think that you need to implement the fix in GPUPI. All needed is the patch exe from the passmark guys. [...] This is reminds me of The Stilts Bulldozer Conditioner... All we need is someone to test if it really works. I think I have somewhere a board and a CPU both from scrap and not known to work maybe its time to test this bundle now."
The patch, when applied, immediately sets the performance back to 'normal'. Rerunning the patch and selecting no to the workaround brings back the mad performance.
Strunkenbold (Crew) Posted September 11, 2018
3 hours ago, cbjaust said: "The patch when implemented immediately sets the performance back to 'normal'. Rerunning the patch and selecting no to the workaround gets back the mad performance."
Thanks for testing this. Maybe _mat_ can confirm. I think this gets us to a comfortable position where we can safely allow this 'tweak'. It's just like what we did with The Stilt's work optimizing SuperPi performance on Bulldozer; afaik he was also messing around with CPU registers.