The Stilt's Book of Bulldozer - Revelations: Episode 2 (SuperPI / x87)

Massman · June 19, 2013

Originally posted over at XtremeSystems, but so incredibly epic that it needs to be here as well. All credit for the work goes to The Stilt, who seems to be able to outdo pretty much every R&D team in the industry (including AMD's own). This is the BEST I have seen in overclocking for a LONG, LOOOOONG time, and it beats 99.99% of the "world record" or live OC competition achievements as far as I'm concerned. This makes DDR3-4400 look like child's play ... this is what overclocking is all about.

Download

V1.02B: http://downloads.hwbot.org/downloads/tools/BDC_R1.02B.zip
V1.01B: http://downloads.hwbot.org/downloads/tools/BDC_R1.01B.zip
V1.00B: http://downloads.hwbot.org/downloads/tools/BDC_R1.00B.zip

Changelog

V1.02B: original post

Enhanced the NRAC fix

Added a UAC prompt (admin rights) for Windows Vista / 7 & 8.

Updated the AGESA version info

V1.01B: original post

Added a hardware flag to indicate that the errata has been fixed.

Changed how the software accesses the cores; tasks now complete quicker than before

Fixed an APU-specific bug

Added information about the most recent microcode and AGESA versions under the Info menu.

Some small changes to the GUI

V1.00B: original post

Initial release

THE STORY

Exactly two years ago, when I tested a Bulldozer-based Zambezi CPU for the first time, I was shocked.
The early sample units were even hotter and slower than the final silicon revision CPUs, which were finally released four months later.

One of the largest single let-downs came from way back: SuperPI.

SuperPI mainly uses legacy x87 instructions, which have been almost completely superseded. SuperPI gives no indication whatsoever of SMP performance, as it can only utilize a single thread. On top of that, it has no real-world use or purpose, as there are newer programs which can calculate pi almost 100 times faster.

Still, SuperPI can almost be considered an industry standard.

Nowadays it is generally a VERY poor indicator of real-world performance, yet it is so addictive for any old-school overclocker. It scales very well with CPU/NB/DRAM/IO performance, and tweaking it is a big challenge. An overclocker who has never benched SuperPI simply doesn't exist.

SuperPI has a special place in my heart simply because it was one of the first benchmarks I ever ran... almost 14 years ago...

So, why do all of the 15h (Bulldozer) based CPU/APU/NPUs perform so badly in SuperPI? Some people say it is because the 15h family has 50% fewer FP units per core than the preceding 10h family.

In the 15h family a compute unit (two cores) shares one FP unit, whereas the 10h/12h families had a dedicated FP unit for each core. If this were the only reason, the issue would be solved when the "slave" core of the CU is disabled, leaving a "private" FP unit for the "master" (BSC) core. However, this is not the case, and it shouldn't even be, as SuperPI is single-threaded, remember?

The caches in the 15h family have higher latency than those in the 10h family, for example, and SuperPI happens to love large, low-latency caches.

The 15h family was initially designed for high frequencies. Just like F1 engines, these chips produce no power at low revs. And unfortunately it currently doesn't seem possible to build an engine capable of revving high enough. We might discuss the caches more in "Episode 3"... if possible.

Agner Fog from Copenhagen University College of Engineering has written an excellent document on the instruction latencies of modern CPUs.

Values for the 10h family start on page 26, while the 15h family values are on page 36.

Anyway...

A few days ago, when I was doing some low-level testing for other purposes, I found something that didn't make any sense to me.

Now I roughly know what it is and what it does, but some questions remain: why does this "feature" exist in the first place, and why is it activated on all 15h family parts? I would normally assume it is a workaround for some erratum, but no bulletin exists for this one either. The feature also does not appear in any documentation, or if it does, only AMD has access to the required level. I find it hard to believe that it is a design issue, as the affected instructions work fine (but slowly), and it has existed since early Zambezi revisions, is still present in Richland, and will probably remain beyond it (within family 15h)...

I'd say it is either an erratum fix or an erratum fix gone wrong. If it is a programming mistake which has gone unnoticed for the last two years... that would just make me sad.

Parts affected: AMD Barracuda (Zambezi, Vishera), AMD Comal (Trinity, Richland), AMD Virgo (Trinity, Richland)

Effect: A massive performance hit in applications that heavily utilize x87 instructions.

Negative effects: TBD, none found yet. The performance in non-x87 applications remains the same or improves very slightly. No instability, increased power consumption, reduced overclockability or anything else abnormal has been observed. However, a final conclusion requires far more extensive testing than I am able to do myself.

After the fix has been applied, SuperPI shows an 18-30% improvement in performance. The bigger the calculation, the bigger the improvement. Since this kind of fix is quite unheard of, I knew that I would be crucified if I made such claims without providing any evidence.

I generally hate doing videos, but this time it was mandatory. I apologize for the quality; 1080p is available, but it is quite grainy due to poor lighting. It was a cloudy day in Helsinki today.

The video shows a few important things:

In the video the fix is called "The Plow of Bulldozer"

SuperPI 1.5 XS Mod validated by online MD5 checksum

CPU-Z 1.64.3 x32 validated by online MD5 checksum (can be found in Stasio's CPU-Z thread)

The clocks are being shown during the calculation (look for the affinity and CPU-Z core selection)

An external clock reference is provided (to prove there is no tampering with the timers, i.e. "Lab Burst" by MSI)

The air cooled setup is shown and so are the CPU temperatures (HWMonitor)

For the 32M SuperPI run (time), you might want to look at a reference from the HWBot Piledriver 5G challenge thread.

http://hwbot.org/submission/2386335_the_stilt_superpi___32m_a10_5800k_18min_14 sec_718ms/

A 39-second better time at stock CPU clocks (4.1GHz, NB 2500, MEMCLK DDR-2400) than on a 5GHz Trinity with 2777MHz NB and DDR-2666 memory clocks.
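To put that 39-second gap in proportion, here is a minimal Python sketch of the arithmetic. It takes the reference time from the linked submission (18min 14sec 718ms) and assumes "39 seconds" means exactly 39.0 s, so the derived time is only approximate. The function names are made up for illustration. Note that this compares a fixed chip at stock clocks against an unfixed one at 5GHz, so it is not the same thing as the 18-30% same-clock gain mentioned earlier.

```python
def to_seconds(minutes: int, seconds: int, millis: int) -> float:
    """Convert a SuperPI-style result (min, sec, ms) to seconds."""
    return minutes * 60 + seconds + millis / 1000.0


def time_reduction_pct(reference_s: float, new_s: float) -> float:
    """Percentage by which new_s is shorter than reference_s."""
    return (reference_s - new_s) / reference_s * 100.0


def speedup(reference_s: float, new_s: float) -> float:
    """Ratio of reference time to new time (greater than 1 means faster)."""
    return reference_s / new_s


# Reference: the 5GHz Trinity submission linked above (18min 14sec 718ms).
REFERENCE_S = to_seconds(18, 14, 718)
# Assumption: "39 seconds better" is taken as exactly 39.0 s.
FIXED_S = REFERENCE_S - 39.0

if __name__ == "__main__":
    print(f"reference run : {REFERENCE_S:.3f} s")
    print(f"fixed run     : {FIXED_S:.3f} s (approximate)")
    print(f"time reduction: {time_reduction_pct(REFERENCE_S, FIXED_S):.2f} %")
    print(f"speedup       : {speedup(REFERENCE_S, FIXED_S):.4f}x")
```

The two figures differ slightly because one is relative to the old time and the other to the new time, which is worth keeping in mind whenever someone quotes an "improvement" percentage.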

Since I 'happened' to have some LN2 at my disposal, I decided to do some high clock SuperPI runs on Richland.

AMD 32nm SuperPI 32M record taken easily. Tomorrow, when I throw in a Vishera, the reign of 10h should finally be over.

All of the runs are either completely or partially on video.

I will upload them once I have time to edit them. I've filmed around 28GB worth of video during the last 48 hours.

Here's his fastest-ever AMD SuperPI 1M

And then the software is almost ready

There are two kinds of news: bad and good.

Let's get rid of the bad ones first:

Originally I tested this fix on three different CPU/APUs (Richland, Trinity and Vishera).

When I went to verify the effects of the fix on Zambezi the system crashed immediately once the necessary changes were written.

After some research I noticed that these registers do not respond on Zambezi based CPUs.

Upon reading, all of them return null values and crash the system unless a special method is used.

At first it appeared that these registers do not exist on Zambezi, however after digging a bit deeper I found indications that the registers are there... But for some reason AMD seems to have protected them with an ESI/EDI password on Zambezi.

They do not require any passwords on any Piledriver based APU/CPU.

So the fix will not be available for Zambezi users.

Sorry for the massive let-down

Then the good news:

The software is pretty much finished.

It should be available for download within this week.

After the let-down on Zambezi I felt that something had to be done for Zambezi too.

While it does not result in as massive a boost as the original fix does, it still gives something:

SuperPI 1M: > 1 second improvement

SuperPI 8M: > 10 second improvement

SuperPI 32M: > 35 second improvement

It is called "Zambezi Stack Special (PD)".

Note: There might also be some performance degradation in some applications when enabled (Zambezi vs. Vishera effect).

Zambezi is significantly faster than Vishera in SuperPI by default so the difference between a "fixed" Vishera and a tuned Zambezi won't be that massive after the "Zambezi Stack Special" configuration.

Edited June 26, 2013 by Massman

Massman · June 19, 2013

No words for this, but The Stilt is a category on his own.

TerraRaptor · June 19, 2013

very deep research by Stilt, bravo. What a shame AMD engineers were not able to find and fix this for such a long time.

Alex@ro · June 19, 2013

This is one man that vendors should be fighting to hire; such an addition to a team would mean a BIG step forward for that company. Big congrats to The Stilt!

Bobnova · June 19, 2013

Absolutely amazing, truly.

hokiealumnus · June 19, 2013

Absolutely amazing, truly.

+1.

I have an FX-8150 and an FX-8350 sitting around that could use this treatment when he releases the software.

Xtreme Addict · June 19, 2013

The Stilt is a magician

hokiealumnus · June 19, 2013

Hardware geeks: "Bulldozer/Piledriver's single threaded performance isn't so hot..."

TheStilt: "There's an app for that."

l0ud_sil3nc3 · June 19, 2013

I have been following this thread @ XS and I am glad to see the Stilt gets the much deserved attention for this awesome project.

Very few give back to the community like this.

thebanik · June 19, 2013

Wow!!! Now that's what hard work, knowledge and dedication look like.

flanker · June 20, 2013

He is the man... If someone has a great Vishera, maybe he could get around 9s in SuperPI! (Andres Yang 8800MHz chip)

The Stilt · June 21, 2013

So, it is Friday today, isn't it?

Bulldozer Conditioner R1.00B

The checksum (MD5) for the zip file is: 418522A93F241CF14EB1D775839AB083

If the checksum does not match, the package has been tampered with: delete it and re-download from another location.

The checksum can be calculated online if you don't have suitable software on your computer.

http://onlinemd5.com/
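If you would rather check the file locally than upload it anywhere, here is a minimal Python sketch using only the standard library. The expected value is the MD5 quoted above; the local file name `BDC_R1.00B.zip` is an assumption taken from the download link, and the helper names are just for illustration.

```python
import hashlib
import os

# MD5 published in the post above for the R1.00B zip.
EXPECTED_MD5 = "418522A93F241CF14EB1D775839AB083"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Compare case-insensitively, since hex digests may be upper or lower case."""
    return md5_of_file(path).lower() == expected.strip().lower()


if __name__ == "__main__":
    # Assumed local file name (taken from the download link); adjust as needed.
    zip_path = "BDC_R1.00B.zip"
    if os.path.exists(zip_path):
        print("OK" if matches(zip_path) else "MISMATCH: delete and re-download")
    else:
        print(f"{zip_path} not found")
```

Run it from the folder containing the downloaded zip. A mismatch means the file differs from the one the checksum was computed for, whatever the cause.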

There is not a single bit of malicious code either in the driver or the software itself.

If you are unsure, please check the contents with https://www.virustotal.com

Supported OS: Windows XP / Windows Vista / Windows 7 / Windows 8* (32 & 64-bit)

* Not tested

The x86 version works in both 32 & 64-bit operating systems, while the x64 version is 64-bit only.

The functionality itself is identical between the versions.

Known limitations: Up to 16 CUs (32 cores) supported at the moment. Support for 32CUs (64 cores) will be added in the next version.

Also, the R1.00B (Beta) version does not contain the feature to patch the microcode block, as I could not make it work stably enough.

The "Errata Fix" button will fix the major errata which can be patched without updating the microcode.

This feature should not be used as a permanent solution; a BIOS update (updated AGESA + microcode) should still be the primary method.

Note: Enabling the "Zambezi Stack Special (PD)" feature might cause undefined behavior, so each user should test its functionality on their own. Some applications might show a minor slowdown in performance; SuperPI, however, receives a nice boost.

Note: "x87 instruction (NRAC) block" -> Enabled means that the instruction is blocked (default on all 15h family APU/CPU/NPUs). Disabling it makes SuperPI "a bit" faster.

There are most certainly some bugs, so in case you come across one, please report it in this thread.

Experiences are very welcome too.

Now it is time for the midsummer parties, so I might be away for a day or two.

Depending on how epic the headache shall be

Edited June 21, 2013 by The Stilt

Massman · June 21, 2013

Re-hosted: http://downloads.hwbot.org/downloads/tools/BDC_R1.00B.zip

Who's going to be the first to set a new record?

davetheshrew · June 21, 2013

This is amazing, must try it later, Thanks!

Strat · June 21, 2013

Wow man! This is huge work! Huge respect for that!

Toolius · June 21, 2013

This is true overclocking... hats off Stilt... *Bows in respect*

There is no replacement for true genius and you just proved it!!!!

Gyrock · June 21, 2013

You are the man for AMD fans, Stilt!! I'm really impressed with your GREAT WORK!! Thanks a lot

Redwoodz · June 21, 2013

The real question is: why is this feature there? There is no way it can be just an oversight, not after all the grief AMD got with the Vishera release. Conspiracy theories, anyone? Oh, to be a fly on the wall...

Well done mate!

IanCutress · June 21, 2013

I found something that didn't make any sense to me.
Now I roughly know what it is and what it does, but still some questions remain: Why does this "feature" exist in the first place and why it is activated on all 15h family parts.

Would love to know what the feature is. It's great you have a fix for it, but what exactly is the issue?

It sounds like there was an issue and AMD had to force a slow calculation path on certain instructions in order to correct it. Anyone remember the Pentium FDIV bug?

Dreadlockyx · June 21, 2013

Impressive work there ! Nice one, Stilt !

sumonpathak · June 22, 2013

nice!!! time to rebench

btw... just for the record... is the patch accepted on HWBOT?

DOM. · June 22, 2013

How can this app be accepted but the one to disable nvidia tessellation wasn't ??

Freakezoit · June 22, 2013

I fully agree with DOM.

http://forum.hwbot.org/showpost.php?p=144489&postcount=69

GENiEBEN · June 22, 2013

How can this app be accepted but the one to disable nvidia tessellation wasn't ??

Because the Tess patch wasn't done at driver level, sorry. It only fooled the wrapper into launching with Tess off, among other settings.

rbuass · June 22, 2013

Very very good
