HWBOT Community Forums


_mat_ last won the day on July 22

_mat_ had the most liked content!

Community Reputation

88 Excellent

About _mat_

  • Rank
    robo cop
  • Birthday 04/11/1982



  1. GPUPI - SuperPI on the GPU: Yes, it's ok.
  2. Congrats Allen! So much research and effort behind this awesome score!
  3. Congrats! Is Windows Server 2012 really faster than Windows 10?
  4. Holy hell! Are we really benching 8 cores on 7G already? Great score, congrats!
  5. Congrats Bruno! Nice find on the new Intel OpenCL driver. It implicitly makes use of AVX-512 instructions for the first time, which explains the nice boost in case anybody wonders: https://software.intel.com/en-us/articles/opencl-drivers PS: The screenshot is a bit unlucky though, because the message box hides this information.
  6. Use Windows 7 for old hardware; it will allow RTC because no clock drift is possible there. GPUPI follows the HWBOT rules there. As for the hardware detection bug on old hardware, I will look into it.
  7. Great work! The Taichi seems to be an awesome board.
  8. Every bench works differently and therefore has different needs. There is nothing wrong with that. Knowing the benches, the hardware, plus all these extra tricks like the right BIOS version/OS/driver/mod is exactly what overclocking is about. You know, Turrican was not a talkative person, but if you asked him something very specific about a bench and an (old) platform, he could talk a good while about all these little things. That showed why he really was that good. As I said, nothing wrong with that ...
  9. If you really think that, then you are on the wrong forum. PS: I just flashed an FM1 board 20 times. Seems like I'm crazy in the coconut. 😵
  10. Nice find! This looks exactly like the problem that we have encountered. The only facts that don't fit are the PassMark numbers. They are even worse than those in my micro benchmarks for div and modulo. Well, it might be possible to write a fix that sets the MSR mentioned in the PassMark forum to enable the division unit again. If this were a new CPU generation, I would do it. But for seven-year-old CPUs this is just overkill. Btw the article also explains why an old BIOS might not work: Windows could be responsible for disabling the division unit no matter what the BIOS says. So I guess that's why Windows 8 and Windows 10 show no performance advantage with pre-AGESA BIOS versions. This might be true for Windows 7 as well when optional updates are installed. I didn't install everything on my test drive, just SP1 and nothing optional.
  11. It was never meant to run CPUs in the first place (hence the name), but I'm glad it is used with both. Yeah, OpenCL is a turn-off to say the least. But that will change soon with GPUPI 4.0. As you can see in the screens, the native path is already stable and faster than OpenCL on all platforms. The reason it's taking longer than anticipated is that an early release without good SSE and AVX support would end up in the same dilemma as GPUPI 3.3. And I really want that AVX support. 😎
  12. Voodoo is only at play when there is not enough disclosure. There is always an explanation, start digging.
  13. It is possible the OS plays into this as well. Did you try Windows 7?
  14. As promised I looked into the performance "boost" on FM1 with GPUPI.
  15. A recent discussion in the Team Cup 2018 thread unearthed a rather peculiar performance boost in GPUPI with Llano CPUs. The boost happens with all BIOS versions below AGESA and shows nearly twice the performance in GPUPI, while other benchmarks are not significantly affected. Thanks to @mickulty I was able to look into this issue to help the moderation of this Team Cup stage.

My first step was to reproduce the performance boost. I tried Windows 7 SP0 and SP1, and both showed the boost on a GIGABYTE GA-A75-UDH4 with BIOS version F4. Flashing to F5 or F8a removed the performance advantage again. This is reproducible every time, without a single exception or variation.

The next step on my todo list was to check that GPUPI actually does the work. I validated that using GPUPI's intermediate result dumping feature, which creates a dataset that is normally used to drive virtual devices that test the implementation without actually calculating anything. Side note: these virtual devices are needed to test GPUPI's thread scheduler and its scaling. The intermediate results were 100% valid and showed that the benchmark calculates 100M correctly, without any shortcuts.

Next up was OpenCL. Maybe the IGP of the APU helps with the work? Although that is theoretically impossible, because Llano's integrated GPU does not support double precision calculations, this was a good opportunity to try the new native path of GPUPI 4, currently in its alpha version. It is based on OpenMP, a threading model only compatible with CPUs. The resulting score is even better without using OpenCL:

BIOS F4:

BIOS F5:

With the native path the calculation is completely transparent in my disassembler, so it is easy to statically analyze the involved instructions. I was able to narrow it down to the 64 bit integer modular exponentiation. To make it easier to work on test cases and optimizations, I have a small toolset ready to create micro benchmarks from small parts of the code.
I used these to build the following test cases:

BIOS F4:

BIOS F5 and F8a:

What you see here are two micro benchmarks for the modular exponentiation as it is used in GPUPI. The left window (test-modpow-pibatches-dynamicdiv.exe) runs multiple modpows with different base, modulo and exponent and shows more than twice the performance per batch with the F4 BIOS (~3 seconds vs. 8.x seconds). The right window (test-modpow-pibatches-staticmoddiv.exe) calculates only the third modpow from the left window over and over. Although that should be the same calculation, this time there is no difference between F4 and F5/F8a - both take ~1.4 seconds.

That's where it starts to get interesting for us! Why is it so much faster to calculate only one batch over and over (8.8 vs. 1.4 seconds), and where is the performance boost now? The devil is in the disassembly:

What you see here are the inner loops of the modular exponentiations. On the left is the slow multi-batch version, on the right the faster third modpow. You need to know that the modulo is calculated as the remainder of a division. When you search for a div instruction in the faster code on the right, you won't find any. Because the batch was declared with a static variable (more or less), the compiler was able to optimize away the always horribly slow 64 bit div and fill in two multiplications, a right shift and a subtraction instead, which is way faster. So now we know that these instructions are not the problem; they perform equally on both BIOS versions.

And that leaves us with the solution: the performance of the 64 bit integer div instruction itself. Finally I was able to write micro benchmarks that show the problem exactly, in numbers:

BIOS F4:

BIOS F5/F8a:

From left to right:

64 bit integer multiplication:
F4 ........... 0.84s
F5/F8a ... 0.84s

64 bit integer modulo:
F4 ........... 13.7s
F5/F8a ... 33.86s

64 bit integer division:
F4 .......... 13.69s
F5/F8a ...
33.86s

TL;DR:

  • The performance difference is reproducible at any given time
  • GPUPI does the work
  • The 64 bit integer division instructions used to calculate the modulos inside the modular exponentiation of the GPUPI core are responsible for the performance difference
  • Starting with AGESA on FM1, presumably all APUs calculate 64 bit integer divisions about 2.5 times slower than they could. 😑