Jump to content
HWBOT Community Forums

_mat_

Members
  • Posts

    1003
  • Joined

  • Last visited

  • Days Won

    41

Everything posted by _mat_

  1. Holy hell! Are we really benching 8 cores on 7G already? Great score, congrats!
  2. Congrats Bruno! Nice find on the new Intel OpenCL driver. It implicitely makes use of AVX512 instructions for the first time, so that gives the nice boost if anybody wonders: https://software.intel.com/en-us/articles/opencl-drivers PS: The screenshot is a bit unlucky though because the message box hides this information.
  3. Use Windows 7 for old hardware, it will allow RTC because there is no drift possible there. GPUPI follows the rules of HWBOT there. As for the hardware detection bug on old hardware, I will look into it.
  4. Every bench works differently and therefor has different needs. There is nothing wrong with that. Knowing the benches, the hardware plus all these extra tricks like the right BIOS version/OS/driver/mod is exactly what overclocking is about. You know, Turrican was not a talkative person but if you asked him something very specific about a bench and a (old) platform he could talk a good while about all these little things, that showed why he really was that good. As I said, nothing wrong with that ...
  5. If you really think that then you are on the wrong forum. PS: I just flashed a FM1 board 20 times. Seems like I'm crazy in the coconut. ?
  6. Nice find! This looks exactly like the problem that we have encountered. The only fact that doesn't fit are the PassMark numbers. They are even worse then those in my micro benchmarks for div and modulo. Well, it might be possible to write a fix that sets the MSR mentioned in the PassMark forum to enable the division unit again. If this were a new CPU generation, I would do it. But for seven year old CPUs this is just overkill. Btw the article also explains why an old BIOS might not work. Windows could be responsible for disabling the division unit no matter what the BIOS says. So I guess that's why Windows 8 and Windows 10 show no performance advantage with pre AGESA 1.1.0.3 BIOS versions. This might be true for Windows 7 as well when optional updates are installed. I didn't install everything on my test drive, just SP1 and nothing optional.
  7. It was never meant to run CPUs in the first place (hence the name), but I'm glad it is used with both. Yeah, OpenCL is a turn-off to say the least. But that will change soon with GPUPI 4.0. As you can see in the screens the native path is already stable and faster than OpenCL on all platforms. The reason why it's taking longer than anticipated is, that an early release without good SSE and AVX support will end up in the same dilemma as GPUPI 3.3. And I really want that AVX support. ?
  8. Voodoo is only at play when there is not enough disclosure. There is always an explanation, start digging.
  9. It is possible the OS plays into this as well. Did you try Windows 7?
  10. As promised I looked into the performance "boost" on FM1 with GPUPI.
  11. A recent discussion in the Team Cup 2018 thread unearthed a rather peculiar performance boost in GPUPI with Llano CPUs. The boost happens with all BIOS versions below AGESA 1.1.0.3 and shows nearly twice the performance in GPUPI while other benchmarks are not significantly affected. Thanks to @mickulty I was able to look into this issue to help the moderation of this Team Cup stage. My first step was reproduce the performance boost. I tried Windows 7 SP0 and SP1 and both showed the boost on a GIGABYTE GA-A75-UDH4 with BIOS version F4. Flashing to F5 or F8a removed the performance advantage again. This can be reproducable every time without a single exception or variation. The next point on my todo list was to check if GPUPI "does the work". I validated that by using GPUPI's intermediate result dumping feature, that creates a dataset which is normally used for virtual devices to test the implementation without actually calculating anything. Side note: These virtual devices are needed to test GPUPI's thread scheduler and its scaling. The intermediate results were 100% valid and showed that the benchmark is calculating 100M correctly without any shortcuts. Next up was OpenCL. Maybe the IGP of the APU helps with the work? Although theoretically impossible because Llano's integrated GPU does not support double precision calculations, this was a good opportunity to try the new native path of GPUPI 4 that's currently in its Alpha version. It is based on OpenMP, a threading model only compatible to CPUs. The resulting score is even better without using OpenCL: BIOS F4: BIOS F5: With the native path the calculation is completely transparent in my disassembler, so it is easy to statically analyze the involved instructions. I was able to narrow it down to the 64 bit integer Modular exponentiation. To make it even easier to work on test cases and optimizations I have a small toolset ready to create micro benchmarks with small parts of the code. I used these to show the following test cases: BIOS F4: BIOS F5 and F8a: What you see here are two micro benchmarks for the modular exponentiation as it is used in GPUPI. The left window (test-modpow-pibatches-dynamicdiv.exe) does multiple modpows with different base, modulo and exponent and shows more than twice the performance per batch for the F4 BIOS (~3 seconds VS 8.x seconds). The right window (test-modpow-pibatches-staticmoddiv.exe) calculates only the third modpow from the left window over and over. Although that should be the same calculation this time there is no difference between F4 and F5/F8a - both are ~1.4 seconds. That's where it starts to get interesting for us! Why is it so much faster to calculate only one batch over and over (8.8 VS 1.4 seconds) and where is the performance boost now? The devil is in the disassembly: What you see here are the inner loops of the modular exponentiations. On the left is the slow multi version and on the right the faster 3rd modpow. You need to know now that the modulo is calculated using the remainder of a division. When you search for a div instruction in the faster code on the right you won't find any. That's because we declared the batch with a static variable (more or less) the compiler was able to optimize the always horribly slow 64 bit div and filled in two multiplication, a bit shift right and a subtraction instead, which is way faster. So now we know that these instructions are not the problem, the perform equally on both BIOS versions. And that leaves us with the solution: The performance of the 64 bit integer div instruction. Finally I was able to write the micro benchmarks that exactly show the problem in numbers: BIOS F4: BIOS F5/F8a: From left to right: 64 bit integer multiplication: F4 ........... 0.84s F5/F8a ... 0.84s 64 bit integer modulo F4 ........... 13.7s F5/F8a ... 33.86s 64 bit integer division F4 .......... 13.69s F5/F8a ... 33.86s TL;DR: The performance difference is reproducable at any given time GPUPI does the work The 64 bit integer division instructions to calculate the modulos inside the Modular exponentiation of the GPUPI core are responsible for the performance difference Starting with AGESA 1.1.0.3 on FM1 presumably all APUs calculate 64 bit integer divisions about 2,5 times slower than it could be. ?
  12. No, it's not a benchmark. Suicide runs are very hard to validate because the result relies 100% on a few MSRs or CPUID registers that are read and interpreted. Remember the FM1 locked multi bug? Very satisfying to see a CPU go as high as you want - but it was only the number in the CPU-Z window of course, nothing real. I'd say it's an impossible job to keep the DB clean in this category. It should be discussed if points are removed there. This is only a marketing category anyway. "Wow, 2999WX on 6GGGG ALL CORES!11" ....yeah. Strunkenbold, the way I see it, it's impossible to keep all categories clean that rely on user input, only screenshots and simply not enough and unsafe data points.
  13. Leeghoofd has tested old BIOS versions on an ASUS board as well and the performance boost can be replicated there as well. It will be fun to get behind this but you are right. Leeghoofd should decide what's best for the competition and I will start a new thread as soon as I get the board to report my findings.
  14. The double to integer conversion instructions might be used, but it's hard to say. There could also be a specific OpenCL code path. I don't like that kind of code magic, that's why the new native path is coded by hand. And even then it depends on many different factors if the CPU can be fully used or any bottlenecks occur.
  15. The EVGA SR-2 issue was a timer bug with RTC on this specific mainboard (as far as we currently know). It was not a GPUPI related issue as many other benchmarks rely on RTC as well. CINEBENCH, Aquamark and various older versions of 3DMark rely on the exact same Windows API function (timeGetTime). Which is in itself the same as GetTickCount (ie SuperPI) .. so yeah, all these benchmarks are still skewed/bugged on the SR-2. SSE2 and SSE3 are the most important extensions for GPUPI using OpenCL on CPUs. SSE4a (also supported by Llano), SSE4.1 and SSE4.2 don't add anything particular interesting for the calculation, so that shouldn't make any difference in comparison to Ivy Bridge. As for the bugged output, it's simply impossible. Really. There is no way to get to the result without calculating all partial results and accumulating them precisely. The hexadecimal digits next to the result are not an additional checksum, these digits ARE the result and therfor validate that the calculation was 100% successful beyond any doubt (unless you cheat). From a technical standpoint it is not necessary for the moderation to intervene here. No rule was broken, the timer works, no cheating happend; this is simply a hardware/software combination that is running faster. We don't know what happened in between these BIOS versions and it's quite possible that a patched errata had a huge performance impact on the 64 bit integer and/or double precision performance of Llano. The only way to find out would be to have a deep look inside the calculation to find out what instructions are actually performed and measure them. This could be a great find, so hell yeah .. if anybody sends me the mainboard I would go for it (I have a 3870K btw but no board).
  16. I saw your stream multiple times and measured your final run too. I know it's realtime. But I was actually talking about the slower runs that mickulty has on other mainboards.
  17. Nice research, thanks for that! First and foremost: It's impossible to get the right result with GPUPI without "getting the work done". GPUPI calculates the result digits of pi by accumulating millions or billions of intermediate results. The final sum of all these calculations are the digits of pi which are written next to the result time. There is no way to break this if you don't fuck the benchmark (= knowingly cheat by changing memory or executable code). What is always possible (also unknowingly or unwillingly) is to run into unreliable time measurement of the result. Are you sure that these results correlate 100% with the real time the benchmark actually ran. It should be easy to see when one second is displayed as two in the benchmark. And when GPUPI is using RTC the same time skewing should happen with the OS clock in the task bar which uses the same hardware and software functionality to measure time. If the slower runs are really timed correctly then there is only one explanation: It's actually faster.
  18. OS is different, that's for sure. Graphics card as well, but I doubt that's the issue. Do you have a Windows 7 install lying around?
  19. Thank you for the detailed report, much appreciated. These seem to be two unrelated errors. HWiNFO is not compatible with this system/platform, you can try "Safe Mode" though before disabling it. The second problem is related to the gathering of various system information (mostly WMI) and/or result encryption. It's somehow related to these old sockets, another user had the same problem with an old Athlon 64 system as well. I will look into it, hope I still have test hardware here.
  20. Were you able to scroll down in the debug log after the error occurs? It only shows the first few lines. Btw, there is something very wrong with the timer on this system. Seems like QueryPerformanceFrequency() is returning a crazy high frequency, that can't be correct. But that shouldn't be part of this particular problem.
  21. I have reuploaded GPUPI 3.2 with the latest HWiNFO libraries. It is still called 3.2, just be sure that the HWiNFO*.dll is dated 13th of August 2018. GPUPI 3.3.2 was updated as well. Download: https://www.overclockers.at/news/gpupi-international-support-thread
×
×
  • Create New...