HWBOT Community Forums

The Official Team CUP 2018 DDR3 stage thread:


Leeghoofd

All available evidence speaks against some magic OCL performance boost with this specific AGESA. All other K10 CPUs perform more or less the same. Llano performs the same on every board / BIOS version except this one specific combination. If there were such a performance gain on Llano (even at the price of instability), it would be publicly known.

There is no change in the Llano architecture that would allow such a performance boost compared to all other K10. From the other GPUPI results it seems the AMD OCL driver benefits greatly from SSE 4.1... and to some extent even from SSSE3. This is the reason K10 is slow in this benchmark compared to Core2, Nehalem or even 15h-based processors. K10 lacks these instructions.

I'm sorry but this really sounds like a bug - either in the OCL driver or the benchmark itself or maybe something else entirely. It is not a random thing, as it can be reproduced... after all, not so long ago there was a similar problem with GPUPI on dual socket 1366 machines, which also seemed to be much faster than common sense would suggest... and as it turned out, it was a bug.


1 hour ago, havli said:

All available evidence speaks against some magic OCL performance boost with this specific AGESA. All other K10 CPUs perform more or less the same. Llano performs the same on every board / BIOS version except this one specific combination. If there were such a performance gain on Llano (even at the price of instability), it would be publicly known.

There is no change in the Llano architecture that would allow such a performance boost compared to all other K10. From the other GPUPI results it seems the AMD OCL driver benefits greatly from SSE 4.1... and to some extent even from SSSE3. This is the reason K10 is slow in this benchmark compared to Core2, Nehalem or even 15h-based processors. K10 lacks these instructions.

I'm sorry but this really sounds like a bug - either in the OCL driver or the benchmark itself or maybe something else entirely. It is not a random thing, as it can be reproduced... after all, not so long ago there was a similar problem with GPUPI on dual socket 1366 machines, which also seemed to be much faster than common sense would suggest... and as it turned out, it was a bug.

The SR-2 bug was a timekeeping bug. This is completely different, as the work is getting done exactly as fast as the benchmark says it is. So the only way it could be a bug is if the amount of work getting done was different. I will upload a video with a stopwatch soon; I'm currently doing my own testing and seeing the exact same thing. It might be a bug, however it would be absolutely stupid to call it a timekeeping bug, as it's clearly not. Comparing it to a timekeeping bug is comparing apples to oranges. So since the only way it could be a bug is if it's affecting the workload, I ask again: how do you propose that the AGESA is altering the workload?

Edited by yosarianilives

The EVGA SR-2 issue was a timer bug with the RTC on this specific mainboard (as far as we currently know). It was not a GPUPI-related issue, as many other benchmarks rely on the RTC as well. CINEBENCH, Aquamark and various older versions of 3DMark rely on the exact same Windows API function (timeGetTime), which is in itself the same as GetTickCount (i.e. SuperPI)... so yeah, all these benchmarks are still skewed/bugged on the SR-2.
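To illustrate how a timekeeping bug can be told apart from a real speed-up, here is a minimal Python sketch that times the same workload with two clock sources and checks whether they agree, which is the same idea as filming a run next to a stopwatch. The function names and the 5% tolerance are hypothetical and not taken from GPUPI or HWBOT tooling. Which OS timers back `time.monotonic` and `time.perf_counter` depends on the platform and Python version, so treat the two clocks as stand-ins for "benchmark timer" and "independent reference".

```python
import time


def timed_with_two_clocks(work):
    """Run work() once and return its duration according to two clock sources."""
    start_mono, start_perf = time.monotonic(), time.perf_counter()
    work()
    end_mono, end_perf = time.monotonic(), time.perf_counter()
    return end_mono - start_mono, end_perf - start_perf


def clocks_agree(seconds_a, seconds_b, tolerance=0.05):
    """True when two durations differ by at most `tolerance` (relative).

    The 5% default is an arbitrary illustration value, not an HWBOT rule.
    """
    longest = max(seconds_a, seconds_b)
    if longest <= 0:
        return True
    return abs(seconds_a - seconds_b) / longest <= tolerance


if __name__ == "__main__":
    mono, perf = timed_with_two_clocks(lambda: sum(range(2_000_000)))
    print(f"monotonic: {mono:.3f}s  perf_counter: {perf:.3f}s")
    print("clocks agree" if clocks_agree(mono, perf) else "clocks disagree")
```

If both clocks (or the benchmark and a stopwatch) report the same duration, a skewed timer is an unlikely explanation for a faster run.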

SSE2 and SSE3 are the most important extensions for GPUPI using OpenCL on CPUs. SSE4a (also supported by Llano), SSE4.1 and SSE4.2 don't add anything particularly interesting for the calculation, so that shouldn't make any difference in comparison to Ivy Bridge.

As for the bugged output, it's simply impossible. Really. There is no way to get to the result without calculating all partial results and accumulating them precisely. The hexadecimal digits next to the result are not an additional checksum; these digits ARE the result and therefore validate that the calculation was 100% successful beyond any doubt (unless you cheat).
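To make the "digits ARE the result" point concrete, here is a small double-precision sketch of BBP-style hexadecimal digit extraction for pi. It assumes the benchmark's calculation is of this general kind; it is not GPUPI's actual kernel, and the names `_series` and `pi_hex_digits` are made up for the example. Every term of the four partial sums feeds into the final fraction, so skipping or mis-accumulating any of them changes the printed digits.

```python
def _series(j, n):
    """Fractional part of sum over k of 16**(n-k) / (8k + j)."""
    total = 0.0
    # Terms with k <= n: modular exponentiation keeps only the fractional part.
    for k in range(n + 1):
        denom = 8 * k + j
        total = (total + pow(16, n - k, denom) / denom) % 1.0
    # Terms with k > n shrink by a factor of 16 each step, so a few suffice.
    k = n + 1
    while True:
        term = 16.0 ** (n - k) / (8 * k + j)
        if term < 1e-17:
            break
        total += term
        k += 1
    return total % 1.0


def pi_hex_digits(n, count=8):
    """Hex digits of pi starting right after position n behind the point.

    Double precision limits this sketch to roughly 8 to 10 reliable digits.
    """
    frac = (4 * _series(1, n) - 2 * _series(4, n)
            - _series(5, n) - _series(6, n)) % 1.0
    digits = []
    for _ in range(count):
        frac *= 16
        digit = int(frac)
        digits.append("0123456789ABCDEF"[digit])
        frac -= digit
    return "".join(digits)


if __name__ == "__main__":
    print(pi_hex_digits(0))  # 243F6A88 (pi = 3.243F6A88... in hexadecimal)
```

A real implementation needs far more precision than doubles give at the 100M-digit mark, which fits the remark about 64-bit integer and double-precision performance elsewhere in the thread.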

From a technical standpoint it is not necessary for the moderation to intervene here. No rule was broken, the timer works, no cheating happened; this is simply a hardware/software combination that is running faster. We don't know what happened in between these BIOS versions and it's quite possible that a patched erratum had a huge performance impact on the 64-bit integer and/or double-precision performance of Llano. The only way to find out would be to have a deep look inside the calculation to find out what instructions are actually performed and measure them. This could be a great find, so hell yeah... if anybody sends me the mainboard I would go for it (I have a 3870K btw but no board). :)


1 minute ago, _mat_ said:

We don't know what happened in between these BIOS versions and it's quite possible that a patched erratum had a huge performance impact on the 64-bit integer and/or double-precision performance of Llano. The only way to find out would be to have a deep look inside the calculation to find out what instructions are actually performed and measure them. This could be a great find, so hell yeah... if anybody sends me the mainboard I would go for it (I have a 3870K btw but no board). :)

You're in the EU, right? Shipping shouldn't be too painful and I have 3 other team members working on the stage, so no issue for the comp. PM me your address and I can send you my UD4H.


3 minutes ago, mickulty said:

You're in the EU, right? Shipping shouldn't be too painful and I have 3 other team members working on the stage, so no issue for the comp. PM me your address and I can send you my UD4H.

You got a PM. Let's get to the bottom of this.


11 minutes ago, _mat_ said:

SSE2 and SSE3 are the most important extensions for GPUPI using OpenCL on CPUs. SSE4a (also supported by Llano), SSE4.1 and SSE4.2 don't add anything particular interesting for the calculation, so that shouldn't make any difference in comparison to Ivy Bridge.

Well, in that case how do you explain the huge performance advantage of 45nm Core2 (SSE 4.1) over 65nm Core2 (SSSE3)? For example http://hwbot.org/submission/3678638_havli_gpupi_for_cpu___100m_core_2_duo_e8300_1min_53sec_951ms

and http://hwbot.org/submission/3408835_kintaro_gpupi_for_cpu___100m_core_2_duo_e6750_2min_16sec_234ms/

Or AMD 15h or 16h (AVX, SSE 4.2) being much faster than K10 (SSE3)? In GPUPI K10 is a lot slower, while in older benchmarks (Cinebench, for instance) it is the other way around. http://hwbot.org/submission/3691480_havli_gpupi_for_cpu___100m_a10_7800_1min_8sec_764ms

and http://hwbot.org/submission/3554886_noms_gpupi_for_cpu___100m_phenom_ii_x4_965_be_1min_28sec_609ms
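For a rough sense of scale, the times embedded in the four submission links above can be turned into ratios. This is only a sketch: the helper name `slug_seconds` is made up, the slugs are the last path component of each linked URL, and the clock speeds of the submissions are not stated in the thread, so the ratios are not normalised per clock.

```python
import re


def slug_seconds(slug):
    """Parse the '<M>min_<S>sec_<ms>ms' part of an HWBOT submission slug."""
    match = re.search(r"(?:(\d+)min_)?(\d+)sec_(\d+)ms", slug)
    if match is None:
        raise ValueError(f"no time found in {slug!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + int(match.group(2)) + int(match.group(3)) / 1000


# Slugs copied from the links in the post above.
E8300 = "3678638_havli_gpupi_for_cpu___100m_core_2_duo_e8300_1min_53sec_951ms"
E6750 = "3408835_kintaro_gpupi_for_cpu___100m_core_2_duo_e6750_2min_16sec_234ms"
A10_7800 = "3691480_havli_gpupi_for_cpu___100m_a10_7800_1min_8sec_764ms"
PHENOM_965 = "3554886_noms_gpupi_for_cpu___100m_phenom_ii_x4_965_be_1min_28sec_609ms"

core2_ratio = slug_seconds(E6750) / slug_seconds(E8300)
amd_ratio = slug_seconds(PHENOM_965) / slug_seconds(A10_7800)

if __name__ == "__main__":
    print(f"E6750 takes {core2_ratio:.2f}x as long as E8300")
    print(f"Phenom II X4 965 takes {amd_ratio:.2f}x as long as A10-7800")
```

With these four results the 65nm Core2 run takes roughly 1.20x as long as the 45nm one, and the Phenom II run roughly 1.29x as long as the A10-7800 run.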


Just now, havli said:

Well, in that case how do you explain the huge performance advantage of 45nm Core2 (SSE 4.1) over 65nm Core2 (SSSE3)? For example http://hwbot.org/submission/3678638_havli_gpupi_for_cpu___100m_core_2_duo_e8300_1min_53sec_951ms

and http://hwbot.org/submission/3408835_kintaro_gpupi_for_cpu___100m_core_2_duo_e6750_2min_16sec_234ms/

Or AMD 15h or 16h (AVX, SSE 4.2) being much faster than K10 (SSE3)? In GPUPI K10 is a lot slower, while in older benchmarks (Cinebench, for instance) it is the other way around. http://hwbot.org/submission/3691480_havli_gpupi_for_cpu___100m_a10_7800_1min_8sec_764ms

and http://hwbot.org/submission/3554886_noms_gpupi_for_cpu___100m_phenom_ii_x4_965_be_1min_28sec_609ms

I don't think it's SSE4; it's probably SSSE3, which is also the reason that you can't run the Time Spy CPU test on K10 CPUs and why 775 destroys K10 in a lot of benches.


The double-to-integer conversion instructions might be used, but it's hard to say. There could also be a specific OpenCL code path. I don't like that kind of code magic; that's why the new native path is coded by hand. And even then, whether the CPU can be fully used or bottlenecks occur depends on many different factors.

Edited by _mat_

1 hour ago, _mat_ said:

The double-to-integer conversion instructions might be used, but it's hard to say. There could also be a specific OpenCL code path. I don't like that kind of code magic; that's why the new native path is coded by hand. And even then, whether the CPU can be fully used or bottlenecks occur depends on many different factors.

This got me thinking, so I tested 2.3.4: it's even faster than 3.1 or 3.2. I got 1m 18s in both 3.1 and 3.2, and 1m 13s in 2.3.4. Finally found a stopwatch, so I should have a video up shortly.

Edit: Just tested on BIOS F5 (newer, slower AGESA) and both 3.2 and 3.1 are about a second faster than 2.3.4. I still need to test on F8a (even newer AGESA).

On F7 (new CPU microcode, not AGESA) 2.3.4 is once again about 5-6 seconds faster, and 3.1/3.2 score identically to how they did on F5.

On F8a 2.3.4 is only about 2 seconds faster than 3.2/3.1, so the new AGESA once again changes the balance.

Edited by yosarianilives

Ok, this discussion has gone on for quite some time now. Although I appreciate @_mat_'s technical discussion and @mickulty's help in getting him a board, finding the exact root cause of these speed differences will only help for future versions of GPUPI, not for this competition.

Therefore the more urgent need in my mind is for a final moderation decision for this stage. There are 3 possibilities, some more realistic than others:

  1. Allow only slow AGESA versions in the competition
  2. Allow all AGESA versions in the competition
  3. Scrap the stage

Since people have already bought hardware specifically for this stage, I don't think 3) is a good option. That leaves 1) and 2). 1) has a higher moderation overhead, since mod(s) will have to look at the motherboard tab of every submission in this stage. 2) has a lower moderation overhead and the added benefit of making in-competition submissions competitive with submissions outside the competition. If 1) is picked, we could have a submission on air 1 day after the competition that is faster than an LN2 submission inside the competition.

Having someone with a F1A75-V PRO or F1A75-M LE flash back to the oldest BIOS and test it would help here as it would give us enough confidence that this is not a Gigabyte-specific thing (however unlikely that may be).

TBH after that I'd suggest moving this whole discussion into the GPUPI development thread... there are 7 more stages to talk about in the DDR3 subcategory!


Leeghoofd has tested old BIOS versions on an ASUS board too, and the performance boost can be replicated there as well.

It will be fun to get behind this but you are right. Leeghoofd should decide what's best for the competition and I will start a new thread as soon as I get the board to report my findings.


13 minutes ago, havli said:

Did someone try to measure power consumption during these slow / fast runs? It could give some hint - if the extra work is being done, then the CPU also should draw more power.

Maybe not ideal testing conditions, as I am only on air at the moment, but the slow AGESA fluctuated between 104 and 105 W at the wall throughout the run, while the faster one was at 108 - 109 W. For reference, idle power draw is 43 W.
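As a quick worked calculation of what those readings imply, the sketch below takes the midpoint of each reported range and subtracts idle draw. The variable names are made up for illustration, and wall power includes PSU losses and the rest of the system, so this is only a rough indication of extra CPU activity.

```python
def midpoint(low, high):
    """Midpoint of a reported power range in watts."""
    return (low + high) / 2


IDLE_W = 43.0
slow_w = midpoint(104, 105)  # slow AGESA: 104.5 W at the wall
fast_w = midpoint(108, 109)  # fast AGESA: 108.5 W at the wall

# Power attributable to the benchmark load, i.e. draw above idle.
slow_load_w = slow_w - IDLE_W  # 61.5 W
fast_load_w = fast_w - IDLE_W  # 65.5 W

relative_increase = fast_load_w / slow_load_w - 1

if __name__ == "__main__":
    print(f"load above idle: slow {slow_load_w:.1f} W, fast {fast_load_w:.1f} W")
    print(f"fast AGESA draws about {relative_increase:.1%} more under load")
```

On these numbers the fast AGESA draws about 4 W, or roughly 6.5%, more above idle than the slow one.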

Edited by unityofsaints

3 hours ago, yosarianilives said:

In the database they are listed as Grenada even though they're Hawaii. As hardware would need to be purchased, I'd like to verify beforehand.

Yeah, true, GPU-Z identifies the R9 390X as Hawaii though. I consider them to be Hawaii.

2 hours ago, ozzie said:

280 is Tahiti, not Hawaii; 285 is Tonga; 290 is Hawaii.

That's what I said, didn't I?
