The Official Team CUP 2018 DDR3 stage thread:

havli · September 3, 2018

All available evidence speaks against some magic OCL performance boost with this specific AGESA. All other K10 CPUs perform more or less the same. Llano with all boards / BIOS versions except this one specific combination performs the same. If there were such a performance gain on Llano (even at the price of instability), then it would be publicly known.

There is no such change in Llano architecture that would allow such performance boost when compared to all other K10. From the other GPUPI results it seems AMD OCL driver benefits greatly from SSE 4.1... and to some extent even from SSSE3. This is the reason K10 is slow in this benchmark compared to Core2, Nehalem or even 15h based processors. K10 lacks these instructions.

I'm sorry but this really sounds like a bug - either in the OCL driver or the benchmark itself or maybe something else entirely. It is not a random thing, as it can be reproduced... after all, not so long ago there was a similar problem with GPUPI on dual socket 1366 machines which also seemed to be much faster than common sense would suggest... and as it turned out, it was a bug.

yosarianilives · September 3, 2018

1 hour ago, havli said:

All available evidence speaks against some magic OCL performance boost with this specific AGESA. All other K10 CPUs perform more or less the same. Llano with all boards / BIOS versions except this one specific combination performs the same. If there were such a performance gain on Llano (even at the price of instability), then it would be publicly known.

There is no such change in Llano architecture that would allow such performance boost when compared to all other K10. From the other GPUPI results it seems AMD OCL driver benefits greatly from SSE 4.1... and to some extent even from SSSE3. This is the reason K10 is slow in this benchmark compared to Core2, Nehalem or even 15h based processors. K10 lacks these instructions.

I'm sorry but this really sounds like a bug - either in the OCL driver or the benchmark itself or maybe something else entirely. It is not a random thing, as it can be reproduced... after all, not so long ago there was a similar problem with GPUPI on dual socket 1366 machines which also seemed to be much faster than common sense would suggest... and as it turned out, it was a bug.

The sr2 bug was a timekeeping bug. It is completely different from that bug as the work is getting done exactly as fast as it says it is. So the only way that it could be a bug is if the amount of work getting done was different. I will upload a video with stopwatch soon, currently doing my own testing and seeing the exact same thing. It might be a bug, however it would be absolutely stupid to call it a time keeping bug as it's clearly not. Comparing it to a timekeeping bug is comparing apples to oranges. So since the only way it could be a bug is if it's affecting the workload I ask again, how do you propose that the agesa is altering the workload?

Edited September 3, 2018 by yosarianilives

Mr.Scott · September 3, 2018

Since nobody knows exactly what the problem is, you do quite a bit of speculating.

The only fact is, there IS a problem. A reproducible one at that. That in itself is enough to toss the subs, and possibly the bench further down the road.

Mr.Scott · September 3, 2018

Laugh all you like. A bug is a bug is a bug.

It will never be allowed.

_mat_ · September 3, 2018

The EVGA SR-2 issue was a timer bug with RTC on this specific mainboard (as far as we currently know). It was not a GPUPI-related issue, as many other benchmarks rely on RTC as well. CINEBENCH, Aquamark and various older versions of 3DMark rely on the exact same Windows API function (timeGetTime), which is in itself the same as GetTickCount (i.e. SuperPI)... so yeah, all these benchmarks are still skewed/bugged on the SR-2.
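
A stopwatch check of the kind discussed in this thread can also be approximated in software by timing the same interval with several clocks and comparing the readings. What follows is only a minimal Python sketch of that idea, not how GPUPI or any of the benchmarks named here measure time. The names `time_with_all_clocks` and `clocks_agree` and the 2% tolerance are made up for illustration, and on a given platform some of these clocks may share the same hardware source.

```python
import time

# Standard-library clocks. Depending on the OS they may or may not be
# backed by different hardware timers (assumption: at least one differs).
CLOCKS = {
    "perf_counter": time.perf_counter,
    "monotonic": time.monotonic,
    "wall": time.time,
}


def time_with_all_clocks(workload):
    """Run workload() once and return the elapsed seconds seen by each clock."""
    start = {name: clock() for name, clock in CLOCKS.items()}
    workload()
    return {name: clock() - start[name] for name, clock in CLOCKS.items()}


def clocks_agree(elapsed, tolerance=0.02):
    """True if all readings lie within `tolerance` (relative) of the largest."""
    values = list(elapsed.values())
    longest = max(values)
    if longest <= 0:
        return False
    return (longest - min(values)) / longest <= tolerance


if __name__ == "__main__":
    # Stand-in workload; a real check would wrap the benchmark run itself.
    readings = time_with_all_clocks(lambda: sum(i * i for i in range(2_000_000)))
    verdict = "consistent" if clocks_agree(readings) else "clocks disagree"
    print(readings, verdict)
```

If one clock runs skewed, its reading drifts away from the others over a long enough run, which is the same thing an external stopwatch would reveal.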

SSE2 and SSE3 are the most important extensions for GPUPI using OpenCL on CPUs. SSE4a (also supported by Llano), SSE4.1 and SSE4.2 don't add anything particularly interesting for the calculation, so that shouldn't make any difference in comparison to Ivy Bridge.

As for the bugged output, it's simply impossible. Really. There is no way to get to the result without calculating all partial results and accumulating them precisely. The hexadecimal digits next to the result are not an additional checksum, these digits ARE the result and therefore validate that the calculation was 100% successful beyond any doubt (unless you cheat).
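
To illustrate why the printed digits double as their own validation, here is a minimal Python sketch of hexadecimal digit extraction with the Bailey-Borwein-Plouffe (BBP) formula, which GPUPI is described as using. It is an illustration only: it is not GPUPI's kernel code, it uses plain double precision (so only about eight digits per call are trustworthy), and the function names are invented for this example.

```python
def _bbp_series(j, n):
    """Fractional part of sum over k of 16^(n-k) / (8k + j)."""
    total = 0.0
    # Terms with k <= n: modular exponentiation keeps only the fraction.
    for k in range(n + 1):
        denom = 8 * k + j
        total = (total + pow(16, n - k, denom) / denom) % 1.0
    # Terms with k > n shrink by 16x each step; stop once they are negligible.
    k = n + 1
    while True:
        term = 16.0 ** (n - k) / (8 * k + j)
        if term < 1e-17:
            break
        total += term
        k += 1
    return total % 1.0


def pi_hex_digits(n, count=8):
    """Hex digits of pi after the point, skipping the first n digits."""
    frac = (4 * _bbp_series(1, n) - 2 * _bbp_series(4, n)
            - _bbp_series(5, n) - _bbp_series(6, n)) % 1.0
    digits = []
    for _ in range(count):
        frac *= 16
        digit = int(frac)
        digits.append("0123456789ABCDEF"[digit])
        frac -= digit
    return "".join(digits)


if __name__ == "__main__":
    print(pi_hex_digits(0))  # pi = 3.243F6A88... in hexadecimal
```

Every term of every series feeds into the final fraction, so skipping or corrupting part of the summation changes the digits that come out.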

From a technical standpoint it is not necessary for the moderation to intervene here. No rule was broken, the timer works, no cheating happened; this is simply a hardware/software combination that is running faster. We don't know what happened in between these BIOS versions and it's quite possible that a patched errata had a huge performance impact on the 64 bit integer and/or double precision performance of Llano. The only way to find out would be to have a deep look inside the calculation to find out what instructions are actually performed and measure them. This could be a great find, so hell yeah .. if anybody sends me the mainboard I would go for it (I have a 3870K btw but no board).

yosarianilives · September 3, 2018

So one annoyance I've found with f4 on the ud4h is that it apparently doesn't have an unlocked multiplier on my 3870k. Not that it matters: 3 ghz on f4 obliterates 4 ghz on f8a, which does have multiplier control in bios. Will upload a video as soon as I can find a stopwatch that isn't my phone.

mickulty · September 3, 2018

1 minute ago, _mat_ said:

We don't know what happened in between these BIOS versions and it's quite possible that a patched errata had a huge performance impact on the 64 bit integer and/or double precision performance of Llano. The only way to find out would be to have a deep look inside the calculation to find out what instructions are actually performed and measure them. This could be a great find, so hell yeah .. if anybody sends me the mainboard I would go for it (I have a 3870K btw but no board).

You're in the EU right? Shipping shouldn't be too painful and I have 3 other team members working on the stage so no issue for the comp, PM me your address and I can send you my UD4H.

_mat_ · September 3, 2018

3 minutes ago, mickulty said:

You're in the EU right? Shipping shouldn't be too painful and I have 3 other team members working on the stage so no issue for the comp, PM me your address and I can send you my UD4H.

You got a PM. Let's get to the bottom of this.

havli · September 3, 2018

11 minutes ago, _mat_ said:

SSE2 and SSE3 are the most important extensions for GPUPI using OpenCL on CPUs. SSE4a (also supported by Llano), SSE4.1 and SSE4.2 don't add anything particularly interesting for the calculation, so that shouldn't make any difference in comparison to Ivy Bridge.

Well, in that case how do you explain huge performance advantage of 45nm Core2 (SSE 4.1) over 65nm Core2 (SSSE3) ? For example http://hwbot.org/submission/3678638_havli_gpupi_for_cpu___100m_core_2_duo_e8300_1min_53sec_951ms

and http://hwbot.org/submission/3408835_kintaro_gpupi_for_cpu___100m_core_2_duo_e6750_2min_16sec_234ms/

Or AMD 15h or 16h (AVX, SSE 4.2) being much faster than K10 (SSE3)? In GPUPI K10 is a lot slower, while in older benchmarks (Cinebench, for instance) it is the other way around. http://hwbot.org/submission/3691480_havli_gpupi_for_cpu___100m_a10_7800_1min_8sec_764ms

and http://hwbot.org/submission/3554886_noms_gpupi_for_cpu___100m_phenom_ii_x4_965_be_1min_28sec_609ms

yosarianilives · September 3, 2018

Just now, havli said:

Well, in that case how do you explain huge performance advantage of 45nm Core2 (SSE 4.1) over 65nm Core2 (SSSE3) ? For example http://hwbot.org/submission/3678638_havli_gpupi_for_cpu___100m_core_2_duo_e8300_1min_53sec_951ms

and http://hwbot.org/submission/3408835_kintaro_gpupi_for_cpu___100m_core_2_duo_e6750_2min_16sec_234ms/

Or AMD 15h or 16h (AVX, SSE 4.2) being much faster than K10 (SSE3)? In GPUPI K10 is a lot slower, while in older benchmarks (Cinebench, for instance) it is the other way around. http://hwbot.org/submission/3691480_havli_gpupi_for_cpu___100m_a10_7800_1min_8sec_764ms

and http://hwbot.org/submission/3554886_noms_gpupi_for_cpu___100m_phenom_ii_x4_965_be_1min_28sec_609ms

I don't think it's sse4, it's probably ssse3 which is also the reason that you can't run timespy cpu test on k10 cpus and also why 775 destroys k10 for a lot of benches.

havli · September 3, 2018

This is why I picked 45 vs 65nm Core2, these differ only in SSSE3 / SSE 4.1 (we can disregard extra cache here). There must be some use of SSE 4.1, otherwise 4.6 GHz Conroe would always beat 4.1 GHz Wolfdale.
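
The per-clock argument can be made concrete with a rough back-of-the-envelope calculation. The Python sketch below takes the run times from the two Core2 submission URLs linked earlier and ASSUMES they were run at the 4.1 GHz (Wolfdale) and 4.6 GHz (Conroe) clocks mentioned here; that pairing is not confirmed by the links themselves, so treat the numbers as illustrative only. `total_gigacycles` is a name made up for this example.

```python
def total_gigacycles(minutes, seconds, clock_ghz):
    """Clock cycles (in billions) spent on one run: run time x clock speed."""
    return (minutes * 60 + seconds) * clock_ghz


# Times come from the submission URLs; the clocks are an assumption.
wolfdale = total_gigacycles(1, 53.951, 4.1)  # E8300, 45nm, SSE 4.1
conroe = total_gigacycles(2, 16.234, 4.6)    # E6750, 65nm, SSSE3

ratio = conroe / wolfdale
print(f"Wolfdale: {wolfdale:.1f} Gcycles, Conroe: {conroe:.1f} Gcycles")
print(f"Conroe needs about {ratio:.2f}x the cycles for the same 100M run")
```

Under those assumed clocks the 65nm chip would need roughly a third more cycles for the same work, which is a larger gap than a clock speed difference alone could hide.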

_mat_ · September 3, 2018

The double to integer conversion instructions might be used, but it's hard to say. There could also be a specific OpenCL code path. I don't like that kind of code magic, that's why the new native path is coded by hand. And even then it depends on many different factors whether the CPU can be fully used or any bottlenecks occur.

Edited September 3, 2018 by _mat_

yosarianilives · September 3, 2018

1 hour ago, _mat_ said:

The double to integer conversion instructions might be used, but it's hard to say. There could also be a specific OpenCL code path. I don't like that kind of code magic, that's why the new native path is coded by hand. And even then it depends on many different factors whether the CPU can be fully used or any bottlenecks occur.

This got me thinking and so I tested 2.3.4, it's even faster than 3.1 or 3.2. Got 1m 18s in both 3.1 and 3.2, got 1m 13s in 2.3.4. Finally found a stopwatch so should have a video up shortly.

Edit: So just tested on bios f5 (newer, slower agesa) and both 3.2 and 3.1 are about a second faster than 2.3.4; still need to test on f8a (even newer agesa)

on f7 (new cpu microcode, not agesa) 2.3.4 is once again about 5-6 seconds faster and 3.1/3.2 score identically as they did in f5.

in f8a 2.3.4 is only about 2 seconds faster than 3.2/3.1 so the new agesa once again changes the balance.

Edited September 3, 2018 by yosarianilives

yosarianilives · September 3, 2018

Finally got it uploaded, this is with bios f4, it should clear up any confusion about whether this is a timekeeping bug or something else. https://www.youtube.com/watch?v=A5JgJpcPoek

unityofsaints · September 3, 2018

Ok, this discussion has gone on for quite some time now. Although I appreciate @_mat_'s technical discussion and @mickulty's help in getting him a board, finding the exact root cause of these speed differences will only help for future versions of GPUpi, not for this competition.

Therefore the more urgent need in my mind is for a final moderation decision for this stage. There are 3 possibilities, some more realistic than others:

1) Allow only slow AGESA versions in the competition
2) Allow all AGESA versions in the competition
3) Scrap the stage

Since people have already bought hardware specifically for this stage, I don't think 3) is a good option. That leaves 1) and 2). 1) has a higher moderation overhead, since mod(s) will have to look at the motherboard tab of every submission in this stage. 2) has a lower moderation overhead and the added benefit of making in-competition submissions competitive with submissions outside the competition. If 1) is picked, we could have a submission on air 1 day after the competition that is faster than an LN2 submission inside the competition.

Having someone with a F1A75-V PRO or F1A75-M LE flash back to the oldest BIOS and test it would help here as it would give us enough confidence that this is not a Gigabyte-specific thing (however unlikely that may be).

TBH after that I'd suggest moving this whole discussion into the GPUpi development thread.. there are 7 more stages to talk about in the DDR3 subcategory!

_mat_ · September 4, 2018

Leeghoofd has tested old BIOS versions on an ASUS board too, and the performance boost can be replicated there as well.

It will be fun to get behind this but you are right. Leeghoofd should decide what's best for the competition and I will start a new thread as soon as I get the board to report my findings.

havli · September 4, 2018

Did someone try to measure power consumption during these slow / fast runs? It could give some hint - if the extra work is being done, then the CPU also should draw more power.

yosarianilives · September 4, 2018

4 minutes ago, havli said:

Did someone try to measure power consumption during these slow / fast runs? It could give some hint - if the extra work is being done, then the CPU also should draw more power.

Tonight when I get home from work I can do some testing if I find my watt meter

unityofsaints · September 4, 2018

13 minutes ago, havli said:

Did someone try to measure power consumption during these slow / fast runs? It could give some hint - if the extra work is being done, then the CPU also should draw more power.

Maybe not ideal testing conditions as I am only on air at the moment but the slow AGESA fluctuated 104 - 105W at the wall throughout the run while the faster one was 108 - 109W. For reference, idle power draw is 43W.
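
For what it's worth, the readings above can be turned into a rough load-over-idle comparison. This is only a quick Python sketch using the midpoints of the quoted ranges; wall power includes PSU losses and the rest of the system, so it is an indication rather than a CPU power measurement, and `load_delta` is a name made up for this example.

```python
IDLE_W = 43.0  # idle draw at the wall, as quoted above


def load_delta(low_w, high_w, idle_w=IDLE_W):
    """Midpoint of the observed range minus idle: extra power under load."""
    return (low_w + high_w) / 2 - idle_w


slow = load_delta(104, 105)  # slow AGESA
fast = load_delta(108, 109)  # fast AGESA

extra_percent = (fast / slow - 1) * 100
print(f"slow: {slow:.1f} W over idle, fast: {fast:.1f} W over idle")
print(f"fast AGESA draws about {extra_percent:.1f}% more above idle")
```

So the faster AGESA pulls roughly 4 W more at the wall, a single-digit percentage increase over the slow one's load delta.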

Edited September 4, 2018 by unityofsaints

yosarianilives · September 10, 2018

For stage 7 does R9 390/x count as Hawaii?

cbjaust · September 10, 2018

3 hours ago, yosarianilives said:

For stage 7 does R9 390/x count as Hawaii?

Of course, why would they not? R9 280X is Tahiti as well.

yosarianilives · September 10, 2018

27 minutes ago, cbjaust said:

Of course, why would they not? R9 280X is Tahiti as well.

In the database they are listed as Grenada even though they're Hawaii. As there would need to be hw purchased I'd like to verify beforehand.

ozzie · September 10, 2018

280 is Tahiti, not Hawaii; 285 is Tonga; 290 is Hawaii.

cbjaust · September 10, 2018

3 hours ago, yosarianilives said:

In the database they are listed as Grenada even though they're Hawaii. As there would need to be hw purchased I'd like to verify beforehand.

Yeah true, GPU-z identifies R9 390X as Hawaii though. I consider them to be Hawaii.

2 hours ago, ozzie said:

280 is Tahiti, not Hawaii; 285 is Tonga; 290 is Hawaii.

That's what I said, didn't I?

mickulty · September 10, 2018

To be fair to yos, if Zosma isn't Thuban it's fair to wanna be sure that Grenada is Hawaii before buying a card.
