HWBOT Community Forums

Posted

Right now, there are conflicting reports that this first line of Skylake X processors (based on the 10-core Skylake Purley LCC die) will not have full-throughput AVX512.

If this is true, the current Skylake X processors will only be able to run AVX512 at half the speed of the server Xeons. In other words, no better than AVX2.

 

I want to definitively answer this question - both for myself and for anyone else looking to purchase a Skylake X processor for the purpose of AVX512.

Using the same FLOPs benchmark that discovered the Ryzen FMA bug, we should be able to find out if Skylake X has full-throughput, or half-throughput AVX512.

 

So my request is for someone who has a Skylake X sample* to:

  1. Run the "2017-SkylakePurley" binary here: https://github.com/Mysticial/Flops/tree/master/version3/binaries-windows**
  2. Do it at a fixed CPU frequency (to avoid the effects of Turbo Boost).
  3. Do it with HT enabled.
  4. Don't use an extreme overclock. If the chip has full-throughput AVX512, then those AVX512 instructions may produce more heat than any other benchmark you've ever run.
  5. Do it with a fully updated Windows 10 or a recent version of Linux (like Ubuntu 17.04). This is needed to ensure that the OS has support for AVX512.
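
For the Linux case in step 5, a quick sanity check before running the benchmark is to look for the `avx512f` flag in `/proc/cpuinfo`. This is only a rough sketch: the helper name `has_cpu_flag` is made up for illustration, and a listed flag is a good sign rather than a guarantee that the benchmark will run.

```python
from pathlib import Path


def has_cpu_flag(cpuinfo_text: str, flag: str = "avx512f") -> bool:
    """Return True if `flag` is listed on any 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # On x86 Linux the feature list lives on lines whose key is "flags".
        if key.strip() == "flags" and flag in value.split():
            return True
    return False


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")  # Linux only; this file does not exist on Windows
    if path.exists():
        print("avx512f reported:", has_cpu_flag(path.read_text()))
    else:
        print("/proc/cpuinfo not found; this check only works on Linux")
```

On Windows there is no equivalent file, so the simplest test there is still to run the binary and see whether it crashes.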

*I may be wrong, but I don't believe Skylake X benchmarks are under NDA anymore since there's already a gazillion HWBOT submissions and you can get access to the server variants on Google Cloud.

 

**The source code is also in that GitHub repo if you want to build it yourself. But be aware that you need the Intel Compiler if you want to build the AVX512 binaries for Windows.

 

----------------

 

When you run the benchmark, I expect one of 3 things to happen:

  1. The binary crashes: This means that Windows 10 does not have support for AVX512 and we'll need to wait for that support.
  2. The numbers for 512-bit AVX are about the same as the 256-bit AVX: This means that the processor only supports half-throughput AVX512.
  3. The numbers for the 512-bit AVX are about 2x that of the 256-bit AVX: This means that the processor supports full-throughput AVX512.
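
Outcomes 2 and 3 boil down to comparing one 512-bit result against the matching 256-bit result. A minimal sketch of that comparison is below. The function name and the 0.25 tolerance are assumptions picked for illustration, not something the benchmark defines.

```python
def classify_avx512(gflops_256: float, gflops_512: float, tolerance: float = 0.25) -> str:
    """Classify AVX512 throughput from the 512-bit / 256-bit GFlops ratio.

    `tolerance` is an arbitrary allowance for run-to-run noise (an assumption).
    """
    ratio = gflops_512 / gflops_256
    if abs(ratio - 2.0) <= tolerance:
        return "full-throughput"   # 512-bit is about 2x the 256-bit number
    if abs(ratio - 1.0) <= tolerance:
        return "half-throughput"   # 512-bit is about the same as 256-bit
    return "inconclusive"          # e.g. throttling or a changing clock speed


# Double-precision FMA numbers (256-bit, then 512-bit) from a 1-thread run:
print(classify_avx512(31.872, 63.744))
```

Compare like with like: same precision, same operation, same thread count. An "inconclusive" result most likely means the clock speed was not held fixed.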

 

Here is what the benchmark looks like for a 32-core Skylake Purley system on Google Cloud running at 2.0 GHz with 2.5 GHz turbo:

 

Running Skylake Purley tuned binary with 1 thread...

Single-Precision - 128-bit AVX - Add/Sub
   GFlops = 15.904
   Result = 2.02376e+06

Double-Precision - 128-bit AVX - Add/Sub
   GFlops = 7.952
   Result = 1.00995e+06

Single-Precision - 128-bit AVX - Multiply
   GFlops = 15.936
   Result = 2.03498e+06

Double-Precision - 128-bit AVX - Multiply
   GFlops = 7.968
   Result = 1.00712e+06

Single-Precision - 128-bit AVX - Multiply + Add
   GFlops = 15.936
   Result = 1.69085e+06

Double-Precision - 128-bit AVX - Multiply + Add
   GFlops = 7.968
   Result = 841756

Single-Precision - 128-bit FMA3 - Fused Multiply Add
   GFlops = 31.872
   Result = 2.02868e+06

Double-Precision - 128-bit FMA3 - Fused Multiply Add
   GFlops = 15.936
   Result = 1.01782e+06

Single-Precision - 256-bit AVX - Add/Sub
   GFlops = 31.808
   Result = 4.06688e+06

Double-Precision - 256-bit AVX - Add/Sub
   GFlops = 15.936
   Result = 2.02901e+06

Single-Precision - 256-bit AVX - Multiply
   GFlops = 31.872
   Result = 4.06158e+06

Double-Precision - 256-bit AVX - Multiply
   GFlops = 15.936
   Result = 2.02013e+06

Single-Precision - 256-bit AVX - Multiply + Add
   GFlops = 31.872
   Result = 3.34696e+06

Double-Precision - 256-bit AVX - Multiply + Add
   GFlops = 15.936
   Result = 1.70441e+06

Single-Precision - 256-bit FMA3 - Fused Multiply Add
   GFlops = 63.744
   Result = 4.0399e+06

Double-Precision - 256-bit FMA3 - Fused Multiply Add
   GFlops = 31.872
   Result = 2.00801e+06

Single-Precision - 512-bit AVX512 - Add/Sub
   GFlops = 63.744
   Result = 8.11456e+06

Double-Precision - 512-bit AVX512 - Add/Sub
   GFlops = 31.872
   Result = 4.03949e+06

Single-Precision - 512-bit AVX512 - Multiply
   GFlops = 63.36
   Result = 8.0743e+06

Double-Precision - 512-bit AVX512 - Multiply
   GFlops = 31.872
   Result = 4.05014e+06

Single-Precision - 512-bit AVX512 - Multiply + Add
   GFlops = 63.744
   Result = 6.68723e+06

Double-Precision - 512-bit AVX512 - Multiply + Add
   GFlops = 31.872
   Result = 3.3739e+06

Single-Precision - 512-bit AVX512 - Fused Multiply Add
   GFlops = 127.488
   Result = 8.22848e+06

Double-Precision - 512-bit AVX512 - Fused Multiply Add
   GFlops = 63.744
   Result = 4.03805e+06


Running Skylake Purley tuned binary with 64 thread(s)...

Single-Precision - 128-bit AVX - Add/Sub
   GFlops = 683.36
   Result = 8.68179e+07

Double-Precision - 128-bit AVX - Add/Sub
   GFlops = 263.568
   Result = 3.35065e+07

Single-Precision - 128-bit AVX - Multiply
   GFlops = 527.616
   Result = 6.69453e+07

Double-Precision - 128-bit AVX - Multiply
   GFlops = 263.88
   Result = 3.34619e+07

Single-Precision - 128-bit AVX - Multiply + Add
   GFlops = 527.136
   Result = 5.58561e+07

Double-Precision - 128-bit AVX - Multiply + Add
   GFlops = 263.64
   Result = 2.79832e+07

Single-Precision - 128-bit FMA3 - Fused Multiply Add
   GFlops = 1056.77
   Result = 6.71142e+07

Double-Precision - 128-bit FMA3 - Fused Multiply Add
   GFlops = 528.336
   Result = 3.36188e+07

Single-Precision - 256-bit AVX - Add/Sub
   GFlops = 1054.14
   Result = 1.34076e+08

Double-Precision - 256-bit AVX - Add/Sub
   GFlops = 527.52
   Result = 6.68866e+07

Single-Precision - 256-bit AVX - Multiply
   GFlops = 1056.77
   Result = 1.34416e+08

Double-Precision - 256-bit AVX - Multiply
   GFlops = 527.664
   Result = 6.70251e+07

Single-Precision - 256-bit AVX - Multiply + Add
   GFlops = 1055.33
   Result = 1.12018e+08

Double-Precision - 256-bit AVX - Multiply + Add
   GFlops = 527.52
   Result = 5.59086e+07

Single-Precision - 256-bit FMA3 - Fused Multiply Add
   GFlops = 2110.08
   Result = 1.34046e+08

Double-Precision - 256-bit FMA3 - Fused Multiply Add
   GFlops = 1055.33
   Result = 6.69451e+07

Single-Precision - 512-bit AVX512 - Add/Sub
   GFlops = 2112.26
   Result = 2.68216e+08

Double-Precision - 512-bit AVX512 - Add/Sub
   GFlops = 1056
   Result = 1.34131e+08

Single-Precision - 512-bit AVX512 - Multiply
   GFlops = 2117.38
   Result = 2.69031e+08

Double-Precision - 512-bit AVX512 - Multiply
   GFlops = 1059.26
   Result = 1.34601e+08

Single-Precision - 512-bit AVX512 - Multiply + Add
   GFlops = 2118.14
   Result = 2.24393e+08

Double-Precision - 512-bit AVX512 - Multiply + Add
   GFlops = 1058.5
   Result = 1.12102e+08

Single-Precision - 512-bit AVX512 - Fused Multiply Add
   GFlops = 4242.43
   Result = 2.69409e+08

Double-Precision - 512-bit AVX512 - Fused Multiply Add
   GFlops = 2115.07
   Result = 1.34365e+08

 

This Skylake Purley system has full-throughput AVX512.
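
If you want to check the ratios from a saved log instead of by eye, the output format above is simple enough to parse. The sketch below assumes the log looks exactly like the listing (a title line containing " - ", followed by a "GFlops = ..." line). The function name `parse_flops_output` is made up for illustration.

```python
import re


def parse_flops_output(text: str) -> dict:
    """Map each test title to its GFlops value."""
    results = {}
    title = None
    for raw in text.splitlines():
        line = raw.strip()
        match = re.fullmatch(r"GFlops = ([0-9.eE+-]+)", line)
        if match and title is not None:
            results[title] = float(match.group(1))
        elif " - " in line:
            # Title lines look like "Double-Precision - 512-bit AVX512 - Add/Sub".
            title = line
    return results


SAMPLE = """\
Double-Precision - 256-bit FMA3 - Fused Multiply Add
   GFlops = 31.872
   Result = 2.00801e+06

Double-Precision - 512-bit AVX512 - Fused Multiply Add
   GFlops = 63.744
   Result = 4.03805e+06
"""

parsed = parse_flops_output(SAMPLE)
ratio = (parsed["Double-Precision - 512-bit AVX512 - Fused Multiply Add"]
         / parsed["Double-Precision - 256-bit FMA3 - Fused Multiply Add"])
print(f"512-bit / 256-bit FMA ratio: {ratio:.2f}")
```

The 1-thread and multi-thread sections reuse the same titles, so feed each section to the parser separately or the later one will overwrite the earlier one.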

  • 2 weeks later...
Posted (edited)
[Attachment: screenshot]

 

Windows 10 1703 with Intel C++ redists installed.

 

Thank you!

 

This is interesting though. The compiler seems to be trying to enforce that the computer has RDSEED instructions. But RDSEED was already available starting from Broadwell. I don't see why it would be missing from Skylake X unless it was explicitly disabled in the BIOS or something.

 

This might be a problem moving forward since the compiler forces these checks even though most programs won't use them anyway.

 

EDIT:

 

Is virtualization disabled in the BIOS? I'm reading around and it seems that some machines have all the crypto instructions disabled (AES-NI, RDRAND, and RDSEED) and it may be related to virtualization.

Edited by Mysticial
Posted

I found a way to disable that check by the compiler and I've updated the binaries.

 

So if anyone is willing to try now, it should (hopefully) work regardless of whether RDSEED is enabled or not.

 

Thanks.

Posted
It was, but still get the same after enabling it.

Posted
Would you be able to try with the latest binaries? I updated them last night.

 

As far as I can tell, I've removed the check. So it should get past that message and either run successfully or crash.

 

Thanks for your time.

Posted
Works fine here with the prior binaries. Will test later with the latest.

Posted (edited)
//Pieter update on Elmor test//

 

The "RDSEED not supported" error was on an early 12c sample. It works on the 7900X.

 

Single Precision - Add/subtract

 

AVX128 = 1.00x

AVX256 = 1.84x

AVX512 = 3.51x

Edited by Massman
Posted
That's good to see! :D Full output?

 

Though I'm seeing rumors that the integer throughput will not be doubled. And I can see architecturally why that might be. Unfortunately I don't have a benchmark for that.

Posted (edited)
Wow! Over 1 TFlops for double-precision! :eek:

 

CPU-Z doesn't seem accurate in that screenshot. But based on the numbers it looks like you were clocked around 4.5 GHz? Possibly in a 100 x 45 configuration?

 

And it didn't melt?

 

 

4.5Ghz (45x100) on the dot, 1.2vcore with a Corsair H110i AIO.

 

Will try with my good RAM on Monday (I forgot it at the house) and see if it scales at all.

 

Mesh was at 2.8Ghz for the above screenie.

Edited by l0ud_sil3nc3
Posted
Ram will have no effect on that benchmark. The benchmark is 100% CPU.

 

I was able to calculate your clock speed because the benchmark achieves very close to the theoretical FLOPs on the system.

 

For the Core i9 7900X assuming full-throughput AVX512:

 

(2 FMA/cycle for full-throughput AVX512) * (2 Flops/FMA) * (8 DP/instruction for AVX512) * (10 cores) * (4.5 GHz) = 1440 GFlops

 

The benchmark is showing 1443.84 GFlops. It's actually slightly more than the theoretical limit because of timing variations.
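
The same arithmetic can be run in both directions: forward to get the theoretical peak, or backward to estimate the clock speed from a measured number. A small sketch, with function names made up for illustration:

```python
def peak_gflops(fma_per_cycle: int, lanes: int, cores: int, ghz: float) -> float:
    """Theoretical peak: FMA units * 2 flops per FMA * SIMD lanes * cores * clock."""
    return fma_per_cycle * 2 * lanes * cores * ghz


def implied_clock_ghz(measured_gflops: float, fma_per_cycle: int, lanes: int, cores: int) -> float:
    """Invert the peak formula to estimate the clock from a measured result."""
    return measured_gflops / (fma_per_cycle * 2 * lanes * cores)


# Core i9 7900X, double precision (8 lanes per 512-bit register), 10 cores:
print(peak_gflops(2, 8, 10, 4.5))                       # 1440.0
print(round(implied_clock_ghz(1443.84, 2, 8, 10), 3))   # 4.512
```

The backward direction only works because the benchmark runs so close to the theoretical limit. It also assumes you already know how many FMA units are active.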

Posted
You're in luck, I'm staying late. Just popped in the 6c for science, bro :D

 

Results incoming?

 

Not sure if there's an 8c lying around here. . . .

Posted
6c results 4.5 (45x100) 1.2vcore:

 

[Imgur link: screenshot of the 6c results]

 

No 8c in office atm :(

 

Now THAT's interesting... They also show full-throughput AVX512. That's contrary to what all the articles out there are reporting.

 

(2 FMA/cycle for full-throughput AVX512) * (2 Flops/FMA) * (8 DP/instruction for AVX512) * (6 cores) * (4.5 GHz) = 864 GFlops

 

Benchmark shows 872.832 GFlops.

 

If they were only half-throughput, I'd have expected:

 

(1 FMA/cycle for half-throughput AVX512) * (2 Flops/FMA) * (8 DP/instruction for AVX512) * (6 cores) * (4.5 GHz) = 432 GFlops
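
Put another way, when the clock is known you can solve for the number of FMAs per cycle directly and see whether it lands near 1 or near 2. A rough sketch (the function name is made up for illustration):

```python
def fma_units_per_cycle(measured_gflops: float, lanes: int, cores: int, ghz: float) -> float:
    """Estimate FMAs issued per cycle per core from a measured FMA GFlops result."""
    # Each FMA counts as 2 flops on each SIMD lane.
    return measured_gflops / (2 * lanes * cores * ghz)


# 6-core result: 872.832 GFlops double precision (8 lanes) at 4.5 GHz
est = fma_units_per_cycle(872.832, lanes=8, cores=6, ghz=4.5)
print(f"{est:.2f} FMA/cycle/core")  # 2.02
```

A value near 2 means both FMA units are active. A value near 1 would have pointed to half-throughput.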

 

Thanks for running all these benchmarks!

  • 3 months later...
Posted (edited)
King of the FLOPS!

 

[Image: WQie9Ft.png]

 

WOW... :eek: AVX512 @ 4.5 GHz. How many watts did it pull? :o

 

That also confirms full-throughput AVX512 (both FMAs enabled) for the 7980XE.

 

*I love how Intel calls this a "1 teraflop" CPU when it's really 2 - 3 TFLOPs (stock), or 5 TFLOPs (here at 4.5 GHz).
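
For anyone wondering where those TFLOPs figures come from, here is a rough single-precision sketch. It assumes 18 cores for the 7980XE and two active FMA units per core. The 2.0 and 2.6 GHz entries are only illustrative "stock-ish" clocks, since the real clock under AVX512 load depends on the chip and board.

```python
CORES = 18      # Core i9 7980XE core count (assumption: not stated in this thread)
SP_LANES = 16   # 512 bits / 32-bit single-precision floats


def sp_peak_tflops(ghz: float, fma_per_cycle: int = 2) -> float:
    """Single-precision peak in TFLOPs: FMA units * 2 flops * lanes * cores * clock."""
    return fma_per_cycle * 2 * SP_LANES * CORES * ghz / 1000.0


# 2.0 and 2.6 GHz are illustrative stock-ish clocks; 4.5 GHz is the overclock above.
for ghz in (2.0, 2.6, 4.5):
    print(f"{ghz:.1f} GHz -> {sp_peak_tflops(ghz):.2f} TFLOPs")
```

Under those assumptions, anything from 2.0 GHz up gives at least 2 TFLOPs, and 4.5 GHz gives a bit over 5.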

Edited by Mysticial
Posted
Didn't have the WattsPro hooked up, but this was with 1.25v Vcore. To give you an idea, 1.3v at 4.8 R15 is roughly 800ish watts, and I have seen over 1000w on LN2 pretty easily if you don't optimize Vcore.

 

This is the Chiron of CPUs :D
