HWBOT Community Forums
Mysticial

A Favor to Ask: Skylake X and AVX512

Right now, there are conflicting reports about whether this first line of Skylake X processors (based on the 10-core Skylake Purley LCC die) will have full-throughput AVX512.

If they do not, the current Skylake X processors will only be able to run AVX512 at half the speed of the server Xeons; in other words, no better than AVX2.

 

I want to definitively answer this question - both for myself and for anyone else looking to purchase a Skylake X processor for the purpose of AVX512.

Using the same FLOPs benchmark that discovered the Ryzen FMA bug, we should be able to find out if Skylake X has full-throughput, or half-throughput AVX512.

 

So my request is for someone who has a Skylake X sample* to:

  1. Run the "2017-SkylakePurley" binary here: https://github.com/Mysticial/Flops/tree/master/version3/binaries-windows**
  2. Do it at a fixed CPU frequency (to avoid the effects of Turbo Boost).
  3. Do it with HT enabled.
  4. Don't use an extreme overclock. If the chip has full-throughput AVX512, then those AVX512 instructions may produce more heat than any other benchmark you've ever run.
  5. Do it with a fully updated Windows 10. Or a recent version of Linux (like Ubuntu 17.04). This is needed to ensure that the OS has support for AVX512.

*I may be wrong, but I don't believe Skylake X benchmarks are under NDA anymore since there's already a gazillion HWBOT submissions and you can get access to the server variants on Google Cloud.

 

**The source code is also in that GitHub repo if you want to build it yourself. But be aware that you need the Intel Compiler if you want to build the AVX512 binaries for Windows.
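For step 5 on Linux, a quick sanity check before running the binary is to look for the relevant feature flags in /proc/cpuinfo. This is only a rough sketch and not part of the Flops benchmark: the helper names are made up for illustration, and it assumes the kernel reports the features under the usual x86 flag names (avx512f, rdseed, and so on). It does not apply to Windows.

```python
def cpu_flags(cpuinfo_text: str) -> set:
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()  # no 'flags' line found (e.g. non-x86 or non-Linux)


def report(cpuinfo_text: str, wanted=("avx2", "fma", "avx512f", "rdseed")) -> dict:
    """Map each wanted flag name to True/False depending on whether it is listed."""
    flags = cpu_flags(cpuinfo_text)
    return {name: name in flags for name in wanted}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not on Linux; every flag will show as missing
    for name, present in report(text).items():
        print(f"{name:8s} {'yes' if present else 'no'}")
```

If avx512f shows as missing on a chip that should have it, that points at the OS/kernel or a BIOS setting rather than at the benchmark.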

 

----------------

 

When you run the benchmark, I expect one of 3 things to happen:

  1. The binary crashes: This means that Windows 10 does not have support for AVX512 and we'll need to wait for that support.
  2. The numbers for 512-bit AVX are about the same as the 256-bit AVX: This means that the processor only supports half-throughput AVX512.
  3. The numbers for the 512-bit AVX are about 2x those of the 256-bit AVX: This means that the processor supports full-throughput AVX512.
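Outcomes 2 and 3 boil down to the ratio between a 512-bit result and the matching 256-bit result. A minimal sketch of that check is below; the function name and the 0.25 tolerance are arbitrary choices for illustration, not something the benchmark itself does. The example call uses the single-thread double-precision FMA numbers from the Google Cloud run in this post.

```python
def classify_avx512(gflops_256: float, gflops_512: float, tolerance: float = 0.25) -> str:
    """Classify AVX512 throughput from matching 256-bit and 512-bit GFlops results.

    A ratio near 2.0 suggests full throughput, a ratio near 1.0 suggests
    half throughput. The tolerance is an arbitrary illustrative cutoff.
    """
    ratio = gflops_512 / gflops_256
    if abs(ratio - 2.0) <= tolerance:
        return "full-throughput"
    if abs(ratio - 1.0) <= tolerance:
        return "half-throughput"
    return "inconclusive"


if __name__ == "__main__":
    # Double-precision FMA, 1 thread: 256-bit FMA3 vs 512-bit AVX512
    print(classify_avx512(31.872, 63.744))
```

Compare like with like (same precision, same operation, same thread count), and run at a fixed clock so that frequency changes don't skew the ratio.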

 

Here is what the benchmark looks like for a 32-core Skylake Purley system on Google Cloud running at 2.0 GHz with 2.5 GHz turbo:

 

Running Skylake Purley tuned binary with 1 thread...

Single-Precision - 128-bit AVX - Add/Sub
   GFlops = 15.904
   Result = 2.02376e+06

Double-Precision - 128-bit AVX - Add/Sub
   GFlops = 7.952
   Result = 1.00995e+06

Single-Precision - 128-bit AVX - Multiply
   GFlops = 15.936
   Result = 2.03498e+06

Double-Precision - 128-bit AVX - Multiply
   GFlops = 7.968
   Result = 1.00712e+06

Single-Precision - 128-bit AVX - Multiply + Add
   GFlops = 15.936
   Result = 1.69085e+06

Double-Precision - 128-bit AVX - Multiply + Add
   GFlops = 7.968
   Result = 841756

Single-Precision - 128-bit FMA3 - Fused Multiply Add
   GFlops = 31.872
   Result = 2.02868e+06

Double-Precision - 128-bit FMA3 - Fused Multiply Add
   GFlops = 15.936
   Result = 1.01782e+06

Single-Precision - 256-bit AVX - Add/Sub
   GFlops = 31.808
   Result = 4.06688e+06

Double-Precision - 256-bit AVX - Add/Sub
   GFlops = 15.936
   Result = 2.02901e+06

Single-Precision - 256-bit AVX - Multiply
   GFlops = 31.872
   Result = 4.06158e+06

Double-Precision - 256-bit AVX - Multiply
   GFlops = 15.936
   Result = 2.02013e+06

Single-Precision - 256-bit AVX - Multiply + Add
   GFlops = 31.872
   Result = 3.34696e+06

Double-Precision - 256-bit AVX - Multiply + Add
   GFlops = 15.936
   Result = 1.70441e+06

Single-Precision - 256-bit FMA3 - Fused Multiply Add
   GFlops = 63.744
   Result = 4.0399e+06

Double-Precision - 256-bit FMA3 - Fused Multiply Add
   GFlops = 31.872
   Result = 2.00801e+06

Single-Precision - 512-bit AVX512 - Add/Sub
   GFlops = 63.744
   Result = 8.11456e+06

Double-Precision - 512-bit AVX512 - Add/Sub
   GFlops = 31.872
   Result = 4.03949e+06

Single-Precision - 512-bit AVX512 - Multiply
   GFlops = 63.36
   Result = 8.0743e+06

Double-Precision - 512-bit AVX512 - Multiply
   GFlops = 31.872
   Result = 4.05014e+06

Single-Precision - 512-bit AVX512 - Multiply + Add
   GFlops = 63.744
   Result = 6.68723e+06

Double-Precision - 512-bit AVX512 - Multiply + Add
   GFlops = 31.872
   Result = 3.3739e+06

Single-Precision - 512-bit AVX512 - Fused Multiply Add
   GFlops = 127.488
   Result = 8.22848e+06

Double-Precision - 512-bit AVX512 - Fused Multiply Add
   GFlops = 63.744
   Result = 4.03805e+06


Running Skylake Purley tuned binary with 64 thread(s)...

Single-Precision - 128-bit AVX - Add/Sub
   GFlops = 683.36
   Result = 8.68179e+07

Double-Precision - 128-bit AVX - Add/Sub
   GFlops = 263.568
   Result = 3.35065e+07

Single-Precision - 128-bit AVX - Multiply
   GFlops = 527.616
   Result = 6.69453e+07

Double-Precision - 128-bit AVX - Multiply
   GFlops = 263.88
   Result = 3.34619e+07

Single-Precision - 128-bit AVX - Multiply + Add
   GFlops = 527.136
   Result = 5.58561e+07

Double-Precision - 128-bit AVX - Multiply + Add
   GFlops = 263.64
   Result = 2.79832e+07

Single-Precision - 128-bit FMA3 - Fused Multiply Add
   GFlops = 1056.77
   Result = 6.71142e+07

Double-Precision - 128-bit FMA3 - Fused Multiply Add
   GFlops = 528.336
   Result = 3.36188e+07

Single-Precision - 256-bit AVX - Add/Sub
   GFlops = 1054.14
   Result = 1.34076e+08

Double-Precision - 256-bit AVX - Add/Sub
   GFlops = 527.52
   Result = 6.68866e+07

Single-Precision - 256-bit AVX - Multiply
   GFlops = 1056.77
   Result = 1.34416e+08

Double-Precision - 256-bit AVX - Multiply
   GFlops = 527.664
   Result = 6.70251e+07

Single-Precision - 256-bit AVX - Multiply + Add
   GFlops = 1055.33
   Result = 1.12018e+08

Double-Precision - 256-bit AVX - Multiply + Add
   GFlops = 527.52
   Result = 5.59086e+07

Single-Precision - 256-bit FMA3 - Fused Multiply Add
   GFlops = 2110.08
   Result = 1.34046e+08

Double-Precision - 256-bit FMA3 - Fused Multiply Add
   GFlops = 1055.33
   Result = 6.69451e+07

Single-Precision - 512-bit AVX512 - Add/Sub
   GFlops = 2112.26
   Result = 2.68216e+08

Double-Precision - 512-bit AVX512 - Add/Sub
   GFlops = 1056
   Result = 1.34131e+08

Single-Precision - 512-bit AVX512 - Multiply
   GFlops = 2117.38
   Result = 2.69031e+08

Double-Precision - 512-bit AVX512 - Multiply
   GFlops = 1059.26
   Result = 1.34601e+08

Single-Precision - 512-bit AVX512 - Multiply + Add
   GFlops = 2118.14
   Result = 2.24393e+08

Double-Precision - 512-bit AVX512 - Multiply + Add
   GFlops = 1058.5
   Result = 1.12102e+08

Single-Precision - 512-bit AVX512 - Fused Multiply Add
   GFlops = 4242.43
   Result = 2.69409e+08

Double-Precision - 512-bit AVX512 - Fused Multiply Add
   GFlops = 2115.07
   Result = 1.34365e+08

 

This Skylake Purley system has full-throughput AVX512.

Bump. NDAs lifting today.

 

I'm most curious about the 7820X and the 7900X.

 

EDIT:

 

The reviews seem to indicate that the 6 and 8-core models will have half-throughput, and the 10-core model will have full-throughput. See "Microarchitecture Analysis: Adding in AVX-512 and Tweaks to Skylake-S" in "The Intel Skylake-X Review: Core i9 7900X, i7 7820X and i7 7800X Tested".

Edited by Mysticial

[Attachment]

 

Windows 10 1703 with Intel C++ redists installed.

 

Thank you!

 

This is interesting though. The compiler seems to be trying to enforce that the computer has RDSEED instructions. But RDSEED was already available starting from Broadwell. I don't see why it would be missing from Skylake X unless it was explicitly disabled in the BIOS or something.

 

This might be a problem moving forward since the compiler forces these checks even though most programs won't use them anyway.

 

EDIT:

 

Is virtualization disabled in the BIOS? I'm reading around and it seems that some machines have all the crypto instructions disabled (AES-NI, RDRAND, and RDSEED) and it may be related to virtualization.

Edited by Mysticial

I found a way to disable that check by the compiler and I've updated the binaries.

 

So if anyone is willing to try now, it should (hopefully) work regardless of whether RDSEED is enabled or not.

 

Thanks.

It was, but I still get the same result after enabling it.

Would you be able to try with the latest binaries? I updated them last night.

 

As far as I can tell, I've removed the check. So it should get past that message and either run successfully or crash.

 

Thanks for your time.

Works fine here with the prior binaries; will test later with the latest.

I don't believe you, send me your X299 gear so I can see first hand. :P

//Pieter update on Elmor test//

 

The "RDSEED not supported" error was on an early 12c sample. It works on the 7900X.

 

Single Precision - Add/subtract

 

AVX128 = 1.00x

AVX256 = 1.84x

AVX512 = 3.51x

Edited by Massman

That's good to see! :D Full output?

 

Though I'm seeing rumors that the integer throughput will not be doubled. And I can see architecturally why that might be. Unfortunately I don't have a benchmark for that.

Here is my testing, how many Gflops? All the Gflops :)

 

[Image: EeThD.jpg, hosted on Imgur]

Will test the 6c on Monday.

 

Wow! Over 1 TFlops for double-precision! :eek:

 

CPU-Z doesn't seem accurate in that screenshot. But based on the numbers, it looks like you were clocked at around 4.5 GHz? Possibly in a 100 x 45 configuration?

 

And it didn't melt?

4.5 GHz (45x100) on the dot, 1.2 V Vcore with a Corsair H110i AIO.

 

Will try with my good RAM on Monday (I forgot it at the house) and see if it scales at all.

 

Mesh was at 2.8 GHz for the above screenie.

Edited by l0ud_sil3nc3

RAM will have no effect on that benchmark. The benchmark is 100% CPU.

 

I was able to calculate your clock speed because the benchmark achieves very close to the theoretical FLOPs on the system.

 

For the Core i9 7900X assuming full-throughput AVX512:

 

(2 FMA/cycle for full-throughput AVX512) * (2 Flops/FMA) * (8 DP/instruction for AVX512) * (10 cores) * (4.5 GHz) = 1440 GFlops

 

The benchmark is showing 1443.84 GFlops. It's actually slightly more than the theoretical limit because of timing variations.
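The same arithmetic can be written as a small script. This is just a sketch of the formula above with made-up function names; the defaults (2 FMA units, 8 double-precision lanes per 512-bit instruction, 2 flops per FMA) are the full-throughput AVX512 assumptions from this post.

```python
def peak_gflops(cores: int, ghz: float, fma_per_cycle: int = 2,
                lanes: int = 8, flops_per_fma: int = 2) -> float:
    """Theoretical peak GFlops: FMA/cycle * flops/FMA * lanes * cores * GHz."""
    return fma_per_cycle * flops_per_fma * lanes * cores * ghz


def implied_ghz(measured_gflops: float, cores: int, fma_per_cycle: int = 2,
                lanes: int = 8, flops_per_fma: int = 2) -> float:
    """Invert the peak formula: the clock speed a measured GFlops figure implies."""
    return measured_gflops / (fma_per_cycle * flops_per_fma * lanes * cores)


if __name__ == "__main__":
    # Core i9 7900X: 10 cores, double precision, 4.5 GHz
    print(peak_gflops(cores=10, ghz=4.5))
    # Clock implied by the 1443.84 GFlops the benchmark reported
    print(implied_ghz(1443.84, cores=10))
```

The implied clock only means something if the benchmark really runs near the theoretical peak, which is the case here.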

You're in luck: I'm staying late and just popped in the 6c, for science bro :D

 

Results incoming?

 

Not sure if there's an 8c laying around here. . . .

6c results 4.5 (45x100) 1.2vcore:

 

[Image hosted on Imgur]

 

No 8c in office atm :(

 

Now THAT's interesting... They also show full-throughput AVX512. That's contrary to what all the articles out there are reporting.

 

(2 FMA/cycle for full-throughput AVX512) * (2 Flops/FMA) * (8 DP/instruction for AVX512) * (6 cores) * (4.5 GHz) = 864 GFlops

 

Benchmark shows 872.832 GFlops.

 

If they were only half-throughput, I'd have expected:

 

(1 FMA/cycle for half-throughput AVX512) * (2 Flops/FMA) * (8 DP/instruction for AVX512) * (6 cores) * (4.5 GHz) = 432 GFlops
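To make the comparison explicit, here is a rough sketch that computes both expectations and picks whichever is closer to the measured number. The function names are hypothetical and the 8 double-precision lanes and 2 flops per FMA are the same assumptions as in the formulas above.

```python
def expected_gflops(fma_per_cycle: int, cores: int, ghz: float,
                    lanes: int = 8, flops_per_fma: int = 2) -> float:
    """Expected double-precision AVX512 FMA GFlops for a given FMA rate per cycle."""
    return fma_per_cycle * flops_per_fma * lanes * cores * ghz


def closest_hypothesis(measured_gflops: float, cores: int, ghz: float) -> str:
    """Return the hypothesis whose expected GFlops is nearest to the measurement."""
    candidates = {
        "full-throughput": expected_gflops(2, cores, ghz),  # both FMA units
        "half-throughput": expected_gflops(1, cores, ghz),  # one FMA unit
    }
    return min(candidates, key=lambda name: abs(candidates[name] - measured_gflops))


if __name__ == "__main__":
    # 6-core result from this thread: 872.832 GFlops at 4.5 GHz
    print(closest_hypothesis(872.832, cores=6, ghz=4.5))
```

With 864 vs 432 GFlops as the two candidates, a measurement of 872.832 is nowhere near the half-throughput figure.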

 

Thanks for running all these benchmarks!

King of the FLOPS!

 

[Image: WQie9Ft.png]

 

WOW... :eek: AVX512 @ 4.5 GHz. How many watts did it pull? :o

 

That also confirms full-throughput AVX512 (both FMAs enabled) for the 7980XE.

 

*I love how Intel calls this a "1 teraflop" CPU when it's really 2 - 3 TFLOPs (stock), or 5 TFLOPs (here at 4.5 GHz).

Edited by Mysticial

Didn't have the WattsPro hooked up, but this was with 1.25 V Vcore. To give you an idea, 1.3 V at 4.8 GHz in R15 is roughly 800ish watts, and I have seen over 1000 W on LN2 pretty easily if you don't optimize Vcore.

 

This is the Chiron of CPUs :D

Many PSUs will burn with this beast.

This setup was using an EVGA 1300 W G2. No issues up to 1.35 V on water, but I have not tested this particular unit cold.
