A Favor to Ask: Skylake X and AVX512

Mysticial · June 8, 2017

Right now, there are conflicting reports that this first line of Skylake X processors (based on the 10-core Skylake Purley LCC die) will not have full-throughput AVX512.

If this is true, the current Skylake X processors will only be able to run AVX512 at half the speed of the server Xeons - IOW, no better than AVX2.

I want to definitively answer this question - both for myself and for anyone else looking to purchase a Skylake X processor for the purpose of AVX512.

Using the same FLOPs benchmark that discovered the Ryzen FMA bug, we should be able to find out if Skylake X has full-throughput, or half-throughput AVX512.

So my request is for someone who has a Skylake X sample* to:

Run the "2017-SkylakePurley" binary here: https://github.com/Mysticial/Flops/tree/master/version3/binaries-windows**
Do it at a fixed CPU frequency (to avoid the effects of Turbo Boost).
Do it with HT enabled.
Don't use an extreme overclock. If the chip has full-throughput AVX512, then those AVX512 instructions may produce more heat than any other benchmark you've ever run.
Do it with a fully updated Windows 10. Or a recent version of Linux (like Ubuntu 17.04). This is needed to ensure that the OS has support for AVX512.
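
For anyone testing on Linux, one quick sanity check before running is to see whether the kernel reports the avx512f flag in /proc/cpuinfo. This is only a rough sketch and not part of the Flops benchmark; the helper name `has_avx512f` is made up for illustration, and it assumes the usual x86 "flags" line format:

```python
from pathlib import Path


def has_avx512f(cpuinfo_text: str) -> bool:
    """Return True if any 'flags' line in /proc/cpuinfo-style text lists avx512f."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # Only look at the 'flags' field, and match whole tokens.
        if key.strip() == "flags" and "avx512f" in value.split():
            return True
    return False


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")  # Linux only; assumed x86 layout
    if path.exists():
        print("avx512f reported:", has_avx512f(path.read_text()))
    else:
        print("/proc/cpuinfo not found (not Linux?)")
```

A missing flag here would suggest the benchmark's AVX512 tests are unlikely to run on that setup.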

*I may be wrong, but I don't believe Skylake X benchmarks are under NDA anymore since there's already a gazillion HWBOT submissions and you can get access to the server variants on Google Cloud.

**The source code is also in that GitHub repo if you want to build it yourself. But be aware that you need the Intel Compiler if you want to build the AVX512 binaries for Windows.

----------------

When you run the benchmark, I expect one of 3 things to happen:

The binary crashes: This means that Windows 10 does not have support for AVX512 and we'll need to wait for that support.
The numbers for 512-bit AVX are about the same as the 256-bit AVX: This means that the processor only supports half-throughput AVX512.
The numbers for the 512-bit AVX are about 2x those of the 256-bit AVX: This means that the processor supports full-throughput AVX512.

Here is what the benchmark looks like for a 32-core Skylake Purley system on Google Cloud running at 2.0 GHz with 2.5 GHz turbo:

Running Skylake Purley tuned binary with 1 thread...

Single-Precision - 128-bit AVX - Add/Sub
   GFlops = 15.904
   Result = 2.02376e+06

Double-Precision - 128-bit AVX - Add/Sub
   GFlops = 7.952
   Result = 1.00995e+06

Single-Precision - 128-bit AVX - Multiply
   GFlops = 15.936
   Result = 2.03498e+06

Double-Precision - 128-bit AVX - Multiply
   GFlops = 7.968
   Result = 1.00712e+06

Single-Precision - 128-bit AVX - Multiply + Add
   GFlops = 15.936
   Result = 1.69085e+06

Double-Precision - 128-bit AVX - Multiply + Add
   GFlops = 7.968
   Result = 841756

Single-Precision - 128-bit FMA3 - Fused Multiply Add
   GFlops = 31.872
   Result = 2.02868e+06

Double-Precision - 128-bit FMA3 - Fused Multiply Add
   GFlops = 15.936
   Result = 1.01782e+06

Single-Precision - 256-bit AVX - Add/Sub
   GFlops = 31.808
   Result = 4.06688e+06

Double-Precision - 256-bit AVX - Add/Sub
   GFlops = 15.936
   Result = 2.02901e+06

Single-Precision - 256-bit AVX - Multiply
   GFlops = 31.872
   Result = 4.06158e+06

Double-Precision - 256-bit AVX - Multiply
   GFlops = 15.936
   Result = 2.02013e+06

Single-Precision - 256-bit AVX - Multiply + Add
   GFlops = 31.872
   Result = 3.34696e+06

Double-Precision - 256-bit AVX - Multiply + Add
   GFlops = 15.936
   Result = 1.70441e+06

Single-Precision - 256-bit FMA3 - Fused Multiply Add
   GFlops = 63.744
   Result = 4.0399e+06

Double-Precision - 256-bit FMA3 - Fused Multiply Add
   GFlops = 31.872
   Result = 2.00801e+06

Single-Precision - 512-bit AVX512 - Add/Sub
   GFlops = 63.744
   Result = 8.11456e+06

Double-Precision - 512-bit AVX512 - Add/Sub
   GFlops = 31.872
   Result = 4.03949e+06

Single-Precision - 512-bit AVX512 - Multiply
   GFlops = 63.36
   Result = 8.0743e+06

Double-Precision - 512-bit AVX512 - Multiply
   GFlops = 31.872
   Result = 4.05014e+06

Single-Precision - 512-bit AVX512 - Multiply + Add
   GFlops = 63.744
   Result = 6.68723e+06

Double-Precision - 512-bit AVX512 - Multiply + Add
   GFlops = 31.872
   Result = 3.3739e+06

Single-Precision - 512-bit AVX512 - Fused Multiply Add
   GFlops = 127.488
   Result = 8.22848e+06

Double-Precision - 512-bit AVX512 - Fused Multiply Add
   GFlops = 63.744
   Result = 4.03805e+06


Running Skylake Purley tuned binary with 64 thread(s)...

Single-Precision - 128-bit AVX - Add/Sub
   GFlops = 683.36
   Result = 8.68179e+07

Double-Precision - 128-bit AVX - Add/Sub
   GFlops = 263.568
   Result = 3.35065e+07

Single-Precision - 128-bit AVX - Multiply
   GFlops = 527.616
   Result = 6.69453e+07

Double-Precision - 128-bit AVX - Multiply
   GFlops = 263.88
   Result = 3.34619e+07

Single-Precision - 128-bit AVX - Multiply + Add
   GFlops = 527.136
   Result = 5.58561e+07

Double-Precision - 128-bit AVX - Multiply + Add
   GFlops = 263.64
   Result = 2.79832e+07

Single-Precision - 128-bit FMA3 - Fused Multiply Add
   GFlops = 1056.77
   Result = 6.71142e+07

Double-Precision - 128-bit FMA3 - Fused Multiply Add
   GFlops = 528.336
   Result = 3.36188e+07

Single-Precision - 256-bit AVX - Add/Sub
   GFlops = 1054.14
   Result = 1.34076e+08

Double-Precision - 256-bit AVX - Add/Sub
   GFlops = 527.52
   Result = 6.68866e+07

Single-Precision - 256-bit AVX - Multiply
   GFlops = 1056.77
   Result = 1.34416e+08

Double-Precision - 256-bit AVX - Multiply
   GFlops = 527.664
   Result = 6.70251e+07

Single-Precision - 256-bit AVX - Multiply + Add
   GFlops = 1055.33
   Result = 1.12018e+08

Double-Precision - 256-bit AVX - Multiply + Add
   GFlops = 527.52
   Result = 5.59086e+07

Single-Precision - 256-bit FMA3 - Fused Multiply Add
   GFlops = 2110.08
   Result = 1.34046e+08

Double-Precision - 256-bit FMA3 - Fused Multiply Add
   GFlops = 1055.33
   Result = 6.69451e+07

Single-Precision - 512-bit AVX512 - Add/Sub
   GFlops = 2112.26
   Result = 2.68216e+08

Double-Precision - 512-bit AVX512 - Add/Sub
   GFlops = 1056
   Result = 1.34131e+08

Single-Precision - 512-bit AVX512 - Multiply
   GFlops = 2117.38
   Result = 2.69031e+08

Double-Precision - 512-bit AVX512 - Multiply
   GFlops = 1059.26
   Result = 1.34601e+08

Single-Precision - 512-bit AVX512 - Multiply + Add
   GFlops = 2118.14
   Result = 2.24393e+08

Double-Precision - 512-bit AVX512 - Multiply + Add
   GFlops = 1058.5
   Result = 1.12102e+08

Single-Precision - 512-bit AVX512 - Fused Multiply Add
   GFlops = 4242.43
   Result = 2.69409e+08

Double-Precision - 512-bit AVX512 - Fused Multiply Add
   GFlops = 2115.07
   Result = 1.34365e+08

This Skylake Purley system has full-throughput AVX512.
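
As a rough illustration of how to read the numbers, the 512-bit vs. 256-bit comparison can be reduced to a ratio check. This is just a sketch: the function name `classify_avx512` and the 0.25 tolerance are arbitrary choices for illustration, not something the benchmark itself does. The inputs are the single-precision FMA figures from the output above.

```python
def classify_avx512(gflops_256: float, gflops_512: float, tolerance: float = 0.25) -> str:
    """Classify AVX512 throughput from the 512-bit / 256-bit GFlops ratio.

    A ratio near 2 suggests full-throughput, a ratio near 1 suggests
    half-throughput. The tolerance is an arbitrary, illustrative cutoff.
    """
    ratio = gflops_512 / gflops_256
    if abs(ratio - 2.0) <= tolerance:
        return "full-throughput"
    if abs(ratio - 1.0) <= tolerance:
        return "half-throughput"
    return "inconclusive"


# Single-precision FMA: 256-bit FMA3 vs. 512-bit AVX512, from the run above.
single_thread = classify_avx512(63.744, 127.488)
all_threads = classify_avx512(2110.08, 4242.43)
print("1 thread:  ", single_thread)
print("64 threads:", all_threads)
```

Both the 1-thread and 64-thread figures land at a ratio of about 2.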

Massman · June 9, 2017

Fired off some emails

Mysticial · June 19, 2017

Bump. NDAs lifting today.

I'm most curious about the 7820X and the 7900X.

EDIT:

The reviews seem to indicate that the 6 and 8-core models will have half-throughput, and the 10-core model will have full-throughput. See "Microarchitecture Analysis: Adding in AVX-512 and Tweaks to Skylake-S" in "The Intel Skylake-X Review: Core i9 7900X, i7 7820X and i7 7800X Tested".

Edited June 19, 2017 by Mysticial

elmor · June 21, 2017

[Attached image]

Windows 10 1703 with Intel C++ redists installed.

Mysticial · June 21, 2017

Thank you!

This is interesting though. The compiler seems to be trying to enforce that the computer has RDSEED instructions. But RDSEED was already available starting from Broadwell. I don't see why it would be missing from Skylake X unless it was explicitly disabled in the BIOS or something.

This might be a problem moving forward since the compiler forces these checks even though most programs won't use them anyway.

EDIT:

Is virtualization disabled in the BIOS? I'm reading around and it seems that some machines have all the crypto instructions disabled (AES-NI, RDRAND, and RDSEED) and it may be related to virtualization.

Edited June 21, 2017 by Mysticial

Mysticial · June 22, 2017

I found a way to disable that check by the compiler and I've updated the binaries.

So if anyone is willing to try now, it should (hopefully) work regardless of whether RDSEED is enabled or not.

Thanks.

elmor · June 22, 2017

Mysticial wrote: "Is virtualization disabled in the BIOS? I'm reading around and it seems that some machines have all the crypto instructions disabled (AES-NI, RDRAND, and RDSEED) and it may be related to virtualization."

It was, but still get the same after enabling it.

Mysticial · June 22, 2017

Would you be able to try with the latest binaries? I updated them last night.

As far as I can tell, I've removed the check. So it should get past that message and either run successfully or crash.

Thanks for your time.

l0ud_sil3nc3 · June 22, 2017

Works fine here with the prior binaries; will test later with the latest.

Gunslinger · June 22, 2017

I don't believe you, send me your X299 gear so I can see first hand.

l0ud_sil3nc3 · June 22, 2017

No probrem Gunny

Massman · June 23, 2017

elmor wrote: "It was, but still get the same after enabling it."

//Pieter update on Elmor test//

The "RDSEED not supported" result was on an early 12c. It works on the 7900X.

Single Precision - Add/subtract

AVX128 = 1.00x

AVX256 = 1.84x

AVX512 = 3.51x

Edited June 23, 2017 by Massman

Mysticial · June 23, 2017

That's good to see! Full output?

Though I'm seeing rumors that the integer throughput will not be doubled. And I can see architecturally why that might be. Unfortunately I don't have a benchmark for that.

l0ud_sil3nc3 · June 24, 2017

Here is my testing, how many Gflops? All the Gflops

[Imgur link: screenshot of the benchmark results]

Will test the 6c on Monday.

Edited June 24, 2017 by l0ud_sil3nc3

Mysticial · June 24, 2017

Wow! Over 1 TFlops for double-precision! :eek:

CPUz doesn't seem accurate in that screenshot. But based on the numbers it looks like you were clocked around 4.5 GHz? Possibly in 100 x 45 configuration?

And it didn't melt?

l0ud_sil3nc3 · June 24, 2017

4.5Ghz (45x100) on the dot, 1.2vcore with a Corsair H110i AIO.

Will try with my good RAM on Monday (I forgot it at the house) and see if it scales at all.

Mesh was at 2.8Ghz for the above screenie.

Edited June 24, 2017 by l0ud_sil3nc3

Mysticial · June 24, 2017

l0ud_sil3nc3 wrote: "4.5Ghz (45x100) on the dot, 1.2vcore with a Corsair H110i AIO. Will try with my good ram on Monday I forgot them at the house and see if it scales at all. Mesh was at 3Ghz for the above screenie."

Ram will have no effect on that benchmark. The benchmark is 100% CPU.

I was able to calculate your clock speed because the benchmark achieves very close to the theoretical FLOPs on the system.

For the Core i9 7900X assuming full-throughput AVX512:

(2 FMA/cycle for full-throughput AVX512) * (2 Flops/FMA) * (8 DP/instruction for AVX512) * (10 cores) * (4.5 GHz) = 1440 GFlops

The benchmark is showing 1443.84 GFlops. It's actually slightly more than the theoretical limit because of timing variations.
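
The arithmetic above can be sketched in a few lines. The function names `peak_gflops` and `implied_ghz` are made up for illustration; the formula itself is the one written out above (FMA units per cycle, 2 flops per FMA, vector lanes, cores, clock).

```python
def peak_gflops(fma_per_cycle: int, lanes: int, cores: int, ghz: float) -> float:
    """Theoretical peak: FMA units/cycle * 2 flops per FMA * lanes * cores * GHz."""
    return fma_per_cycle * 2 * lanes * cores * ghz


def implied_ghz(measured_gflops: float, fma_per_cycle: int, lanes: int, cores: int) -> float:
    """Invert the formula: the clock that would explain a measured GFlops figure."""
    return measured_gflops / (fma_per_cycle * 2 * lanes * cores)


# Core i9 7900X, double precision (8 lanes per 512-bit vector), 10 cores.
peak = peak_gflops(fma_per_cycle=2, lanes=8, cores=10, ghz=4.5)
clock = implied_ghz(1443.84, fma_per_cycle=2, lanes=8, cores=10)
print(f"theoretical peak at 4.5 GHz: {peak} GFlops")   # 1440.0
print(f"clock implied by 1443.84 GFlops: {clock:.3f} GHz")
```

Running the formula backwards like this is how the clock speed can be estimated from the screenshot alone, on the assumption that the benchmark runs very close to the theoretical limit.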

l0ud_sil3nc3 · June 24, 2017

You're in luck, I'm staying late. Just popped in the 6c for science bro

Results incoming?

Not sure if there's an 8c laying around here. . . .

l0ud_sil3nc3 · June 24, 2017

6c results 4.5 (45x100) 1.2vcore:

[Imgur link: screenshot of the 6c benchmark results]

No 8c in office atm

Mysticial · June 24, 2017

Now THAT's interesting... They also show full-throughput AVX512. That's contrary to what all the articles out there are reporting.

(2 FMA/cycle for full-throughput AVX512) * (2 Flops/FMA) * (8 DP/instruction for AVX512) * (6 cores) * (4.5 GHz) = 864 GFlops

Benchmark shows 872.832 GFlops.

If they were only half-throughput, I'd have expected:

(1 FMA/cycle for half-throughput AVX512) * (2 Flops/FMA) * (8 DP/instruction for AVX512) * (6 cores) * (4.5 GHz) = 432 GFlops
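
The same comparison as a small sketch: compute the double-precision figure expected with two FMA units and with one, then see which one the measured 872.832 GFlops is closer to. The helper name `expected_dp_gflops` is hypothetical; the core count and clock are the 6c values from this thread.

```python
def expected_dp_gflops(fma_per_cycle: int, cores: int, ghz: float, dp_lanes: int = 8) -> float:
    """Expected double-precision GFlops: FMA units * 2 flops * 8 lanes * cores * GHz."""
    return fma_per_cycle * 2 * dp_lanes * cores * ghz


measured = 872.832  # 6c result at 4.5 GHz, from the screenshot

candidates = {
    "full-throughput": expected_dp_gflops(fma_per_cycle=2, cores=6, ghz=4.5),  # 864.0
    "half-throughput": expected_dp_gflops(fma_per_cycle=1, cores=6, ghz=4.5),  # 432.0
}

# Pick whichever hypothesis predicts a number closest to the measurement.
closest = min(candidates, key=lambda name: abs(candidates[name] - measured))
print(candidates)
print("closest match:", closest)
```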

Thanks for running all these benchmarks!

l0ud_sil3nc3 · September 29, 2017

King of the FLOPS!

Mysticial · September 29, 2017

WOW... :eek: AVX512 @ 4.5 GHz. How many watts did it pull?

That also confirms full-throughput AVX512 (both FMAs enabled) for the 7980XE.

*I love how Intel calls this a "1 teraflop" CPU when it's really 2 - 3 TFLOPs (stock), or 5 TFLOPs (here at 4.5 GHz).
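
For reference, here is a rough sketch of where those TFLOPs figures come from, assuming they refer to single precision (16 lanes per 512-bit vector) with both FMA units active. The 18-core count for the 7980XE comes from Intel's published spec, not from this thread, and the function names are made up for illustration.

```python
CORES = 18        # assumption: 7980XE core count per Intel's spec, not stated in the thread
SP_LANES = 16     # 512-bit vector / 32-bit floats
FLOPS_PER_CYCLE = 2 * 2 * SP_LANES * CORES  # 2 FMA units * 2 flops per FMA * lanes * cores


def sp_tflops(ghz: float) -> float:
    """Theoretical single-precision TFLOPs at a given AVX512 clock."""
    return FLOPS_PER_CYCLE * ghz / 1000.0


def ghz_for_tflops(tflops: float) -> float:
    """AVX512 clock that would be needed to reach a given TFLOPs figure."""
    return tflops * 1000.0 / FLOPS_PER_CYCLE


print(f"at 4.5 GHz: {sp_tflops(4.5):.3f} TFLOPs")  # 5.184
print(f"2 TFLOPs needs about {ghz_for_tflops(2.0):.2f} GHz")
print(f"3 TFLOPs needs about {ghz_for_tflops(3.0):.2f} GHz")
```

Double precision would be half of these numbers, since only 8 values fit in a 512-bit vector.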

Edited September 29, 2017 by Mysticial

l0ud_sil3nc3 · September 29, 2017

Didn't have the WattsPro hooked up, but this was with 1.25v Vcore. To give you an idea, 1.3v at 4.8 R15 is roughly 800ish watts, and I have seen over 1000w on LN2 pretty easily if you don't optimize Vcore.

This is the Chiron of CPUs

suzuki · September 29, 2017

Many PSUs will burn with this beast.

l0ud_sil3nc3 · September 29, 2017

This setup was utilizing an EVGA 1300w G2, no issues up to 1.35v on water, but I have not tested this particular unit @ cold.
