Mysticial, posted June 8, 2017

Right now, there are conflicting reports that this first line of Skylake X processors (based on the 10-core Skylake Purley LCC die) will not have full-throughput AVX512.

Skylake-X not support AVX-512 instructions
Skylake-X i7-7900X Performance Leaked: 55% faster than i7-6950X @ 4.5GHz

If this is true, the current Skylake X processors will only be able to run AVX512 at half the speed of the server Xeons. In other words, no better than AVX2.

I want to definitively answer this question, both for myself and for anyone else looking to purchase a Skylake X processor for the purpose of AVX512.

Using the same FLOPs benchmark that discovered the Ryzen FMA bug, we should be able to find out whether Skylake X has full-throughput or half-throughput AVX512.

So my request is for someone who has a Skylake X sample* to:

Run the "2017-SkylakePurley" binary here: https://github.com/Mysticial/Flops/tree/master/version3/binaries-windows**
Do it at a fixed CPU frequency (to avoid the effects of Turbo Boost).
Do it with HT enabled.
Don't use an extreme overclock. If the chip has full-throughput AVX512, then those AVX512 instructions may produce more heat than any other benchmark you've ever run.
Do it with a fully updated Windows 10, or a recent version of Linux (like Ubuntu 17.04). This is needed to ensure that the OS has support for AVX512.

*I may be wrong, but I don't believe Skylake X benchmarks are under NDA anymore since there's already a gazillion HWBOT submissions and you can get access to the server variants on Google Cloud.

**The source code is also in that GitHub repo if you want to build it yourself. But be aware that you need the Intel Compiler if you want to build the AVX512 binaries for Windows.

----------------

When you run the benchmark, I expect one of 3 things to happen:

The binary crashes: This means that Windows 10 does not have support for AVX512 and we'll need to wait for that support.
The numbers for the 512-bit AVX are about the same as those for the 256-bit AVX: This means that the processor only supports half-throughput AVX512.
The numbers for the 512-bit AVX are about 2x those for the 256-bit AVX: This means that the processor supports full-throughput AVX512.

Here is what the benchmark looks like for a 32-core Skylake Purley system on Google Cloud running at 2.0 GHz with 2.5 GHz turbo:

Running Skylake Purley tuned binary with 1 thread...

Single-Precision - 128-bit AVX - Add/Sub GFlops = 15.904 Result = 2.02376e+06
Double-Precision - 128-bit AVX - Add/Sub GFlops = 7.952 Result = 1.00995e+06
Single-Precision - 128-bit AVX - Multiply GFlops = 15.936 Result = 2.03498e+06
Double-Precision - 128-bit AVX - Multiply GFlops = 7.968 Result = 1.00712e+06
Single-Precision - 128-bit AVX - Multiply + Add GFlops = 15.936 Result = 1.69085e+06
Double-Precision - 128-bit AVX - Multiply + Add GFlops = 7.968 Result = 841756
Single-Precision - 128-bit FMA3 - Fused Multiply Add GFlops = 31.872 Result = 2.02868e+06
Double-Precision - 128-bit FMA3 - Fused Multiply Add GFlops = 15.936 Result = 1.01782e+06
Single-Precision - 256-bit AVX - Add/Sub GFlops = 31.808 Result = 4.06688e+06
Double-Precision - 256-bit AVX - Add/Sub GFlops = 15.936 Result = 2.02901e+06
Single-Precision - 256-bit AVX - Multiply GFlops = 31.872 Result = 4.06158e+06
Double-Precision - 256-bit AVX - Multiply GFlops = 15.936 Result = 2.02013e+06
Single-Precision - 256-bit AVX - Multiply + Add GFlops = 31.872 Result = 3.34696e+06
Double-Precision - 256-bit AVX - Multiply + Add GFlops = 15.936 Result = 1.70441e+06
Single-Precision - 256-bit FMA3 - Fused Multiply Add GFlops = 63.744 Result = 4.0399e+06
Double-Precision - 256-bit FMA3 - Fused Multiply Add GFlops = 31.872 Result = 2.00801e+06
Single-Precision - 512-bit AVX512 - Add/Sub GFlops = 63.744 Result = 8.11456e+06
Double-Precision - 512-bit AVX512 - Add/Sub GFlops = 31.872 Result = 4.03949e+06
Single-Precision - 512-bit AVX512 - Multiply GFlops = 63.36 Result = 8.0743e+06
Double-Precision - 512-bit AVX512 - Multiply GFlops = 31.872 Result = 4.05014e+06
Single-Precision - 512-bit AVX512 - Multiply + Add GFlops = 63.744 Result = 6.68723e+06
Double-Precision - 512-bit AVX512 - Multiply + Add GFlops = 31.872 Result = 3.3739e+06
Single-Precision - 512-bit AVX512 - Fused Multiply Add GFlops = 127.488 Result = 8.22848e+06
Double-Precision - 512-bit AVX512 - Fused Multiply Add GFlops = 63.744 Result = 4.03805e+06

Running Skylake Purley tuned binary with 64 thread(s)...

Single-Precision - 128-bit AVX - Add/Sub GFlops = 683.36 Result = 8.68179e+07
Double-Precision - 128-bit AVX - Add/Sub GFlops = 263.568 Result = 3.35065e+07
Single-Precision - 128-bit AVX - Multiply GFlops = 527.616 Result = 6.69453e+07
Double-Precision - 128-bit AVX - Multiply GFlops = 263.88 Result = 3.34619e+07
Single-Precision - 128-bit AVX - Multiply + Add GFlops = 527.136 Result = 5.58561e+07
Double-Precision - 128-bit AVX - Multiply + Add GFlops = 263.64 Result = 2.79832e+07
Single-Precision - 128-bit FMA3 - Fused Multiply Add GFlops = 1056.77 Result = 6.71142e+07
Double-Precision - 128-bit FMA3 - Fused Multiply Add GFlops = 528.336 Result = 3.36188e+07
Single-Precision - 256-bit AVX - Add/Sub GFlops = 1054.14 Result = 1.34076e+08
Double-Precision - 256-bit AVX - Add/Sub GFlops = 527.52 Result = 6.68866e+07
Single-Precision - 256-bit AVX - Multiply GFlops = 1056.77 Result = 1.34416e+08
Double-Precision - 256-bit AVX - Multiply GFlops = 527.664 Result = 6.70251e+07
Single-Precision - 256-bit AVX - Multiply + Add GFlops = 1055.33 Result = 1.12018e+08
Double-Precision - 256-bit AVX - Multiply + Add GFlops = 527.52 Result = 5.59086e+07
Single-Precision - 256-bit FMA3 - Fused Multiply Add GFlops = 2110.08 Result = 1.34046e+08
Double-Precision - 256-bit FMA3 - Fused Multiply Add GFlops = 1055.33 Result = 6.69451e+07
Single-Precision - 512-bit AVX512 - Add/Sub GFlops = 2112.26 Result = 2.68216e+08
Double-Precision - 512-bit AVX512 - Add/Sub GFlops = 1056 Result = 1.34131e+08
Single-Precision - 512-bit AVX512 - Multiply GFlops = 2117.38 Result = 2.69031e+08
Double-Precision - 512-bit AVX512 - Multiply GFlops = 1059.26 Result = 1.34601e+08
Single-Precision - 512-bit AVX512 - Multiply + Add GFlops = 2118.14 Result = 2.24393e+08
Double-Precision - 512-bit AVX512 - Multiply + Add GFlops = 1058.5 Result = 1.12102e+08
Single-Precision - 512-bit AVX512 - Fused Multiply Add GFlops = 4242.43 Result = 2.69409e+08
Double-Precision - 512-bit AVX512 - Fused Multiply Add GFlops = 2115.07 Result = 1.34365e+08

This Skylake Purley system has full-throughput AVX512.
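The half-versus-full rule described in this post can be sketched in a few lines of Python. This is only an illustration, not part of the Flops benchmark: the function name `classify_avx512` and the 0.25 tolerance band are assumptions made here, and the input figures are the FMA results from the Google Cloud run above.

```python
def classify_avx512(gflops_256: float, gflops_512: float, tolerance: float = 0.25) -> str:
    """Classify AVX512 throughput from the 512-bit / 256-bit GFlops ratio.

    The tolerance band is an arbitrary assumption, not something the
    benchmark itself defines.
    """
    ratio = gflops_512 / gflops_256
    if abs(ratio - 2.0) <= tolerance:
        return "full-throughput"   # 512-bit is about 2x the 256-bit number
    if abs(ratio - 1.0) <= tolerance:
        return "half-throughput"   # 512-bit is about the same as 256-bit
    return "inconclusive"


# Single-threaded, single-precision FMA figures (256-bit FMA3 vs 512-bit AVX512).
print(classify_avx512(63.744, 127.488))
# 64-thread, double-precision FMA figures from the same run.
print(classify_avx512(1055.33, 2115.07))
```

Comparing the same operation at the same precision (FMA against FMA, for example) keeps the ratio meaningful; mixing Add/Sub with FMA would introduce its own factor of two.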
Mysticial (Author), posted June 19, 2017 (edited June 19, 2017)

Bump. NDAs are lifting today. I'm most curious about the 7820X and the 7900X.

EDIT: The reviews seem to indicate that the 6-core and 8-core models will have half-throughput AVX512, and the 10-core model will have full-throughput AVX512.

Microarchitecture Analysis: Adding in AVX-512 and Tweaks to Skylake-S - The Intel Skylake-X Review: Core i9 7900X, i7 7820X and i7 7800X Tested
elmor, posted June 21, 2017

Windows 10 1703 with Intel C++ redists installed.
Mysticial (Author), posted June 21, 2017 (edited June 21, 2017)

Thank you! This is interesting though. The compiler seems to be trying to enforce that the computer has RDSEED instructions. But RDSEED was already available starting from Broadwell. I don't see why it would be missing from Skylake X unless it was explicitly disabled in the BIOS or something.

This might be a problem moving forward since the compiler forces these checks even though most programs won't use them anyway.

EDIT: Is virtualization disabled in the BIOS? I'm reading around and it seems that some machines have all the crypto instructions disabled (AES-NI, RDRAND, and RDSEED) and it may be related to virtualization.
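For anyone testing on Linux rather than Windows, one way to see whether the OS reports these instructions is to look at the feature flags in /proc/cpuinfo. The sketch below is an assumption-laden illustration: the helper names `cpu_flags` and `missing_features` are made up here, it only works where /proc/cpuinfo exists, and it reports what the kernel exposes rather than what the Intel compiler's startup check actually tests.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set:
    """Return the feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(flags: set, wanted=("rdseed", "rdrand", "aes", "avx512f")) -> list:
    """List the wanted flags (Linux /proc/cpuinfo names) that are not present."""
    return [name for name in wanted if name not in flags]


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")  # Linux only; absent on Windows
    if path.exists():
        print("missing:", missing_features(cpu_flags(path.read_text())))
    else:
        print("/proc/cpuinfo not available on this OS")
```

An empty list would suggest the instructions are exposed to the OS; a missing `rdseed` alongside a present `avx512f` would match the symptom described above.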
Mysticial (Author), posted June 22, 2017

I found a way to disable the compiler's check and I've updated the binaries. So if anyone is willing to try now, it should (hopefully) work regardless of whether RDSEED is enabled or not.

Thanks.
elmor, posted June 22, 2017

It was, but I still get the same after enabling it.
Mysticial (Author), posted June 22, 2017

Would you be able to try with the latest binaries? I updated them last night. As far as I can tell, I've removed the check. So it should get past that message and either run successfully or crash.

Thanks for your time.
l0ud_sil3nc3, posted June 22, 2017

Works fine here with the prior binaries; will test later with the latest.
Gunslinger, posted June 22, 2017

I don't believe you, send me your X299 gear so I can see first hand.
l0ud_sil3nc3, posted June 22, 2017

No probrem Gunny
Massman, posted June 23, 2017 (edited June 23, 2017)

//Pieter update on Elmor test//

The "RDSEED not supported" message was on an early 12c. It works on the 7900X.

Single Precision - Add/subtract
AVX128 = 1.00x
AVX256 = 1.84x
AVX512 = 3.51x
Mysticial (Author), posted June 23, 2017

That's good to see! Full output?

Though I'm seeing rumors that the integer throughput will not be doubled. And I can see architecturally why that might be. Unfortunately I don't have a benchmark for that.
l0ud_sil3nc3, posted June 24, 2017 (edited June 24, 2017)

Here is my testing. How many Gflops? All the Gflops.

(screenshot on Imgur)

Will test the 6c on Monday.
Mysticial (Author), posted June 24, 2017

Wow! Over 1 TFlops for double-precision!

CPU-Z doesn't seem accurate in that screenshot. But based on the numbers it looks like you were clocked around 4.5 GHz? Possibly in a 100 x 45 configuration? And it didn't melt?
l0ud_sil3nc3, posted June 24, 2017 (edited June 24, 2017)

4.5 GHz (45x100) on the dot, 1.2 V Vcore with a Corsair H110i AIO. I'll try with my good RAM on Monday (I forgot it at the house) and see if it scales at all. Mesh was at 2.8 GHz for the above screenie.
Mysticial (Author), posted June 24, 2017

RAM will have no effect on that benchmark. The benchmark is 100% CPU.

I was able to calculate your clock speed because the benchmark achieves very close to the theoretical FLOPs on the system. For the Core i9 7900X assuming full-throughput AVX512:

(2 FMA/cycle for full-throughput AVX512) * (2 Flops/FMA) * (8 DP/instruction for AVX512) * (10 cores) * (4.5 GHz) = 1440 GFlops

The benchmark is showing 1443.84 GFlops. It's actually slightly more than the theoretical limit because of timing variations.
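The arithmetic in this post, and the reverse step of estimating the clock from a measured result, can be written out as a short Python sketch. The function names `peak_dp_gflops` and `implied_ghz` are hypothetical names chosen here; the inputs (10 cores, 4.5 GHz, 1443.84 GFlops) come from the thread.

```python
def peak_dp_gflops(cores: int, ghz: float, fma_per_cycle: int = 2) -> float:
    """Theoretical double-precision peak for 512-bit FMA, in GFlops."""
    flops_per_fma = 2       # one multiply plus one add
    dp_lanes = 512 // 64    # 8 doubles per 512-bit register
    return fma_per_cycle * flops_per_fma * dp_lanes * cores * ghz


def implied_ghz(measured_gflops: float, cores: int, fma_per_cycle: int = 2) -> float:
    """Clock speed implied by a measured result, assuming the benchmark hits peak."""
    return measured_gflops / peak_dp_gflops(cores, 1.0, fma_per_cycle)


print(peak_dp_gflops(cores=10, ghz=4.5))          # 1440.0
print(round(implied_ghz(1443.84, cores=10), 3))   # 4.512
```

The implied clock of about 4.512 GHz is consistent with the 45 x 100 setting plus a small timing error.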
l0ud_sil3nc3, posted June 24, 2017

You're in luck: I'm staying late and just popped in the 6c, for science bro. Results incoming. Not sure if there's an 8c lying around here...
l0ud_sil3nc3, posted June 24, 2017

6c results at 4.5 GHz (45x100), 1.2 V Vcore:

(screenshot on Imgur)

No 8c in the office atm.
Mysticial (Author), posted June 24, 2017

Now THAT's interesting... They also show full-throughput AVX512. That's contrary to what all the articles out there are reporting.

(2 FMA/cycle for full-throughput AVX512) * (2 Flops/FMA) * (8 DP/instruction for AVX512) * (6 cores) * (4.5 GHz) = 864 GFlops

Benchmark shows 872.832 GFlops.

If they were only half-throughput, I'd have expected:

(1 FMA/cycle for half-throughput AVX512) * (2 Flops/FMA) * (8 DP/instruction for AVX512) * (6 cores) * (4.5 GHz) = 432 GFlops

Thanks for running all these benchmarks!
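The comparison made in this post (which prediction is the measured number closer to?) can be sketched as follows. The names `expected_dp_gflops` and `closest_fma_units` are illustrative assumptions, and picking the nearest prediction is just one simple way to express the reasoning; the 6-core, 4.5 GHz, and 872.832 GFlops figures are from the thread.

```python
def expected_dp_gflops(cores: int, ghz: float, fma_units: int) -> float:
    """Expected double-precision GFlops: FMA units * 2 flops * 8 lanes * cores * GHz."""
    return fma_units * 2 * 8 * cores * ghz


def closest_fma_units(measured: float, cores: int, ghz: float) -> int:
    """Return 1 (half-throughput) or 2 (full-throughput), whichever prediction is nearer."""
    candidates = {units: expected_dp_gflops(cores, ghz, units) for units in (1, 2)}
    return min(candidates, key=lambda units: abs(candidates[units] - measured))


print(expected_dp_gflops(6, 4.5, 2))        # 864.0 (full-throughput prediction)
print(expected_dp_gflops(6, 4.5, 1))        # 432.0 (half-throughput prediction)
print(closest_fma_units(872.832, 6, 4.5))   # 2
```

The measured 872.832 GFlops sits within about 1% of the full-throughput prediction and roughly 2x above the half-throughput one, so the classification is not a close call.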
Mysticial (Author), posted September 29, 2017 (edited September 29, 2017)

King of the FLOPS!

WOW... AVX512 @ 4.5 GHz. How many watts did it pull?

That also confirms full-throughput AVX512 (both FMAs enabled) for the 7980XE.

*I love how Intel calls this a "1 teraflop" CPU when it's really 2 - 3 TFLOPs (stock), or 5 TFLOPs (here at 4.5 GHz).
l0ud_sil3nc3, posted September 29, 2017

Didn't have the WattsPro hooked up, but this was with 1.25 V Vcore. To give you an idea, 1.3 V at 4.8 in R15 is roughly 800ish watts, and I have seen over 1000 W on LN2 pretty easily if you don't optimize Vcore.

This is the Chiron of CPUs.
suzuki, posted September 29, 2017

Many PSUs will burn with this beast.
l0ud_sil3nc3, posted September 29, 2017

This setup was utilizing an EVGA 1300w G2, no issues up to 1.35v on water, but I have not tested this particular unit @ cold.