Jump to content
HWBOT Community Forums

eachus

Members
  • Posts

    1
  • Joined

  • Last visited

About eachus

  • Birthday 01/11/1946

Converted

  • Location
    Nashua, NH, USA

Converted

  • Interests
    I have a degree in Operations Research, which means I like real messy mathematics.

Converted

  • Occupation
    retired Software Engineer

Converted

  • realname
    Robert Iredell Eachus

eachus's Achievements

Newbie

Newbie (1/14)

10

Reputation

  1. Just to be clear, AMD supplies the CPU BIOS to the motherboard manufacturers, who build it into their motherboards. So the fix may be waiting on validation, but it is the validation at the mobo maker, and different mobo makers will send out their fix at different times. However, don't worry that much about working around it. AFAIK no code exists that does real work and runs into this bug. It may be possible to come up with some computational fluid dynamics (CFD) code that runs into the problem. But linear algebra code (matrix multiplication, eigenvalues, inverses, etc.) that actually does real work writes the results to memory rather than overwriting it like FLOPS does. You can, in theory have a long sequence of FMA3 instructions that only touch L1 cache, but in practice you will have cache misses.* Even if these are caught by L2, that should give the CPU a break. Is it likely that code you write will hit this problem? Highly unlikely, you need two threads on the same CPU pounding away, or one instruction stream that contains FMA3 instructions 256 or 512 bits wide. Oh, and remember you need to get all that loop cruft into one clock cycle: two load instructions which increment their indicies, the FMA3, a load that moves the result somewhere, and a conditional jump instruction. Do all that in one clock cycle? More to the point Get all those microOps through the front-end in one clock cycle? I can do it, with both AMD and Intel hardware, but it isn't easy, and every new processor generation I have to check to see which version works right there, or if I need something new. Ryzen can dispatch six integer (including index and move instructions) and four floating-point microOps in one clock, so it is not that hard. But notice that the four floating-point microOps can be taken up by a 256-bit FMA3 instruction. A 512-bit FMA3 takes two clock cycles so lots of integer room to play with--this generation. *Yes, I can write junk code which does run several hundred FMA3 instructions in a row. Real matrix multiplication code splits big matrices into small chunks, and use write through move instructions to write results to avoid cache pollution. You don't want final results or partials that won't be used again for seconds to stay in cache.
×
×
  • Create New...