What is considered a fair optimisation?

moog · October 18, 2016

Hello, let's suppose that we have a benchmark for endian swaps and popcounts.

Intro: Let us have any 32bit integer, a 386 CPU, 486 CPU and Intel Nehalem CPU.

Procedure:

Endian swap:

let x := 0xaabbccdd;

expected result : 0xddccbbaa;

Population count:

for any 32bit integer x, the expected result is number of bits set to 1 in given x

My query arises from the fact that each of the 3 specified example CPUs can achieve these results in different ways, one better than the other.

386:

00000000 <__bswap_32>:
  0:   55                      push   %ebp
  1:   89 e5                   mov    %esp,%ebp
  3:   8b 45 08                mov    0x8(%ebp),%eax
  6:   66 c1 c0 08             rol    $0x8,%ax
  a:   c1 c0 10                rol    $0x10,%eax
  d:   66 c1 c0 08             rol    $0x8,%ax
 11:   5d                      pop    %ebp
 12:   c3                      ret    

00000013 <_mm_popcnt_u32>:
 13:   55                      push   %ebp
 14:   89 e5                   mov    %esp,%ebp
 16:   83 ec 10                sub    $0x10,%esp
 19:   c7 45 f8 00 00 00 00    movl   $0x0,-0x8(%ebp)
 20:   c7 45 fc 00 00 00 00    movl   $0x0,-0x4(%ebp)
 27:   eb 19                   jmp    42 <_mm_popcnt_u32+0x2f>
 29:   8b 45 fc                mov    -0x4(%ebp),%eax
 2c:   8b 55 08                mov    0x8(%ebp),%edx
 2f:   88 c1                   mov    %al,%cl
 31:   d3 fa                   sar    %cl,%edx
 33:   89 d0                   mov    %edx,%eax
 35:   83 e0 01                and    $0x1,%eax
 38:   85 c0                   test   %eax,%eax
 3a:   74 03                   je     3f <_mm_popcnt_u32+0x2c>
 3c:   ff 45 f8                incl   -0x8(%ebp)
 3f:   ff 45 fc                incl   -0x4(%ebp)
 42:   83 7d fc 1f             cmpl   $0x1f,-0x4(%ebp)
 46:   7e e1                   jle    29 <_mm_popcnt_u32+0x16>
 48:   8b 45 f8                mov    -0x8(%ebp),%eax
 4b:   c9                      leave  
 4c:   c3                      ret

00000055 <popcnt>:
 55:   55                      push   %ebp
 56:   89 e5                   mov    %esp,%ebp
 58:   ff 75 08                pushl  0x8(%ebp)
 5b:   e8 aa ff ff ff          call   a <_mm_popcnt_u32>
 60:   83 c4 04                add    $0x4,%esp
 63:   c9                      leave  
 64:   c3                      ret

486 has a dedicated instruction for byteswaps:

00000000 <__bswap_32>:
  0:   55                      push   %ebp
  1:   89 e5                   mov    %esp,%ebp
  3:   8b 45 08                mov    0x8(%ebp),%eax
  6:   0f c8                   bswap  %eax
  8:   5d                      pop    %ebp
  9:   c3                      ret

Nehalem has a dedicated instruction for popcounts:

0000001b <popcnt>:
 1b:   55                      push   %ebp
 1c:   89 e5                   mov    %esp,%ebp
 1e:   83 ec 10                sub    $0x10,%esp
 21:   8b 45 08                mov    0x8(%ebp),%eax
 24:   89 45 fc                mov    %eax,-0x4(%ebp)
 27:   f3 0f b8 45 fc          popcnt -0x4(%ebp),%eax
 2c:   90                      nop
 2d:   c9                      leave  
 2e:   c3                      ret

So, is a fair optimisation something that can exploit the additional instructions of the hardware? Is a benchmark fair if 2 CPUs with different capabilities run the same code to achieve the same results, or is the benchmark still fair if one CPU uses better code, because it actually can run it, to achieve the same result faster?

@update: Sorry, I think I asked this in the wrong section.

Edited October 18, 2016 by moog

Sign In

What is considered a fair optimisation?

Recommended Posts

moog

Link to comment

Share on other sites

Join the conversation

HWBOT

Browse

Activity