Hello, let's suppose that we have a benchmark for endian swaps and popcounts.
Intro: Let us have any 32bit integer, a 386 CPU, 486 CPU and Intel Nehalem CPU.
Procedure:
Endian swap:
let x := 0xaabbccdd;
expected result : 0xddccbbaa;
Population count:
for any 32bit integer x, the expected result is number of bits set to 1 in given x
My query arises from the fact that each of the 3 specified example CPUs can achieve these results in different ways, one better than the other.
386:
00000000 <__bswap_32>:
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: 8b 45 08 mov 0x8(%ebp),%eax
6: 66 c1 c0 08 rol $0x8,%ax
a: c1 c0 10 rol $0x10,%eax
d: 66 c1 c0 08 rol $0x8,%ax
11: 5d pop %ebp
12: c3 ret
00000013 <_mm_popcnt_u32>:
13: 55 push %ebp
14: 89 e5 mov %esp,%ebp
16: 83 ec 10 sub $0x10,%esp
19: c7 45 f8 00 00 00 00 movl $0x0,-0x8(%ebp)
20: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%ebp)
27: eb 19 jmp 42 <_mm_popcnt_u32+0x2f>
29: 8b 45 fc mov -0x4(%ebp),%eax
2c: 8b 55 08 mov 0x8(%ebp),%edx
2f: 88 c1 mov %al,%cl
31: d3 fa sar %cl,%edx
33: 89 d0 mov %edx,%eax
35: 83 e0 01 and $0x1,%eax
38: 85 c0 test %eax,%eax
3a: 74 03 je 3f <_mm_popcnt_u32+0x2c>
3c: ff 45 f8 incl -0x8(%ebp)
3f: ff 45 fc incl -0x4(%ebp)
42: 83 7d fc 1f cmpl $0x1f,-0x4(%ebp)
46: 7e e1 jle 29 <_mm_popcnt_u32+0x16>
48: 8b 45 f8 mov -0x8(%ebp),%eax
4b: c9 leave
4c: c3 ret
00000055 <popcnt>:
55: 55 push %ebp
56: 89 e5 mov %esp,%ebp
58: ff 75 08 pushl 0x8(%ebp)
5b: e8 aa ff ff ff call a <_mm_popcnt_u32>
60: 83 c4 04 add $0x4,%esp
63: c9 leave
64: c3 ret
486 has a dedicated instruction for byteswaps:
00000000 <__bswap_32>:
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: 8b 45 08 mov 0x8(%ebp),%eax
6: 0f c8 bswap %eax
8: 5d pop %ebp
9: c3 ret
Nehalem has a dedicated instruction for popcounts:
0000001b <popcnt>:
1b: 55 push %ebp
1c: 89 e5 mov %esp,%ebp
1e: 83 ec 10 sub $0x10,%esp
21: 8b 45 08 mov 0x8(%ebp),%eax
24: 89 45 fc mov %eax,-0x4(%ebp)
27: f3 0f b8 45 fc popcnt -0x4(%ebp),%eax
2c: 90 nop
2d: c9 leave
2e: c3 ret
So, is a fair optimisation something that can exploit the additional instructions of the hardware? Is a benchmark fair if 2 CPUs with different capabilities run the same code to achieve the same results, or is the benchmark still fair if one CPU uses better code, because it actually can run it, to achieve the same result faster?
@update: Sorry, I think I asked this in the wrong section.