moog Posted October 18, 2016 Posted October 18, 2016 (edited) Hello, let's suppose that we have a benchmark for endian swaps and popcounts. Intro: Let us have any 32bit integer, a 386 CPU, 486 CPU and Intel Nehalem CPU. Procedure: Endian swap: let x := 0xaabbccdd; expected result : 0xddccbbaa; Population count: for any 32bit integer x, the expected result is number of bits set to 1 in given x My query arises from the fact that each of the 3 specified example CPUs can achieve these results in different ways, one better than the other. 386: 00000000 <__bswap_32>: 0: 55 push %ebp 1: 89 e5 mov %esp,%ebp 3: 8b 45 08 mov 0x8(%ebp),%eax 6: 66 c1 c0 08 rol $0x8,%ax a: c1 c0 10 rol $0x10,%eax d: 66 c1 c0 08 rol $0x8,%ax 11: 5d pop %ebp 12: c3 ret 00000013 <_mm_popcnt_u32>: 13: 55 push %ebp 14: 89 e5 mov %esp,%ebp 16: 83 ec 10 sub $0x10,%esp 19: c7 45 f8 00 00 00 00 movl $0x0,-0x8(%ebp) 20: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%ebp) 27: eb 19 jmp 42 <_mm_popcnt_u32+0x2f> 29: 8b 45 fc mov -0x4(%ebp),%eax 2c: 8b 55 08 mov 0x8(%ebp),%edx 2f: 88 c1 mov %al,%cl 31: d3 fa sar %cl,%edx 33: 89 d0 mov %edx,%eax 35: 83 e0 01 and $0x1,%eax 38: 85 c0 test %eax,%eax 3a: 74 03 je 3f <_mm_popcnt_u32+0x2c> 3c: ff 45 f8 incl -0x8(%ebp) 3f: ff 45 fc incl -0x4(%ebp) 42: 83 7d fc 1f cmpl $0x1f,-0x4(%ebp) 46: 7e e1 jle 29 <_mm_popcnt_u32+0x16> 48: 8b 45 f8 mov -0x8(%ebp),%eax 4b: c9 leave 4c: c3 ret 00000055 <popcnt>: 55: 55 push %ebp 56: 89 e5 mov %esp,%ebp 58: ff 75 08 pushl 0x8(%ebp) 5b: e8 aa ff ff ff call a <_mm_popcnt_u32> 60: 83 c4 04 add $0x4,%esp 63: c9 leave 64: c3 ret 486 has a dedicated instruction for byteswaps: 00000000 <__bswap_32>: 0: 55 push %ebp 1: 89 e5 mov %esp,%ebp 3: 8b 45 08 mov 0x8(%ebp),%eax 6: 0f c8 bswap %eax 8: 5d pop %ebp 9: c3 ret Nehalem has a dedicated instruction for popcounts: 0000001b <popcnt>: 1b: 55 push %ebp 1c: 89 e5 mov %esp,%ebp 1e: 83 ec 10 sub $0x10,%esp 21: 8b 45 08 mov 0x8(%ebp),%eax 24: 89 45 fc mov %eax,-0x4(%ebp) 27: f3 0f b8 45 fc popcnt -0x4(%ebp),%eax 2c: 90 nop 2d: c9 leave 2e: c3 ret So, is a fair optimisation something that can exploit the additional instructions of the hardware? Is a benchmark fair if 2 CPUs with different capabilities run the same code to achieve the same results, or is the benchmark still fair if one CPU uses better code, because it actually can run it, to achieve the same result faster? @update: Sorry, I think I asked this in the wrong section. Edited October 18, 2016 by moog Quote
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.