
Interesting. The highly optimized unrolled implementation of CRC32C on Intel is about the same speed as a tight loop around ‘crc32q’. Both do about 30G/s on a 4KiB block over many iterations. By comparison, an older version of the tight loop with extra ‘movl’ instructions around ‘crc32q’ (probably adding a register dependency on it) only does about 11G/s.

45d350:  ┌─> 48 8b 08                     movq   (%rax),%rcx
45d353:  │   f2 48 0f 38 f1 f1            crc32q %rcx,%rsi
45d359:  │   48 83 c0 08                  addq   $0x8,%rax
45d35d:  │   48 39 d0                     cmpq   %rdx,%rax
45d360:  └── 75 ee                        jne    45d350 <crc32c_sse42+0x30>

vs:

455000:  ┌─> 48 8b 08                     movq   (%rax),%rcx
455003:  │   89 fe                        movl   %edi,%esi
455005:  │   f2 48 0f 38 f1 f1            crc32q %rcx,%rsi
45500b:  │   89 f7                        movl   %esi,%edi
45500d:  │   48 83 c0 08                  addq   $0x8,%rax
455011:  │   48 39 d0                     cmpq   %rdx,%rax
455014:  └── 75 ea                        jne    455000 <crc32c_intel+0x20>
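
For reference, a rough C sketch of the kind of source that compiles down to a tight ‘crc32q’ loop like the first listing. This is not the code that was benchmarked: the function names (crc32c_sketch, crc32c_hw, crc32c_bitwise) are made up for illustration, and it assumes GCC or Clang on x86-64, with a portable bitwise fallback everywhere else.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <nmmintrin.h>   /* _mm_crc32_u64, _mm_crc32_u8 */
#define HAVE_CRC32Q 1
#endif

/* Portable bitwise CRC32C (reflected polynomial 0x82F63B78).
 * Serves as a reference and as the fallback path. */
static uint32_t crc32c_bitwise(uint32_t crc, const unsigned char *p, size_t n)
{
    crc = ~crc;
    while (n--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

#ifdef HAVE_CRC32Q
/* Tight loop: one load and one crc32q per 8 bytes. The accumulator is
 * the only loop-carried value, so each crc32q waits on the previous one. */
__attribute__((target("sse4.2")))
static uint32_t crc32c_hw(uint32_t crc, const unsigned char *p, size_t n)
{
    uint64_t acc = (uint32_t)~crc;
    while (n >= 8) {
        uint64_t word;
        memcpy(&word, p, 8);              /* unaligned-safe 8-byte load */
        acc = _mm_crc32_u64(acc, word);
        p += 8;
        n -= 8;
    }
    uint32_t c = (uint32_t)acc;
    while (n--)                           /* remaining 0..7 bytes */
        c = _mm_crc32_u8(c, *p++);
    return ~c;
}
#endif

/* Hypothetical entry point: use crc32q when the CPU reports SSE4.2,
 * otherwise fall back to the bitwise version. */
uint32_t crc32c_sketch(uint32_t crc, const void *buf, size_t n)
{
#ifdef HAVE_CRC32Q
    if (__builtin_cpu_supports("sse4.2"))
        return crc32c_hw(crc, buf, n);
#endif
    return crc32c_bitwise(crc, buf, n);
}
```

Keeping the running CRC in a single 64-bit accumulator, as in crc32c_hw, is what avoids the extra ‘movl’ pair seen in the second listing.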

The optimized Linux implementation starts at https://elixir.bootlin.com/linux/latest/source/lib/crc/x86/crc-pclmul-template.S . It also has AVX versions that could be faster than the plain ‘crc32’ instruction.

Yep, no, too good to be true. The implementation selection was wrong and always used PCLMUL. Interpreting CPU feature sets as linear "levels" somehow works in this case, but the definition ordering was incorrect. Nothing to see here.
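
One way to sidestep this class of bug is to test feature bits directly instead of comparing ordered levels. A minimal sketch of that alternative, not the selection code in question: every name below (FEAT_*, IMPL_*, select_impl) is hypothetical, and the preference order is just an assumption for the example.

```c
/* Hypothetical feature bits. A real build would fill these in from CPUID. */
enum {
    FEAT_SSE42  = 1u << 0,
    FEAT_PCLMUL = 1u << 1
};

/* Hypothetical implementation identifiers. */
typedef enum {
    IMPL_GENERIC,
    IMPL_SSE42,
    IMPL_PCLMUL
} crc_impl;

/* Each implementation names the exact bits it requires, so the result
 * does not depend on the order in which the enum values are declared. */
crc_impl select_impl(unsigned features)
{
    const unsigned need_pclmul = FEAT_SSE42 | FEAT_PCLMUL;

    if ((features & need_pclmul) == need_pclmul)
        return IMPL_PCLMUL;
    if (features & FEAT_SSE42)
        return IMPL_SSE42;
    return IMPL_GENERIC;
}
```

With explicit masks, a CPU that reports only SSE4.2 cannot be routed to the PCLMUL path by accident, whatever the declaration order.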