Interesting. The highly optimized, unrolled implementation of CRC32C on Intel is about the same speed as a tight loop around ‘crc32q’. Both do about 30G/s on a 4KiB block over many iterations. By comparison, an older version of the tight loop with one extra instruction (probably causing a register dependency with crc32q) does only about 11G/s.
45d350: ┌─> 48 8b 08 movq (%rax),%rcx
45d353: │ f2 48 0f 38 f1 f1 crc32q %rcx,%rsi
45d359: │ 48 83 c0 08 addq $0x8,%rax
45d35d: │ 48 39 d0 cmpq %rdx,%rax
45d360: └── 75 ee jne 45d350 <crc32c_sse42+0x30>
vs:
455000: ┌─> 48 8b 08 movq (%rax),%rcx
455003: │ 89 fe movl %edi,%esi
455005: │ f2 48 0f 38 f1 f1 crc32q %rcx,%rsi
45500b: │ 89 f7 movl %esi,%edi
45500d: │ 48 83 c0 08 addq $0x8,%rax
455011: │ 48 39 d0 cmpq %rdx,%rax
455014: └── 75 ea jne 455000 <crc32c_intel+0x20>
The optimized Linux implementation starts at https://elixir.bootlin.com/linux/latest/source/lib/crc/x86/crc-pclmul-template.S . It also has AVX versions that could be faster than the plain ‘crc32’ instruction.
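For anyone who wants to try the tight loop, here is a rough C sketch of what a loop around ‘crc32q’ could look like. This is not the code behind the listings above: the names crc32c_hw, crc32c_sw, crc32c and crc32c_selftest are made up for illustration, and the bitwise fallback is only there so the sketch still builds and runs without SSE4.2. It assumes GCC or Clang. The accumulator is kept 64 bits wide across iterations, which may be enough to keep the compiler from adding extra movl instructions to the dependency chain, but that is a guess and worth checking in the disassembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <nmmintrin.h>
#define HAVE_CRC32Q 1

/* Tight loop: one crc32q per 8-byte word. The only value carried from
 * one iteration to the next is the accumulator itself. */
__attribute__((target("sse4.2")))
static uint32_t crc32c_hw(uint32_t crc, const unsigned char *p, size_t len)
{
    uint64_t acc = crc;              /* 64-bit accumulator (assumption, see text) */

    while (len >= 8) {
        uint64_t word;
        memcpy(&word, p, 8);         /* unaligned-safe 8-byte load */
        acc = _mm_crc32_u64(acc, word);
        p += 8;
        len -= 8;
    }
    crc = (uint32_t)acc;
    while (len--)                    /* leftover tail bytes */
        crc = _mm_crc32_u8(crc, *p++);
    return crc;
}
#endif

/* Portable bitwise fallback using the reflected Castagnoli polynomial. */
static uint32_t crc32c_sw(uint32_t crc, const unsigned char *p, size_t len)
{
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* CRC-32C with the usual all-ones initial value and final inversion. */
uint32_t crc32c(const unsigned char *p, size_t len)
{
    uint32_t crc = ~0u;
#ifdef HAVE_CRC32Q
    if (__builtin_cpu_supports("sse4.2"))
        return ~crc32c_hw(crc, p, len);
#endif
    return ~crc32c_sw(crc, p, len);
}

/* Self-check against the standard CRC-32C check value for "123456789". */
void crc32c_selftest(void)
{
    assert(crc32c((const unsigned char *)"123456789", 9) == 0xE3069283u);
}
```

Calling crc32c_selftest() once at startup checks the result against the standard test vector. To see whether the compiler really emitted the short loop, look at the output of objdump -d for crc32c_hw.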
Once again there's a kernel developer position open in the SUSE Labs Kernel Core team I'm part of! https://suse.wd3.myworkdayjobs.com/en-US/Jobsatsuse/job/Czech-Republic-EMEA/Linux-Kernel-Generalist_71007379
Always remember The Register is a trash publication that publishes straight-up lies to suit the author's bias.
Their kernel shit is worse than the usual tripe you see in the press (LWN being a massive exception to that of course!)
Good write-up about Linux Kernel Maintainer duties
https://lwn.net/Articles/1007325/