fun kernel/compiler interaction that causes some Linux kernel code to have some superfluous instructions on x86-64 and use a bit more stack space than necessary:

Linux instructs the compiler to prefer 8-byte-aligned stack frames (instead of the standard 16 bytes), which also means the compiler has to assume that, at the start of each function, the stack is only aligned to 8 bytes. So if something tries to do a 16-byte-aligned allocation, the compiler has to emit instructions to save the old stack pointer (even if frame pointers are disabled) and align the stack.
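
As a rough sketch of the arithmetic behind that realignment (the function names are made up for illustration, not taken from any compiler or kernel source): an instruction like `andq $-16, %rsp` rounds the stack pointer down to a multiple of 16. When the incoming stack pointer is only known to be 8-byte aligned, that drops either 0 or 8 bytes, and since the amount isn't known at compile time, the old value has to be saved somewhere so it can be restored.

```c
#include <assert.h>
#include <stdint.h>

/* Round sp down to a multiple of align (must be a power of two).
 * For align == 16 this is the same arithmetic as `andq $-16, %rsp`. */
uintptr_t align_down(uintptr_t sp, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return sp & ~(align - 1);
}

/* Number of bytes the stack pointer moves because of the realignment. */
uintptr_t realign_cost(uintptr_t sp, uintptr_t align)
{
    return sp - align_down(sp, align);
}
```

For example, `realign_cost(0x1008, 16)` is 8 while `realign_cost(0x1010, 16)` is 0; both inputs are valid 8-byte-aligned stack pointers, so the function can't know in advance which case it is in.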

And apparently especially in GCC, any nontrivial stack allocation whose address escapes the compiler's analysis is aligned to 16 bytes even if the object actually requires less alignment:

int foo(void *);
struct s1 { unsigned long a; };
struct s2 { unsigned long a; unsigned long b; };
int bar1() {
    struct s1 s;
    return foo(&s);
}
int bar2() {
    struct s2 s;
    return foo(&s);
}

compiles to this with GCC trunk with flags -O3 -mpreferred-stack-boundary=3:

bar1:
        subq    $8, %rsp
        movq    %rsp, %rdi
        call    foo
        addq    $8, %rsp
        ret
bar2:
        pushq   %rbp
        movq    %rsp, %rbp
        andq    $-16, %rsp
        subq    $16, %rsp
        movq    %rsp, %rdi
        call    foo
        leave
        ret

Note that bar1 doesn't do alignment (probably because struct s1 is simple enough to hit some special case?) while bar2 adds instructions to align the object (even though s1 and s2 have the same alignment requirements).
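
A quick way to double-check the claim about alignment requirements, as a minimal sketch assuming a C11 compiler (the helper function names are hypothetical): both structs only contain `unsigned long` members, so both should report the alignment of `unsigned long`, which is 8 on x86-64 Linux.

```c
#include <assert.h>
#include <stddef.h>

struct s1 { unsigned long a; };
struct s2 { unsigned long a; unsigned long b; };

/* Compile-time check: neither struct asks for more alignment than the other. */
static_assert(_Alignof(struct s1) == _Alignof(struct s2),
              "s1 and s2 should have the same alignment requirement");

/* Runtime accessors, only here so the values can be printed or compared. */
size_t s1_align(void) { return _Alignof(struct s1); }
size_t s2_align(void) { return _Alignof(struct s2); }
```

So any extra alignment applied to the `struct s2` object on the stack is the compiler's choice, not something the type asks for.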

So in bar2, all of these instructions are unnecessary:

        pushq   %rbp
        movq    %rsp, %rbp
        andq    $-16, %rsp
        [...]
        leave
        [...]

and a register (RBP) is wasted here

(clang apparently does this better)

@jann huh, do you know why the kernel asks for only 8 byte alignment?

@osandov @jann possibly size thing given our relatively small stacks?

@jann I'm pretty sure s1 is not a special case, but rather GCC upgrades alignment for s2 in anticipation that doing so will allow accessing it via SSE — it's probably better to avoid that heuristic under -mgeneral-regs-only

@osandov my understanding is that normal userspace code benefits from 16-byte stack alignment because some of the larger vector registers (I think XMM registers in particular) want 16-byte aligned memory. see for example https://www.felixcloutier.com/x86/movaps - if the stack is always 16-byte aligned, the compiler will be able to move data between XMM registers and the stack without having to realign the stack all the time.

The kernel on x86-64 deliberately does not use floating-point/XMM/... registers in normal kernel code because such registers are not saved/restored on syscall entry/exit; so the kernel just has no need for 16-byte stack alignment.

@ljs @osandov @jann SSE is disabled in the kernel, so 16-byte alignment is pointless

@ljs @osandov @jann I believe the reason for 16 byte alignment is to allow for the use of aligned SSE instructions. Since the kernel doesn't allow the use of these instructions in kernel mode, it makes sense to lower the required alignment to 8 bytes.

@jann nasty. Apparently clang has an issue where tail call optimization fails if frame pointers are enabled, which inflates the apparent performance benefit of disabling frame pointers.

@irogers though if a stack allocation's address escapes the analysis, tail call optimization is probably usually impossible anyway?

@jann not sure I'm following as a stack allocation shouldn't outlive the frame in either case. LLVM turns `-fno-omit-frame-pointer` into the same code as if your code has an alloca in it. I suspect the tail call optimization is looking for "call foo; ret" which becomes "jmp foo", and with the "unusual" `-fno-omit-frame-pointer` case this becomes "call foo; pop rbp; ret" (probably more to remove the frame). The compiler could reorder this to be "pop rbp; jmp foo" but is failing to.

@irogers in my examples, the "foo(&s)" calls mean "s" has to stay alive at least until "foo" has finished running, which means it has to stay alive until after the call from bar1/bar2 to foo returns, which means those can't be tail calls. are you talking about some different scenario?
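
To make the two scenarios concrete, here is a minimal sketch (all function names are hypothetical; `consume` just stands in for an external `foo`). In the first caller nothing in the frame is needed after the call, so a compiler may turn "call; ret" into a jump. In the second, the callee gets a pointer into the caller's frame, so the frame has to stay around until the callee returns. If the compiler inlines the small helpers, there is no call left to optimize, so this only illustrates the shape of the code.

```c
/* Stand-in for foo(): reads through the pointer it is given. */
int consume(const int *p)
{
    return *p;
}

int add_one(int x)
{
    return x + 1;
}

/* No local state is needed after the call, so this call is eligible
 * to become a tail call ("jmp add_one"). */
int tail_possible(int x)
{
    return add_one(x);
}

/* s lives in this function's frame and its address escapes to consume(),
 * so the frame must survive until consume() returns: not a tail call. */
int tail_blocked(int x)
{
    int s = x + 1;
    return consume(&s);
}
```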

@jann yeah. I meant more that the assumption that the compiler is doing a good thing isn't necessarily true: alignment, as you say, with tail calls being another example 🙂

@jann want me to turn this into a GCC bugreport?

@jann but it already works as desired if you pass -mgeneral-regs-only: https://godbolt.org/z/s6hWq3ea9

@amonakov hmm... x86 Linux doesn't use -mgeneral-regs-only, but it does KBUILD_CFLAGS += -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/Makefile#n77

And it looks like those flags also cause this stack alignment to be disabled. I guess I was just looking at an old kernel build... never mind.
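
One way to see whether a given translation unit is being built with SSE code generation turned off is to look at the compiler's predefined macros. This is a minimal sketch, assuming GCC or clang, which define `__SSE__` and `__SSE2__` only when the corresponding instruction sets are enabled; the function name is made up for illustration.

```c
/* Returns 1 if the compiler was allowed to emit SSE/SSE2 code for this
 * translation unit, 0 otherwise (for example when it is built with
 * -mno-sse -mno-sse2, or on a non-x86 target). */
int sse_codegen_enabled(void)
{
#if defined(__SSE__) || defined(__SSE2__)
    return 1;
#else
    return 0;
#endif
}
```

A plain x86-64 userspace build would be expected to return 1 here, and a build using the kernel's flags 0, which matches the observation that the extra alignment goes away with those flags.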

@jann it is quite unfortunate that it's structured as a denylist, not an allowlist... even today, it already allows APX registers, so presumably someone will scramble to add -mno-apxf to that line after somebody reports an issue

@jann -mgeneral-regs-only is available on x86 starting with gcc-7, so I guess now that the kernel bumped gcc minimum version to 8, it can finally use that

which would fix the APX EGPR issue mentioned above, exposed since CONFIG_X86_NATIVE_CPU made it upstream

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70738
