Conversation

So I just pushed a kernel fix for Asahi Linux to (hopefully) fix random kernel panics.

The fix? Increase kernel stacks to 32K.

We were running out of stack. It turns out that when you have zram enabled and are running out of physical RAM, a memory allocation can trigger a ridiculous call-chain through zram and back into the allocator. This, combined with one or two large-ish stack frames in our GPU driver (2-3K), was simply overflowing the kernel stack.

Here's the thing though: if we were hitting this with simple GPU stuff (which, yes, has a few large stack frames because Rust, but it's a shallow call stack, and all it's doing is a regular memory allocation that triggers the rest of the chain all the way into the overflow), then I guarantee there are kernel call paths that would also run out of stack, today, in upstream kernels with zram (i.e. vanilla Fedora setups).

I'm honestly baffled that, in this day and age, 1) people still think 16K is acceptable, and 2) we still haven't figured out dynamically sized Linux kernel stacks. If we're so close to the edge that a couple of KB of extra stack from Rust nonsense causes kernel panics, then long-tail corner cases of complex subsystem layering are definitely already going over the edge, and people's machines are definitely already crashing, just perhaps less often.

I know there was talk of dynamic kernel stacks recently, and one of the issues was that implementing it is hard on x86 due to a series of bad decisions made many years ago including the x86 double-fault model and the fact that in x86 the CPU implicitly uses the stack on faults. Of course, none of this is a problem for ARM64, so maybe we should just implement it here first and let the x86 people figure something out for their architecture on their own ;).

But on the other hand, why not increase stacks to 32K? ARM64 got bumped to 16K in 2013, over 10 years ago. Minimum RAM size has at least doubled since then, so it stands to reason that doubling the kernel stack size is entirely acceptable. Consider a typical GUI app with ~30 threads: With 32K stacks, that's less than 1MB of RAM, and any random GUI app is already going to use many times more than that in graphics surfaces.

Of course, the hyperscalers will complain because they run services that spawn a billion threads (hi Java) and they like to multiply the RAM usage increase by the size of their fleet to justify their opinions (even though all of this is inherently relative anyway). But the hyperscalers are running custom kernels anyway, so they can crank the size down to 16K if they really want to (or 8K, I heard Google still uses that).

@zanagb We aren't upstreaming this one lol, I don't have time to fight this particular fight. This is going into our ever-growing pile of "things upstream doesn't like or will take a few more years to bikeshed, and we have better things to do".

@nzgray It detects it alright, but it panics. And due to the way the x86 architecture is documented and the way the kernel works, it turns out it's somewhere between hard and impossible to recover from this condition legitimately. Someone tried to implement it and got told "this is undefined behavior, sorry" (even though presumably the patch worked at least most of the time, but Intel and AMD won't promise it does).

On ARM64 it's no problem, of course, and you could safely implement dynamic stack sizes.

@marcan is kernel memory on linux pageable? ideally it could just dump all of the stacks to disk if it doesn't need them right now

@cb I'm not sure if Linux can page out *whole* kernel stacks, but it definitely can't page out only part of them for the aforementioned reasons (it's the same thing as dynamic stacks, you can't recover from the fault once you hit the unmapped page on x86).

@marcan unfortunate. FWIW, i do know NT is very aggressive with making kernel memory pageable, i wonder if it has pageable kernel stacks, and if it does, how the hell is it doing it on x86

@cb I looked it up and NT can page out whole stacks, but not parts as far as I can tell.

NT, however, allows drivers to *dynamically* (explicitly) grow the stack which would be lovely to have on Linux, e.g. doing it at known points of complexity like zram or GPU drivers. But no such mechanism exists (yet).
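
For reference, the NT interface is (from memory, so treat the details as approximate) KeExpandKernelStackAndCallout: a driver hands the stack-hungry work to a callout that runs with the extra stack guaranteed. Roughly:

```c
/* Rough sketch from memory of the NT mechanism; the authoritative
 * reference is the WDK docs, details here are approximate. */
#include <ntddk.h>

/* Runs with the requested extra stack guaranteed to be available. */
static VOID DeepWorkCallout(PVOID Parameter)
{
    UNREFERENCED_PARAMETER(Parameter);
    /* ... the stack-hungry part of the driver would go here ... */
}

static NTSTATUS DoDeepWork(PVOID Context)
{
    /* Ask for ~16K of additional stack before entering the expensive
     * path; this can fail gracefully instead of overflowing. */
    return KeExpandKernelStackAndCallout(DeepWorkCallout, Context, 16 * 1024);
}
```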

@marcan Hmm, would there be a way to increase the size of only a single kernel stack, or can the GPU driver and zram stuff be invoked on any kernel stack?

@Sobex zram can be invoked on any kernel allocation, and the GPU driver can be used by any userspace thread, so not really.

If Linux had dynamic stacks like Windows then we could request a grow ahead of time on known expensive codepaths, but it doesn't.
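
If that mechanism existed, the usage I'd want is boring. A purely hypothetical sketch: kstack_reserve()/kstack_release() and gpu_do_submit() are made up, nothing like this exists in Linux today.

```c
/* Purely hypothetical: kstack_reserve()/kstack_release() do not exist in
 * Linux, and gpu_do_submit() is a made-up stand-in for driver internals. */
#include <linux/gfp.h>

static int gpu_submit_ioctl(void *args)
{
    int ret;

    /* Known-expensive path: deep driver call chain plus possible direct
     * reclaim through zram. Ask for extra stack up front. */
    ret = kstack_reserve(8 * 1024, GFP_KERNEL);
    if (ret)
        return ret;

    ret = gpu_do_submit(args);

    kstack_release();
    return ret;
}
```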

@marcan look @ljs it's the memory management again

@lkundrak I have this guy blocked bro :) life's too short to deal with people like him.

@ljs haha, can see why

but i get it that with the constant negativity he's also a honorific slav

@lkundrak nah, because he's a 'victim' in a way that is very non-Slavic imo. Like 'I'm such a victim anyway remember that time I was a total tech genius lol worship me lol'.

@ljs lol no, sir, you got that wrong
"we're the victims because everybody else is stupid" is precisely how we live our lives every day

@ljs i can say, because i'm a honorific slow

@lkundrak I mean you're from Slowvakia so aren't you more than honorary?

I am an honorary one, I hope (well, up to you, I want you to confirm of course), because I am so incredibly negative but also full of hate and very dark humour
@lkundrak Also I worship Satan 🍷

@ljs praise satan

🍷

king of slovakia

@marcan Dynamic kernel stacks were on the agenda at LSFMM+BPF a couple of weeks ago: https://lwn.net/SubscriberLink/974367/f5ed1f0f9a7b5f88/

@marcan you should be able to get a compile-time warning for functions that use excessive stack frames by setting CONFIG_FRAME_WARN to a lower value. The default for arm64 is 2048 bytes, but around 1300 is probably a better cut-off to see the worst offenders without too much output overall.

It looks like we are missing a warning flag for the rust compiler, which I would have expected to complain about a >2K stack. I tried passing -Cllvm-args=-fwarn-stack-size=2048, but that doesn't work.
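
For the C side, the kind of thing the warning catches is just a big aggregate used as a local. A contrived example (not from any real driver) that a lowered CONFIG_FRAME_WARN would flag via -Wframe-larger-than=:

```c
/* Contrived example, not from any real driver: with CONFIG_FRAME_WARN
 * lowered, the compiler flags build_submission() via -Wframe-larger-than=. */
struct fw_command {
    unsigned int  opcode;
    unsigned long args[16];           /* 136 bytes per command on 64-bit */
};

struct submit_state {
    struct fw_command cmds[24];       /* 24 * 136 = 3264 bytes */
    unsigned int count;
};

int build_submission(void)
{
    struct submit_state st = { .count = 0 };   /* all of it on the stack */

    /* ... fill in st.cmds ... */
    return (int)st.count;
}
```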

@marcan This kind of pushes a finger into the wound of OS complexity. The wide range of device scales that run the Linux kernel demands more than one single kernel stack size. But it's one thing to have different kernels compiled with different stack sizes; it's a very different thing to enable adjustable sizes within one and the same running kernel.

@marcan I'd rather see a dynamic system like that than force all Linux systems to have 32K kernel stacks everywhere.

The systems I work with are 256 MB and 512 MB total, and I'm very proud to have working desktops on them. Not to mention the 128 MB VMs I use for serving stuff because memory usage really can be that light.

@awilfox If you are so inclined, changing the stack size is one number in the kernel (even if it's not a config option, which it should be). It's normal for niche use cases to demand special kernel configs. The *defaults* should maximize compatibility and robustness, targeting "normal" systems with >=2G of RAM. If you really want tiny stacks, then compiling your own kernel isn't a big ask.

A dynamic system would be better of course, but we're talking a giant bikeshed to happen on LKML vs. changing one number in a .h file (and/or adding a Kconfig option for it). The latter is evidently the more practical short term solution.
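
To be concrete: on arm64 the knob is basically THREAD_SHIFT in arch/arm64/include/asm/memory.h. Simplified (the real file adds adjustments for KASAN and vmap'd stacks), the change amounts to:

```c
/* Simplified sketch of the arm64 definitions; the real
 * arch/arm64/include/asm/memory.h adjusts this for KASAN and
 * CONFIG_VMAP_STACK, but the knob is this one shift value. */
#define THREAD_SHIFT    15                        /* was 14: 16K -> 32K stacks */
#define THREAD_SIZE     (UL(1) << THREAD_SHIFT)   /* bytes per kernel stack */
```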

@marcan How many threads does the Linux kernel spawn typically? Last time I attached a debugger straight after boot it was 900 (I think ZFS was the largest single consumer). The ones for NFS have larger stacks because NFS has some deep calls.

With that many, adding 4 KiB adds almost 4 MiB of wired memory. On a typical desktop, that’s noise and no one cares. For embedded systems (consumer routers and so on) it’s much more of a problem and that’s where the pushback came from last time someone suggested bumping the default.

For the threads associated with userspace threads, it's different. A userspace thread will have at least a page of userspace stack, a kernel thread structure, and typically a page table page for the stack and its guard page (unless they're very densely packed), so the memory overhead of a new thread is quite small. If you have a modern x86 machine with AVX-512, you have around 3 KiB just for the CPU state on context switch (kernel threads don't have FPU state unless they opt in, which most don't).

Java VMs implement N:M threading, so don’t typically create a lot of kernel threads for a lot of Java threads. The same is true of Go.

The NT kernel was designed at a time when a workstation might have only 4 MiB of RAM, and so it wires down just enough of the disk driver to be able to pull pages back in. All of the metadata required to find a page is stored in the invalid PTE. This means that page-table pages can also be paged out, with each step on the page-table walk faulting and bringing back more of the page table until the real page is loaded. Linux and FreeBSD both store extra metadata for paged memory, which is why it's fairly easy to support things like CHERI and MTE, whereas on Windows it requires significant rearchitecting of the virtual memory subsystem. The NT VM subsystem is slightly larger, in lines of code, than a minimal build of the Linux kernel. I completely understand why the NT choices made sense in the early '90s, but I would not encourage anyone to copy them. Needing a few MiB more wired memory in exchange for a drastically simpler and more flexible virtual memory model is absolutely the right trade this century.

@david_chisnall

I don't know where you got the 900 number from. I have 198 on my MacBook with a *ton* of junk you won't find on a consumer router. I expect <100 for a typical embedded use case.

Pushback from embedded systems seems silly. Those all use custom kernels anyway. They can compile with smaller stacks...

@marcan 2-3K for a stack frame sounds kind of insane, what could possibly be taking all that space?

@saagar Heavy inlining plus Rust generally putting everything on the stack unless you specify otherwise. We *already* have fairly deep voodoo magic because some GPU firmware data structures are 32K+ and they literally could not be constructed in Rust the normal way without overflowing the stack immediately.

Visibility into stack frame bloat is limited, so it's hard to figure out where most of it comes from. It could be something as simple as an object with a fixed-size 32-element array of nontrivial objects getting copied around instead of heap-allocated, or just too much inlining with LLVM being too dumb to reuse stack frame space across inlined functions. It would be nice to have better tools to understand this.

But really, in any language, it's not terribly hard to come up with an object that is a KB or two if you nest enough complex structures. Even in C you can accidentally end up doing that.

It's not dumb buffers, we know not to put those on the stack :P
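
To make that concrete, here's the kind of innocent-looking C that quietly eats a couple of KB of stack; a generic illustration, not taken from our driver:

```c
/* Generic illustration, not from our driver: nest a few "small" structs
 * and put an array of them in a local and you're at ~2.3K of stack
 * before doing anything interesting. */
#include <stddef.h>

struct vec4   { float v[4]; };            /* 16 bytes */
struct matrix { struct vec4 rows[4]; };   /* 64 bytes */

struct object {
    struct matrix transform;
    char name[8];
};                                        /* 72 bytes */

struct scene_snapshot {
    struct object objects[32];            /* 32 * 72 = 2304 bytes */
    size_t count;
};

size_t snapshot_count(void)
{
    struct scene_snapshot snap = { .count = 0 };  /* all on the stack */
    return snap.count;
}
```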

@arnd The bigger question is how you figure out *what* is bloating the stack. But yeah, I don't even know if there is a working warning flag for Rust.

My point still stands though: if being a few kB deep into the stack and doing a GFP_KERNEL alloc panics, we have bigger problems lurking than our Rust stacks. I'm pretty sure I could repro this entirely upstream sans Rust with the right setup. The Rust side might have been a few chunky frames, but there were dozens of frames above that as the kernel took a detour into the allocator, through zram, back into the allocator, and into a stack overflow panic.

@corbet Yup, I'm aware, that's part of what prompted this post (that amusingly we're just now running into this ourselves).

@marcan I did some simple analysis earlier this year, playing around with a patch to make the stack artificially smaller at runtime until it crashed, and then running various workloads. The common theme here was clearly getting into memory reclaim from a deep call chain. One idea I had was to change the slab allocator so it would do the reclaim in a separate thread with a fresh stack, but I did not investigate further at that point. Maybe @vbabka has some other ideas here.

@marcan the variable stack allocation does seem promising, and I had similar ideas in the past but never implemented them. What is particularly nice is that on kernels with 4KB pages you can almost always get away with one or two pages instead of four, so it saves physical memory at the same time as allowing larger stacks. With a 16KB page size, it obviously won't save anything.

@arnd @marcan the slab allocator is only a part of reclaim, through generic shrinkers. We already have kswapd as the asynchronous thread that should ideally keep up, but there are no guarantees. I was in the LSF/MM discussion, so no new great ideas now ;) Limiting a growing-stacks feature to arm64/other sane archs (and FRED x86 in the future) could be an option. Manual expansion done before direct reclaim perhaps too, plus pushing back against random drivers thinking they are special and need it as well, so it doesn't proliferate everywhere and defeat the purpose.

@marcan The 900 was from breaking in the kernel debugger and listing threads. This was, I think, a 12.1 kernel. Not sure how it changed in 13 or 14.

@david_chisnall 12.1 is not a kernel version. I'm really confused now.

@vbabka right, it's not slab but __alloc_pages_direct_reclaim() that is in most call chains, e.g. https://pastebin.com/raw/KZWvmhNB for a typical syzkaller report with reduced stack.
The question is whether we can force this to take an asynchronous path (kswapd or something new) all the time to avoid stacking a random fs/blkdev call chain on top of a random kmalloc/alloc_pages/... call.

fw_get_filesystem_firmware() is similarly responsible for most other stack overruns in syzkaller, followed by nl80211().

@arnd direct reclaim already avoids fs writeback, so this is most likely anonymous swapout and thus only I/O. Perhaps it could be sacrificed too on constrained systems.

@vbabka right, I only got one backtrace that ends up in ext4 from __alloc_pages_direct_reclaim, but not in writeback: https://pastebin.com/raw/juKwfnBM
Not sure what happens with swap files (instead of a partition), would that go through fs code?

I guess we could detect constrained threads in __perform_reclaim() by checking the amount of free space on the stack, and instead go through queue_work_on(system_unbound_wq, ...); wait_for_completion(); in order to call try_to_free_pages() if it's too low.
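
Roughly like this, completely untested; stack_left() and RECLAIM_STACK_MIN are stand-ins for whatever check we'd actually use, the rest are existing kernel APIs:

```c
#include <linux/completion.h>
#include <linux/gfp.h>
#include <linux/kernel.h>
#include <linux/swap.h>
#include <linux/workqueue.h>

/* Untested sketch: if the current task is too deep into its stack, run
 * try_to_free_pages() from a workqueue worker (fresh stack) and wait for
 * it, instead of calling it inline from __perform_reclaim(). */
struct reclaim_request {
    struct work_struct work;
    struct completion done;
    struct zonelist *zonelist;
    int order;
    gfp_t gfp_mask;
    unsigned long progress;
};

static void reclaim_workfn(struct work_struct *work)
{
    struct reclaim_request *req =
        container_of(work, struct reclaim_request, work);

    /* Fresh worker stack: the deep reclaim/swap/zram chain happens here. */
    req->progress = try_to_free_pages(req->zonelist, req->order,
                                      req->gfp_mask, NULL);
    complete(&req->done);
}

static unsigned long reclaim_maybe_offstack(struct zonelist *zl, int order,
                                            gfp_t gfp)
{
    struct reclaim_request req = {
        .zonelist = zl, .order = order, .gfp_mask = gfp,
    };

    if (stack_left() > RECLAIM_STACK_MIN)   /* stand-in threshold check */
        return try_to_free_pages(zl, order, gfp, NULL);

    INIT_WORK_ONSTACK(&req.work, reclaim_workfn);
    init_completion(&req.done);
    queue_work(system_unbound_wq, &req.work);
    wait_for_completion(&req.done);
    destroy_work_on_stack(&req.work);
    return req.progress;
}
```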

@arnd I think (but might be wrong) swap files were pre-fallocated and "locked" in place, and we somehow get the respective underlying blocks directly, thus skipping the FS layer.