Conversation
madvise() is certifiably insane.

Hey you can map over gaps! And the action will be applied but we'll return -ENOMEM!

Also if we run out of memory, we'll also return -ENOMEM!

What do you mean how do you tell the difference? Lol! Bye!
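
A minimal sketch of the complaint above, assuming Linux and the glibc `madvise()` wrapper. The helper name `madvise_over_gap` is made up for illustration. It maps three anonymous pages, unmaps the middle one, and calls `madvise(MADV_DONTNEED)` across the whole range. On the kernels discussed here the call is expected to fail with ENOMEM even though the advice was applied to both mapped pages, which the helper checks by reading them back as zero.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns the errno reported by madvise() over a
 * range with a hole in it (0 if madvise() succeeded, -1 if the setup
 * failed). *applied is set to 1 if MADV_DONTNEED visibly took effect on
 * both pages that were still mapped.
 */
int madvise_over_gap(int *applied)
{
    long page = sysconf(_SC_PAGESIZE);
    unsigned char *p = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xaa, 3 * page);            /* dirty all three pages */
    if (munmap(p + page, page) != 0) {    /* punch the gap */
        munmap(p, 3 * page);
        return -1;
    }

    errno = 0;
    int rc = madvise(p, 3 * page, MADV_DONTNEED);
    int err = (rc == 0) ? 0 : errno;

    /* Private anonymous pages read back as zero after MADV_DONTNEED. */
    *applied = (p[0] == 0 && p[2 * page] == 0);

    munmap(p, page);
    munmap(p + 2 * page, page);
    return err;
}
```

The errno alone does not tell the caller whether the kernel walked over a gap or genuinely ran out of memory, which is the ambiguity being complained about.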
I'd say I'd change this, but it's user-facing so we literally can't.
you can tell this is very sane because the man page has 4 entries for ENOMEM - https://man7.org/linux/man-pages/man2/madvise.2.html

But it's incomplete, because any operation that actually runs out of memory can also return -ENOMEM.

We really need to bring man pages into the kernel and, re: recent posts about maintainership of that, people need to update the man pages as we change things (thus solving the funding problem)

@ljs I like this idea of maintaining your documentation


@ljs munmap() being able to fail with -ENOMEM is also kind of a joke


@ljs ooh, I didn't know that madvise_vma_behavior has code to explicitly transmute -ENOMEM (including due to VMA splitting hitting the max vma count) into -EAGAIN, that's... funny


@ljs only madvise_populate() seems to actually return -ENOMEM for -ENOMEM?


@ljs manpage says "EAGAIN A kernel resource was temporarily unavailable", that's very optimistic in the split_vma() case


@ljs I think Linus has said before that the rule is not that you can't change uapi, the rule is you can't change uapi in a way that will actually break real userspace programs, or something like that? I don't remember his exact phrasing.


@jann @ljs I wonder how you'd actually hit that. Maybe if you munmap in the middle of a super page, you'd have to allocate page tables to make a hole, and that can then fail due to not having enough memory?


@ljs well it says "mad" right there in the name...


@ezhes_ @ljs no, super page splitting always works, Linux keeps preallocated reserve page tables for that around.
The issue is that when you call munmap() on the middle of a VMA, the VMA object has to be temporarily split into three (and then the middle VMA gets removed and two remain), and this requires allocating new VMA objects and also, more relevantly, that you have not already hit the maximum limit on number of VMAs a process is allowed to have.
You might think calling munmap() on the middle of a VMA is weird, but actually you almost can't avoid it because Linux deliberately merges adjacent VMAs when possible. So if you call mmap() three times with sufficiently matching arguments, and then you call munmap() on the second allocation, that is likely to actually end up splitting a VMA.
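
A rough sketch of the merge-then-split behaviour described above, assuming Linux with `/proc` mounted. The helper names `count_vmas` and `vma_delta_after_middle_munmap` are hypothetical. It makes three adjacent one-page `mmap()` calls using address hints, which is an assumption made here to get adjacency reliably, then unmaps the second one and reports how the VMA count changed. If the three mappings had stayed separate, the count would drop by one; a result of +1 suggests they were merged and the `munmap()` had to split.

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define DEMO_FAILED (-100L)   /* sentinel for setup errors in this sketch */

/* Count VMAs as lines in /proc/self/maps. A stack buffer is used so that
 * counting does not allocate and thereby change the answer. */
static long count_vmas(void)
{
    char buf[4096];
    long lines = 0;
    ssize_t n;
    int fd = open("/proc/self/maps", O_RDONLY);
    if (fd < 0)
        return -1;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        for (ssize_t i = 0; i < n; i++)
            if (buf[i] == '\n')
                lines++;
    close(fd);
    return lines;
}

/* Returns (VMA count after munmap) - (VMA count before munmap). */
long vma_delta_after_middle_munmap(void)
{
    long page = sysconf(_SC_PAGESIZE);

    /* Probe for a free three-page range, then release it again. */
    char *base = mmap(NULL, 3 * page, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return DEMO_FAILED;
    munmap(base, 3 * page);

    /* Three separate mmap() calls with matching arguments. */
    for (int i = 0; i < 3; i++) {
        char *want = base + i * page;
        char *got = mmap(want, page, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (got != want)
            return DEMO_FAILED;   /* hint not honoured; give up */
    }

    long before = count_vmas();
    if (munmap(base + page, page) != 0)   /* unmap the second allocation */
        return DEMO_FAILED;
    long after = count_vmas();

    munmap(base, page);
    munmap(base + 2 * page, page);
    if (before < 0 || after < 0)
        return DEMO_FAILED;
    return after - before;
}
```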


@ezhes_ @ljs and you only get something like 65k VMAs per process in the default config, iirc


@ezhes_ @ljs so like, if you call mmap() 100000 times to make 4K-sized anonymous memory allocations, that'll probably work, but if you then try to munmap() every second allocation, you should start seeing -ENOMEM at some point
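
The experiment suggested above could be sketched roughly as follows. This is a simplification under stated assumptions: it uses one large `PROT_NONE` reservation instead of 100000 separate `mmap()` calls (the merged result should look the same to `munmap()`), and the helper name `first_failing_munmap` is hypothetical. It unmaps every second page and reports the first page index at which `munmap()` fails with ENOMEM.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns the index of the first page whose munmap() failed with ENOMEM,
 * -1 if every munmap() succeeded, or -2 on any other error.
 */
long first_failing_munmap(size_t n_pages)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t len = n_pages * (size_t)page;

    /* PROT_NONE keeps this cheap: only address space is reserved. */
    char *base = mmap(NULL, len, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED)
        return -2;

    long failed_at = -1;
    /* Each munmap() in the middle of a VMA splits it, adding one VMA. */
    for (size_t i = 1; i < n_pages; i += 2) {
        if (munmap(base + i * (size_t)page, (size_t)page) != 0) {
            failed_at = (errno == ENOMEM) ? (long)i : -2;
            break;
        }
    }

    munmap(base, len);   /* tear down whatever is left */
    return failed_at;
}
```

With a small argument such as 64 pages, no failure is expected. With something like 200000 pages and the roughly 65k default limit mentioned in the thread, a positive index would be the expected outcome, but the limit is tunable, so the exact result depends on the system.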


@ezhes_ @ljs (well, to be exact, there are two ways to get high-order userspace page mappings on Linux - I was describing the behavior of THP. hugetlb mappings are different, they're esoteric weird stuff for fancy big database software and have lots of special separate codepaths, and iirc they just don't allow you to split mappings within high-order pages at all.)


@ezhes_ @ljs (to illustrate how weird hugetlb is: that thing actually has code for sharing _page tables_ between processes, like, you can have the same L2 page table pointed to by the L3 page tables of two different processes or something like that, so that you need less memory for page tables of multiple processes mapping the same hugetlb file or something, idk. AFAIK nothing else in Linux does that with page tables for userspace memory)


le petit printf 🇺🇦🇨🇿👃💨

@ljs noticed the very same thing in the same manual, except for the EINVAL ("oh come on, *what* in particular did i get wrong. alignment? my pages are not even 16k!")

i need to take a break from fedi, because obviously our manual readings and menstrual cycles have synchronized


@jann @ljs

that's round about it, yes. One of the quotes on https://docs.kernel.org/process/handling-regressions.html might be one quote from Linus about it.

@ljs question ideas for job interviews: describe differences and similarities of MADV_REMOVE and FALLOC_FL_PUNCH_HOLE ;-)
@jarkko yeah that is a sucky interview question, more a 'how the kernel names things inconsistently' cautionary tale

@jann @ezhes_ @ljs I have a patch that adds a new mmap flag to create unmergeable VMAs, the thought being that if it can't be merged you'll always be able to unmap it. I never submitted it though 😕 Couldn't make up my mind about whether it was a good idea in the end.


@jann @ezhes_ @ljs Some people really want this e.g. for infallible destructors. It does kind of make sense from a holistic OS design perspective to always be able to tear things down without the risk of errors (assuming you're using the interface correctly in the first place, of course). The question is just whether you're in a state to continue anyway at that point. I don't really have a concrete security angle to justify it with either.

@jann not really, that's more like 'mm works differently than you expect' and 'extreme memory pressure can result in bad things happening'
@vegard @jann @ezhes_ the situation under which this can happen is under extreme memory pressure, and not cleaning up munmap()'s is the least of your concerns in that situation.

An 'unmergeable' VMA would still have this problem as you could munmap() or mprotect() (or MAP_FIXED mmap() over it) in the middle and split it.
@jann @ezhes_ I mean, I don't think VMAs merging is the only reason you'd unmap in the middle, for instance it's super common to mmap() PROT_NONE a region and then MAP_FIXED mmap() over it which will cause splits.
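
The reserve-then-overlay pattern mentioned above can be sketched like this, assuming Linux with `/proc` mounted. The helper names `count_vmas` and `vma_delta_after_fixed_overlay` are hypothetical. It reserves three pages with `PROT_NONE`, then maps a readable and writable page over the middle with `MAP_FIXED`, and reports how many VMAs that added. Two extra VMAs would be the expected result, since the reservation ends up split around the new mapping.

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define DEMO_FAILED (-100L)   /* sentinel for setup errors in this sketch */

/* Count VMAs as lines in /proc/self/maps, without allocating. */
static long count_vmas(void)
{
    char buf[4096];
    long lines = 0;
    ssize_t n;
    int fd = open("/proc/self/maps", O_RDONLY);
    if (fd < 0)
        return -1;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        for (ssize_t i = 0; i < n; i++)
            if (buf[i] == '\n')
                lines++;
    close(fd);
    return lines;
}

/* Returns (VMA count after the MAP_FIXED overlay) - (count before). */
long vma_delta_after_fixed_overlay(void)
{
    long page = sysconf(_SC_PAGESIZE);

    /* Step 1: reserve address space without making it accessible. */
    char *base = mmap(NULL, 3 * page, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return DEMO_FAILED;

    long before = count_vmas();

    /* Step 2: commit the middle page. MAP_FIXED implies unmapping what
     * was there, which splits the reservation on both sides. */
    char *mid = mmap(base + page, page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (mid == MAP_FAILED) {
        munmap(base, 3 * page);
        return DEMO_FAILED;
    }

    long after = count_vmas();
    munmap(base, 3 * page);
    if (before < 0 || after < 0)
        return DEMO_FAILED;
    return after - before;
}
```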

Liam's recent series changes how unmap is done, but the splitting is still done up front (and, under extreme memory pressure can result in the split remaining! This is the case with current impl too afaict).

However, in future there are plans to be able to unwind the split and to preallocate a VMA for it, so you would actually hit the out-of-memory condition before you even started, leaving no 'mess' behind.

In practice, what seems to be more of a problem is having _more_ VMAs, which makes VMA merging super critical, especially given the map limit, and even more especially when you run Hogwarts Legacy, apparently, which maps tons and tons of stuff.

A certain somebody may have a talk at LPC about how to reduce VMA counts when using guard pages (as usual - all credit for the idea goes to @vbabka I am the humble implementor) - which helps with this.

But yeah it's fucking weird, not denying that haha
@jann @ezhes_ @vbabka You could argue that we should have memory in reserve for this but keep in mind we already have a bunch of reserves, and for this to happen, they went a bye bye.

Hang on we could use __GFP_NOFAIL! That's pretty popular and uncontroversial... but hm wouldn't a system under a ton of memory pressure potentially be hitting this a lot and ugh fuck this

I'm off to the woods

I'm becoming a goat
@ljs @jann @ezhes_ I think the mmap limit sysctl is the only practical concern as any kernel allocations involved here would be handled by the "too small to fail" rule and either retry until succeeding, or die as an oom kill victim. Which is exactly why we can't abandon this rule (as Linus pointed out at some point) as suddenly userspace could start getting -ENOMEMs where it didn't use to
@vbabka @ezhes_ @jann Even if the allocation failed, it would fail because of such extreme memory pressure that failing to unmap memory at that point in a process would be the least of your concerns.

You can also have an issue with maple tree preallocation failure by the way, and a few other allocations.

There are other ways this can fail too though. I mean trivially, if you are mapping something 'special' that implements vm_ops->may_split() this can cause the split to fail.

Also new mseal (gawd) can make it fail... uffd does some 'unmap prep' thing that can die too.

If you're mmap()'ing with MAP_FIXED over an existing region, which implies an unmap, you have a hook call_mmap() which can end up doing something that breaks.

Prior to Liam's changes, if a MAP_FIXED mmap() failed, you'd just clear down the region. But now, it tries to actually unwind the operation and _nothing is unmapped_ until you know you can definitely do it.

(Motivation for this is to later be able to do this stuff under RCU and for nobody to briefly see an 'impossible' situation which would be rather confusing).

HOWEVER, if there's some insane low memory availability scenario which prevents the unwinding due to inability to preallocate maple tree nodes for instance, we still do clear down.

Generally it's all a bit of a minefield...
@vbabka @ezhes_ @jann But TL;DR: agreed, the most likely thing you'll ever hit there is the map limit
@vegard @jann @ezhes_ @ljs wouldn't it be sufficient these days to... mseal() it?
/me hides
@vbabka @vegard @ezhes_ @jann yeah I was going to say, but like, I didn't want to say.

But mseal() doesn't prevent splits anyway; it only prevents merging, except with other mseal()'d VMAs.

Unfortunately let's say there's some 'confusion' about mseal

@vegard @ezhes_ @ljs that would probably also make you run out of vmas faster though


@vegard @ezhes_ @ljs what does "correctness" mean here? If your userspace code treats running out of VMAs or being unable to munmap() as fatal conditions that cause process exit, then not merging VMAs will just make your process die earlier.

@jann @vegard @ezhes_ also unmergeability doesn't prevent you from needing to allocate to unmap...

You would need it to be 'unsplittable' then and I don't see what the use is, you're sort of abusing mm to solve a problem that isn't really a problem.

The odds of allocation failure happening in anything but totally extreme circumstances, where not munmap()ing will be the least of your problems, are basically zero.

Obviously you might hit VMA limit, but that's also a totally broken state and inventing a new VMA flag to avoid that doesn't make any sense to me.

So yeah I may be missing something but this doesn't make any sense to me :(