@ljs I like this idea of maintaining your documentation
@ljs munmap() being able to fail with -ENOMEM is also kind of a joke
@ljs ooh, I didn't know that madvise_vma_behavior has code to explicitly transmute -ENOMEM (including due to VMA splitting hitting the max vma count) into -EAGAIN, that's... funny
@ljs only madvise_populate() seems to actually return -ENOMEM for -ENOMEM?
@ljs manpage says "EAGAIN A kernel resource was temporarily unavailable", that's very optimistic in the split_vma() case
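(A minimal sketch of what that means for callers, assuming the behavior described above: since an EAGAIN from madvise() may stand in for an -ENOMEM that will never go away, such as hitting the VMA count limit, a retry loop should be bounded. The helper name `madvise_bounded_retry` and its `max_tries` parameter are made up for illustration and are not part of any real API.)

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: call madvise() and retry on EAGAIN, but only
 * max_tries times, because EAGAIN is not guaranteed to be transient.
 * Returns 0 on success or a negative errno value on failure.
 */
int madvise_bounded_retry(void *addr, size_t len, int advice, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        if (madvise(addr, len, advice) == 0)
            return 0;
        if (errno != EAGAIN)
            return -errno;   /* a different error: report it as-is */
    }
    return -EAGAIN;          /* still failing: treat it as a hard failure */
}
```

(A caller that gets -EAGAIN back from this would handle it the same way it handles -ENOMEM rather than spinning forever.)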
@ljs I think Linus has said before that the rule is not that you can't change uapi, the rule is you can't change uapi in a way that will actually break real userspace programs, or something like that? I don't remember his exact phrasing.
@ezhes_ @ljs no, super page splitting always works, Linux keeps preallocated reserve page tables for that around.
The issue is that when you call munmap() on the middle of a VMA, the VMA object has to be temporarily split into three (and then the middle VMA gets removed and two remain), and this requires allocating new VMA objects and also, more relevantly, that you have not already hit the maximum limit on number of VMAs a process is allowed to have.
You might think calling munmap() on the middle of a VMA is weird, but actually you almost can't avoid it because Linux deliberately merges adjacent VMAs when possible. So if you call mmap() three times with sufficiently matching arguments, and then you call munmap() on the second allocation, that is likely to actually end up splitting a VMA.
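(A rough sketch of the split, with some assumptions: it maps three pages in a single mmap() call instead of three separate calls, so that it does not depend on the merge actually happening, and it counts VMAs as lines in /proc/self/maps. The function names are made up for illustration.)

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count lines in /proc/self/maps; each line describes one VMA. */
static int count_vmas(void)
{
    char buf[4096];
    int fd = open("/proc/self/maps", O_RDONLY);
    int lines = 0;
    ssize_t n;

    if (fd < 0)
        return -1;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        for (ssize_t i = 0; i < n; i++)
            if (buf[i] == '\n')
                lines++;
    close(fd);
    return n < 0 ? -1 : lines;
}

/*
 * Hypothetical helper: map three anonymous pages, unmap the middle one,
 * and return the change in VMA count caused by that munmap() (or -1 on
 * error). The hole forces the kernel to split the surrounding VMA.
 */
int vma_delta_after_middle_unmap(void)
{
    long page = sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int before, after;

    if (p == MAP_FAILED)
        return -1;
    before = count_vmas();
    if (munmap(p + page, page) != 0) {   /* this is the call that can fail */
        munmap(p, 3 * page);
        return -1;
    }
    after = count_vmas();
    munmap(p, 3 * page);                 /* clean up the two remaining pages */
    if (before < 0 || after < 0)
        return -1;
    return after - before;
}
```

(The munmap() of the middle page is the one that needs a new VMA object, so it is the call that can fail with ENOMEM once the process is at its VMA limit.)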
@ezhes_ @ljs (well, to be exact, there are two ways to get high-order userspace page mappings on Linux - I was describing the behavior of THP. hugetlb mappings are different, they're esoteric weird stuff for fancy big database software and have lots of special separate codepaths, and iirc they just don't allow you to split mappings within high-order pages at all.)
@ezhes_ @ljs (to illustrate how weird hugetlb is: that thing actually has code for sharing _page tables_ between processes, like, you can have the same L2 page table pointed to by the L3 page tables of two different processes or something like that, so that you need less memory for page tables of multiple processes mapping the same hugetlb file or something, idk. AFAIK nothing else in Linux does that with page tables for userspace memory)
@ljs noticed the very same thing in the same manual, except for the EINVAL ("oh come on, *what* in particular did i get wrong. alignment? my pages are not even 16k!")
i need to take a break from fedi, because obviously our manual readings and menstrual cycles have synchronized
that's round about it, yes. One of the quotes on https://docs.kernel.org/process/handling-regressions.html might be the one from Linus about it.
@jann @ezhes_ @ljs Some people really want this e.g. for infallible destructors. It does kind of make sense from a holistic OS design perspective to always be able to tear things down without the risk of errors (assuming you're using the interface correctly in the first place, of course). The question is just whether you're in a state to continue anyway at that point. I don't really have a concrete security angle to justify it with either.