Jeremy Allison writes:
"The data shows that “frozen” vendor #Linux kernels, created by branching off a release point and then using a team of engineers to select specific patches to back-port to that branch, are buggier than the upstream “stable” Linux #kernel created by Greg Kroah-Hartman."
https://ciq.com/blog/why-a-frozen-linux-kernel-isnt-the-safest-choice-for-security/ #LinuxKernel
@kernellogger On the other hand, what's the difference between a distro branching off and backporting stuff from mainline and upstream stable branching off and backporting stuff from mainline? Why can the upstream stable maintainers do this and "a team of engineers" cannot? I think the difference could be better characterized and (if you pardon the expression) makes all the difference.
@kernellogger as usual, the point is not that these are bug free, but that they are regression free. The kernel upstream releases break userspace on every new release, and kernel maintainers don't care. See https://github.com/torvalds/linux/commit/a1912f712188291f9d7d434fba155461f1ebef66 for example, as Daan just found out, which removed a mount option without caring that it is still being used, so since 6.8 every btrfs device can no longer be mounted by systemd
@gregkh @kernellogger @vegard
If I am reading the whitepaper correctly, the count of "missing fixes" includes fixes for subsystems that are disabled and not shipped in the RHEL kernel, of which there are many:
"In addition, some of these bugs may be in code paths that are disabled via kernel config file settings. No analysis has been done on which bugs may be enabled or disabled for a specific vendor kernel config."
Not restricting the analysis to shipped code makes it very hard to take the paper seriously.
In filesystems alone, there are over 1500 upstream "Fixes" commits for filesystems which are not shipped in RHEL8.8.
That's fully 1/3 of the 4594 "unfixed bugs" they cite.
Am I missing something?
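A minimal sketch of the restriction argued for here: count a fix as "missing" only if it touches code that the vendor config actually builds. The commit records, the directory-to-option mapping, and the helper names are all hypothetical and made up for illustration; they are not the paper's data or RHEL's real config.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FixCommit:
    sha: str      # hypothetical commit id
    paths: tuple  # files touched by the fix


# Hypothetical mapping from a source directory to the option that builds it.
DIR_TO_OPTION = {
    "fs/btrfs/": "CONFIG_BTRFS_FS",
    "fs/xfs/": "CONFIG_XFS_FS",
    "fs/ext4/": "CONFIG_EXT4_FS",
}


def option_for(path):
    """Return the config option guarding a path, or None if unknown."""
    for prefix, option in DIR_TO_OPTION.items():
        if path.startswith(prefix):
            return option
    return None


def is_shipped(commit, enabled_options):
    """True if at least one touched path is built under the given config.

    Paths without a known option are conservatively treated as shipped.
    """
    for path in commit.paths:
        option = option_for(path)
        if option is None or option in enabled_options:
            return True
    return False


def split_fixes(commits, enabled_options):
    """Split fixes into (relevant to shipped code, not shipped at all)."""
    shipped = [c for c in commits if is_shipped(c, enabled_options)]
    not_shipped = [c for c in commits if not is_shipped(c, enabled_options)]
    return shipped, not_shipped
```

Only the first list would be a fair input to a "missing fixes" count for that config; by the figures above, the second list would hold roughly a third of the 4594 commits for filesystems alone.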
@kernellogger That's to be expected, but it is also not the point of them.
I agree they shouldn't need to exist, but the realities of how many an organization manages its IT necessitate their existence.
The industry doesn't want to go through the withdrawal phase of building a better world.
@sandeen @gregkh @kernellogger @vegard haven't read the paper yet.
A few things: RHEL kernel developers do not backport all security fixes. We do not mandate backporting fixes rated moderate or lower (unless they are required for FedRAMP), because customers pay for EUS (which 8.8 is) for stability reasons, and backporting certain security fixes may affect stability for this limited-term release.
@sandeen @gregkh @kernellogger @vegard
Also "In addition, some of these bugs may be in code paths that are disabled via kernel config file settings. No analysis has been done on which bugs may be enabled or disabled for a specific vendor kernel config."
This is a big gap. There are a lot of things we do not support, like btrfs, which is disabled in RHEL.
@ljs @kernellogger @sandeen @vegard @gregkh Regressions is a good point. I just had to report an rt networking issue we found in RHEL8 that was also in upstream. It was verified, a fix was written for preempt-rt, and Linus even pulled it in an rc. This fix was cleanly cherry-picked back to RHEL9 and 8. Our backports can be so good that our testing on 5+ year old RHEL8 finds issues in the upstream kernel.
@ljs @kernellogger @sandeen @vegard @gregkh this also while maintaining a limited kABI guarantee that our customers need for hardware and software compatibility. (Don't get me wrong, I would be very happy if all that was upstreamed.) This is one of the reasons why customers prefer and pay for our kernels.
@gregkh @kernellogger @vegard As an example, I'm looking into the NXP SDK for their QorIQ Layerscape SoCs. Their released Yocto-based system is based on an old revision (3-4 revisions out of date), and the latest they seem to have in their public git isn't much better: it's based on a version that gets EOLed this month.
Kernel-wise, the latest trees I've found are based on the v6.1.y stable tree, but the latest version merged in is v6.1.55. I believe the upstream kernel tree is currently on v6.1.91.
@gregkh @kernellogger @vegard Actually, that's not quite right. The latest kernel tree they have is based on v6.6.y; however, that's not in one of their releases and also isn't up to date with stable releases.
Well, to claim "kernel maintainers don't care" you have to at least report the bug to them[1]. That afaics has not happened yet (or I could not find it).
"since 6.8 every btrfs device can no longer be mounted by systemd": then why was this only noticed 2+ months after a release with that commit went out? This raises the question: what kind of problem did users actually run into?
[1] yes, sure, ideally they would have done a code search first, but we are all imperfect…
@bluca @kernellogger It has been deprecated for three years according to the commit message?
what is considered "deprecated" by the developers afaics does not matter much when it comes to Linus' interpretation of the Linux kernel's "no regressions" rule.
At the same time there is neither a stable API nor a stable ABI, so things are free to change (as in the case of the culprit), as long as nothing breaks.
reg. the "distros want no-regressions, not no-bugs":
from my point of view the whole situation could be a lot better if distros took some of the money they currently invest in CI and instead invested it in workflow improvements and other measures that ensure regressions do not happen in the first place or are quickly resolved.
@kernellogger @pavel I’d be interested in knowing how you would improve the workflows. What’s missing, what can be improved and what shouldn’t be done. I would love to help with this however I can. :)
@kernellogger well, the kernel doesn't have a bug tracker - not for real anyway, bugzilla.kernel.org might as well be pointed to /dev/null, so no idea what "reporting" would even mean in this case. I do not use BTRFS so I am not affected, just sharing what was reported to me. It looks like it was reported against the Debian kernel package too now: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071420
reg. bug reporting:
https://docs.kernel.org/admin-guide/reporting-issues.html
https://docs.kernel.org/admin-guide/reporting-regressions.html
Some of it does not apply in this case.
I also make sure to handle regressions that are submitted to bugzilla.kernel.org
@kernellogger@fosstodon.org @bluca@fosstodon.org but following upstream, passing patches through, and maintaining downstream compatibility is the duty of middleware like systemd, which sits between the kernel and the user interface, no?
and thx for the link to the debian bug tracker; but I want to see more details first what went wrong there, as I'd expect it would be pretty unlikely that this is the first debian btrfs user that updated to 6.8 or higher; so why did it break for that user, but apparently not for the others?
@kernellogger 6.8 has just arrived in Debian unstable 3 days ago: https://tracker.debian.org/pkg/linux
@vbabka @bluca @kernellogger I've just updated my Arch installation with (encrypted) btrfs root. Seems to be no problem.
@d4nuu8 @vbabka @kernellogger I have no clue, I don't use BTRFS, I just get bug reports
There is no easy answer here, as there are lots of details; but there is a decent chance I need to write this up soon anyway; if I do, I'll get back to you!
@kernellogger @bluca I can confirm 6.9-rc5 is running just fine for me with openSUSE and a btrfs root filesystem on my main laptop so it looks like this may be specific to something Debian did.
@kernellogger I'm afraid I can't support the counting methodology in the paper either. Besides the fixes that are not applicable because of config, which the RH people cite, there's also the fact that not everything that has a cc: stable tag is an exploitable bug. Plus every fix backported carries risk (just look at the number of regressions in stable due to backports), so that risk has to be set against the benefit of the backport. A general rule would be: if it's not exploitable, don't backport it.
@kernellogger @DanielMicay always warned against using frozen or LTS releases because of missing critical security patches, but it seems most people are still ignorant. 🙃
@triskelion @kernellogger I didn't warn against using the upstream LTS branches although the older ones do get much less backported.
@kernellogger this is now being reverted, fortunately: https://lore.kernel.org/all/44c367eab0f3fbac9567f40da7b274f2125346f3.1716285322.git.wqu@suse.com/
thx, yeah, I already have been watching that.
1/ FWIW, I think you owe the kernel developers an apology, as you made a lot of noise and claimed "kernel maintainers don't care", when they clearly do once the problem was properly reported -- and quite quickly even. And yes, sure, in the ideal world they would have cared some more and performed a code-search before removing this option to prevent it in the first place. But we are all imperfect and make mistakes. Same for @pid_eins, who…
2/ …wrote "And my main beef here is that they claim they wouldnt do it ever..."[1], as that is not even true. They often try changes or removals to see if it breaks something – and if it does, it's reverted. Even the removal of the support for the original i386 was handled like that by Linus himself.
@kernellogger @bluca sure, but then the rule is not "we never break userspace" but more "move fast and break things, and sometimes revert where people protest too loudly".
I mean, that's fine by me, but maybe they should communicate it like that then.
The thing is that removing a widely documented mount option is very *obviously* a compat breakage. You cannot discount that. It's not just a "mistake" to remove something like that, it's an *obvious* attempt to break compat.
@pid_eins @kernellogger @bluca yeah in graphics we go with a 10 year delay for the obvious compat breakages
so either wait 10 years after the last known user was updated to the new interfaces (where we know of them, which is the usual case since it's all open source)
or 10 years after the replacement shipped for more script interfaces like some of the stuff in sysfs
@pid_eins @kernellogger @bluca 10 years seems to be enough where the only people you would end up breaking are those who don't upgrade kernels anyway, ever
@pid_eins @kernellogger @bluca of course there have been screw-ups and misses. but when those happen we try to put the references to the relevant userspace we broke into the reverts, so that people can start the 10 year clock at the right time
Actually, the exact relevant rule is "WE DO NOT BREAK USERSPACE", all in uppercase.
https://lkml.org/lkml/2012/12/23/75
I find the sound of that mail quite different from your much weaker "let's maybe undo the worst shit if people complain too loudly"... And of course "uh, sometimes we fucked up so hard, we cannot fix it anymore, let's add a new api instead" (which is what happened in the block device capabilities/media change api).
(again, I actually find it OK if API is broken from time to time, just be honest about it, and communicate properly, and do a bit of research first. Don't claim that uppercase extremism and then do not even superficially follow through)
hmmm:
$ grep -ri 'no regressions' Documentation/ | wc -l
13
$ grep -ri 'not break userspace' Documentation/ | wc -l
0
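For anyone who wants to repeat that count from a script, here is a rough Python sketch of what `grep -ri PHRASE DIR | wc -l` does for a literal phrase: it counts matching lines, not files. The function name is made up for this example, and it can differ slightly from grep on binary files, which grep reports as a single line.

```python
import os


def count_matching_lines(root, phrase):
    """Count lines under root that contain phrase, ignoring case."""
    needle = phrase.lower()
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # errors="replace" keeps odd encodings from aborting the scan
                with open(path, encoding="utf-8", errors="replace") as fh:
                    total += sum(1 for line in fh if needle in line.lower())
            except OSError:
                continue  # unreadable file: skip it, as grep would complain and move on
    return total
```

Run from the top of a kernel tree, `count_matching_lines("Documentation", "no regressions")` should land close to the grep number for that same tree.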
Also:
"WE DO NOT BREAK USERSPACE": 2 hits – https://lore.kernel.org/all/?q=f%3ATorvalds+%22WE+DO+NOT+BREAK+USERSPACE%22
"no regressions": 44 hits – https://lore.kernel.org/all/?q=f%3ATorvalds%20%22no%20regressions%22
@kernellogger @pid_eins My impression, having more of an outside perspective and working with higher level languages: should deprecations perhaps always be gated with a config flag, perhaps even a common one similar to BROKEN?
With Java/Scala, it's always quite clear for me where deprecated methods are used. Also I can have builds fail due to that or not, so that I notice new deprecations when building / in CI.
there are various things that can work, and I guess it depends on the situation what is reasonable and effective.
For the kernel I sometimes think "add delays (together with a msg in the logs) that grow longer and longer over time when people use deprecated stuff; at some point people get curious and will investigate" might be something that could help. OTOH it's a kind of stupid idea 😂
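To make the "growing delay" idea concrete, here is a toy userspace sketch in Python. It is not kernel code, and nothing like it exists upstream as far as this thread says. The class name, the parameters, and the choice to grow the delay per use (rather than per release) are all assumptions made for illustration.

```python
import time
import warnings


class SlowDeprecation:
    """Warn on every use of a deprecated feature and stall a bit longer each time."""

    def __init__(self, name, base_delay=0.01, growth=2.0, max_delay=1.0,
                 sleep=time.sleep):
        self.name = name
        self.base_delay = base_delay  # delay applied on the first use, in seconds
        self.growth = growth          # multiplier applied after each use
        self.max_delay = max_delay    # upper bound so callers never hang forever
        self.sleep = sleep            # injectable so tests need not really wait
        self.uses = 0

    def next_delay(self):
        """Delay the next use will incur, capped at max_delay."""
        return min(self.base_delay * self.growth ** self.uses, self.max_delay)

    def use(self):
        """Record one use: log a warning, then stall for the current delay."""
        delay = self.next_delay()
        self.uses += 1
        warnings.warn(
            f"{self.name} is deprecated; this call was delayed by {delay:.3f}s",
            DeprecationWarning,
            stacklevel=2,
        )
        self.sleep(delay)
        return delay
```

The cap matters: without it the feature is effectively removed after a few dozen uses, which is exactly the kind of silent breakage the delay was meant to avoid.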
@DanielMicay Greg KH's comment on this thread: https://social.kernel.org/objects/d1b25397-f5df-4bbc-9c5f-3abab89c5597 :-)