one of my takeaways from this locking bug writeup https://fly.io/blog/parking-lot-ffffffffffffffff/ is that, yet again, timeouts are a terrible tool that is likely to cause lots of trouble, because they almost always lead to very under-tested codepaths at several layers of the system
("several layers" because something hits a timeout and then several layers up you have to deal with not being able to perform some operation which you probably weren't just trying to perform for fun, though that wasn't the problem in this case.)
and because computers are so nondeterministic in how much time stuff takes (people build random number generators based on how nondeterministic timings are!), the timeout path will be triggered essentially at random under load, and so you've effectively wired up a mostly-untested weird code path to a random number generator
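to make the shape concrete, here's a minimal userspace sketch of what I mean (my own illustration, nothing from the writeup), using pthread_mutex_timedlock; the 2-second budget and the fallback are made up:

/*
 * minimal sketch: a lock with a timeout, where the fallback branch only
 * runs when the current holder happens to be slow, i.e. effectively at
 * random under load.
 */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static int do_update(void)
{
        struct timespec deadline;
        int err;

        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += 2;           /* arbitrary 2-second budget */

        err = pthread_mutex_timedlock(&lock, &deadline);
        if (err == ETIMEDOUT) {
                /*
                 * this is the mostly-untested weird code path: it only runs
                 * when the holder was unexpectedly slow, so whatever recovery
                 * logic lives here is exercised essentially at random.
                 */
                fprintf(stderr, "lock timed out, taking fallback path\n");
                return -1;
        }
        if (err)
                return -1;

        /* ... the actual update would go here ... */

        pthread_mutex_unlock(&lock);
        return 0;
}

int main(void)
{
        return do_update() ? 1 : 0;
}

the point being that nothing in normal testing reliably drives execution into that ETIMEDOUT branch.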
@paulmckrcu speaking as a perfectionist who has never written major system components myself, my ivory tower perspective is that pretty much the only valid use cases for timeouts are:
and I think most other timeouts are misguided hacks that sacrifice correctness to reduce the perceived impact of preexisting bugs.
Anytime I have to look at stuff involving timeouts, it's because timeouts have consequences like "someone used a hook to perform unexpectedly costly work, so the system shoots down that work in the middle and then restarts it from scratch in a loop", or "oh, if this TLB flush takes too long, let's just skip the flush and keep going as if everything's fine", or "an application is doing too much work without responding to the graphical environment for a few seconds, so let's just kill the window".
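the TLB flush example has roughly this shape (a hypothetical sketch, not any real kernel code; send_flush_request(), wait_for_remote_ack() and FLUSH_TIMEOUT_MS are all made up):

/*
 * hypothetical sketch of the "give up and keep going" shape.
 */
static void flush_remote_tlbs(void)
{
        send_flush_request();

        if (!wait_for_remote_ack(FLUSH_TIMEOUT_MS)) {
                /*
                 * timed out: carry on as if the flush happened. correctness
                 * now silently depends on how loaded the other CPUs were.
                 */
                pr_warn_once("remote TLB flush timed out, continuing anyway\n");
        }
}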
@jann @paulmckrcu I strongly suspect timeouts are largely a product of systems being thoughtlessly integrated with the internet, as the easiest way to deal with 'something in the callstack might be relying on something in the network that might be slow/delayed/gone wrong'.
I've noticed how INCREDIBLY unreliable apps are in the face of even minor network issues.
Which is very ironic given the internet was specifically designed for robustness in the face of connectivity issues.
Sort of a lazy hack, very often, yes...
@jann my fave quote (currently very on-topic for us) was:
"When the watchdog bounces a proxy, it snaps a core dump from the process it just killed. We are now looking at core dumps. There is only one level of decompensation to be reached below “inspecting core dumps”, and that’s “blaming the compiler”. We will get there."
@jann @paulmckrcu aren't interruptible/killable sleeping locks in the same category?
@brauner @jann @paulmckrcu omfg you leave my mmap_write_lock_killable() alone!!
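(for anyone not staring at mm/ all day, the usual caller pattern is roughly this; frob_mm() is a made-up name:)

/* rough sketch of the usual caller pattern */
static int frob_mm(struct mm_struct *mm)
{
        if (mmap_write_lock_killable(mm))
                return -EINTR;  /* a fatal signal is pending; back out */

        /* ... modify the address space ... */

        mmap_write_unlock(mm);
        return 0;
}

(the error path just unwinds; the task is about to be killed anyway, so nobody ever really has to look at that -EINTR.)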
@jann @paulmckrcu people really love to add them. Especially in the VFS layer they want more of that. The root cause? Hanging network filesystems...
@brauner @paulmckrcu though a difference is that interruptible/killable locks are sort of simpler, because in that case something else has already decided to interrupt the operation, either to kill a process or to run some work in between and then possibly restart the syscall, while a timeout is itself the source of the interruption and may need a more custom plan for how to recover from not being able to perform the intended operation.
@jann @paulmckrcu They can also be easier to test. Timeout-based locks, especially ones with a timeout the user can't control, would be rather hard to test correctly unless you have some sort of advanced "timeout injection".
@paulmckrcu @brauner it does seem like a good fit for the error injection subsystem to inject things like empty task work or SIGKILL...
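rough sketch of what I'm imagining, assuming CONFIG_FUNCTION_ERROR_INJECTION; struct frob and frob_lock_killable() are made up:

/*
 * mark a killable lock wrapper as error-injectable so tests can force
 * its failure path without any signal being involved.
 */
#include <linux/error-injection.h>
#include <linux/mutex.h>

struct frob {
        struct mutex lock;
};

noinline int frob_lock_killable(struct frob *f)
{
        return mutex_lock_killable(&f->lock);
}
ALLOW_ERROR_INJECTION(frob_lock_killable, ERRNO);

/*
 * then something like (via the fail_function debugfs interface, plus the
 * usual probability/times knobs):
 *   echo frob_lock_killable > /sys/kernel/debug/fail_function/inject
 *   printf %#x -4 > /sys/kernel/debug/fail_function/frob_lock_killable/retval
 * so callers see -EINTR without any signal ever being delivered.
 */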
@paulmckrcu @jann I think that our interruptible/killable locks are barely tested (speaking for the VFS) because most of the time users really don't want to abort an fs operation on a local filesystem. If you initiated a write or a chmod()/chown() or whatever you very likely don't want it to just abort even if it takes a long time.
Now, this might be an entirely different story on a network filesystem of course.
The most likely case where this is "tested" is shutdown...
@paulmckrcu @jann where, e.g., systemd might get tired of waiting for some operation to finish and SIGKILLs the offending task. I'm mostly handwaving of course.
@paulmckrcu @jann And there's a chance that we actually do have tests for that in xfstests. I can never remember because there are oh so many.
@paulmckrcu @jann What I don't like about interruptible/killable is that you bubble up an error from a pretty random point in the callchain that is often (not always) meaningless. IOW, it doesn't actually inform you why something failed (ok ok, let's ignore the EINVAL hell that the kernel is in general for a second), but just that it didn't complete in time.
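Roughly this shape (all of these helpers are made up):

/* sketch of the complaint */
static int frob_setattr(struct inode *inode)
{
        int err;

        err = frob_lock_inode_killable(inode);          /* could be this one */
        if (err)
                return err;

        err = frob_wait_for_writeback_killable(inode);  /* or this one */
        if (err)
                goto out;

        err = frob_update_metadata(inode);              /* or this one */
out:
        frob_unlock_inode(inode);
        return err;     /* caller just sees -EINTR, with no idea from where */
}

By the time that error reaches the caller, the "why" is gone.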
@paulmckrcu @jann There are a few cases where I think it might make sense. For example, it is probably pretty annoying if the user tries to open a file on a frozen filesystem. If that thing stays frozen, they've just hung their task for an indefinite amount of time. So in that case I would even see the argument that if the fs is frozen you should generally be able to interrupt whatever you're doing, but that's just too ugly to maintain and realistically not a problem encountered all that often.