Conversation

one of my takeaways from this locking bug writeup https://fly.io/blog/parking-lot-ffffffffffffffff/ is that, yet again, timeouts are a terrible tool and likely to cause lots of trouble, because they are almost always going to lead to very under-tested codepaths on several layers of the system

("several layers" because something hits a timeout and then several layers up you have to deal with not being able to perform some operation which you probably weren't just trying to perform for fun, though that wasn't the problem in this case.)

and because computers are so nondeterministic in how much time stuff takes (people build random number generators based on how nondeterministic timings are!), the timeout path will be triggered essentially at random under load, and so you've essentially explicitly wired up a mostly-untested weird code path to a random number generator

@jann Timeouts. Can't live with them, can't live without them. ;-)

@paulmckrcu speaking as a perfectionist who has never written major system components myself, my ivory tower perspective is that pretty much the only valid use cases for timeouts are:

  • a remote system is not responding, drop the connection and maybe connect again later
  • a hardware device is not responding, treat it as disconnected and maybe try to set it up from scratch later
  • retransmits inside a network protocol implementation for unreliable transports
  • warn-only timeouts that warn but keep waiting
  • mixing outputs for a local hardware device that has wall-clock time constraints (like compositing framebuffers or mixing audio output), where a timeout just means an input gets skipped for a round

and I think most other timeouts are misguided hacks that sacrifice correctness to reduce the perceived impact of preexisting bugs.

Anytime I have to look at stuff involving timeouts, it's because timeouts have consequences like "someone used a hook to perform unanticipatedly costly work, so the system shoots down that work in the middle and then restarts it from scratch in a loop" or "oh if this TLB flush takes too long let's just skip the flush and keep going as if everything's fine" or "an application is doing too much work without responding to the graphical environment for a few seconds so let's just kill the window".

@jann @paulmckrcu I strongly suspect timeouts are a product of systems thoughtlessly being integrated with the internet as the easiest way to deal with 'oh something in the callstack might be relying on something that might be slow/delayed/gone wrong in the network'.

I've noticed how INCREDIBLY unreliable apps are in the face of even minor network issues.

Which is very ironic given the internet was specifically designed for robustness in the face of connectivity issues.

Sort of a lazy hack very often yes...

@jann my fave quote (currently very on-topic for us) was:

"When the watchdog bounces a proxy, it snaps a core dump from the process it just killed. We are now looking at core dumps. There is only one level of decompensation to be reached below “inspecting core dumps”, and that’s “blaming the compiler”. We will get there."

@jann @paulmckrcu aren't interruptible/killable sleeping locks in the same category?

@brauner @jann @paulmckrcu omfg you leave my mmap_write_lock_killable() alone!!

@jann @paulmckrcu people really love to add them. Especially in the VFS layer they want more of that. The root cause? Hanging networking filesystems...

@brauner @paulmckrcu though a difference is that interruptible/killable locks are sort of simpler, because there something else has already decided to interrupt the operation to either kill a process or to run some work in between and then possibly restart the syscall. while a timeout is the source of the interruption and may need a more custom plan for how to recover from not being able to perform the intended operation.

@jann @paulmckrcu They can also be easier to test. Especially timeout-based locks with a non-user-controllable timeout would be rather hard to test correctly. Unless you have some sort of advanced "timeout injection".

@jann Another use case is periodic polling. Yet another is dealing with latency constraints, for example, compute it the hard and accurate way, but if that takes too long, compute it more quickly and less accurately. Another involves needing to do things that are not permitted in a restrictive context (e.g., interrupt handler), though things like irq work are better on architectures providing this.

All that aside, yes, I have seen timeouts used in cases where investing a little more thought might have provided a better solution. Then again, the same is true of pretty much all other facilities provided by the Linux kernel. ;-)
@brauner @jann Interruptible/killable sleeping locks can be used for latency reduction for the poor user who didn't realize how long things might take.

Of course, there just might be a few interruptible/killable sleep-lock failure paths in need of better testing... ;-)
@brauner @jann To be fair, that does fall into the first of Jann's valid use cases: "a remote system is not responding, drop the connection and maybe connect again later".

@paulmckrcu @brauner it does seem like a good fit for the error injection subsystem to inject things like empty task work or SIGKILL...

@brauner @jann Or a way to fuzz the timers. Though some would argue that the kernel as it is does a pretty good job of this sort of fuzzing...
@jann @brauner Or to recover from the next timeout being unduly delayed, for that matter. That said, in my experience, most timeout handlers unconditionally do their work. On the other hand, I could easily believe that unconditional timeout handlers are much less likely to come to your attention. ;-)
@jann @brauner As always, the big question is "what would have detected the bug earlier and more deterministically?" Finding the extreme corner cases is inherently hard, because that is exactly why we call them "extreme".

Christian Brauner 🦊🐺


@paulmckrcu @jann I think that our interruptible/killable locks are barely tested (speaking for the VFS) because most of the time users really don't want to abort an fs operation on a local filesystem. If you initiated a write or a chmod()/chown() or whatever you very likely don't want it to just abort even if it takes a long time.

Now, this might be an entirely different story on a network filesystem of course.

The most likely case where this is "tested" is shutdown...

@paulmckrcu @jann where e.g., systemd might get tired of waiting for some operation to finish and SIGKILLs the offending task. I'm mostly handwaving of course.

@paulmckrcu @jann And there's a chance that we actually do have tests for that in xfstests. I can never remember because there are oh so many.

@paulmckrcu @jann What I don't like about interruptible/killable is that you bubble up an error from a pretty random point in the callchain that is often (not always) meaningless. IOW, it doesn't actually inform you why something did fail (ok ok, let's ignore the EINVAL hell that the kernel is in general for a second) but just that it didn't complete in time.
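A toy userspace sketch of that complaint (kernel-style negative error returns; every function name here is invented for illustration): by the time `-EINTR` reaches the top of the chain, nothing records which of the interruptible waits below actually bailed out, or what it was trying to do.

```c
#include <errno.h>

/* Deepest frame: an interruptible lock acquisition "fails" because a
 * signal is pending (simulated by the flag). */
static int grab_lock_killable(int interrupted)
{
    return interrupted ? -EINTR : 0;
}

static int update_metadata(int interrupted)
{
    int err = grab_lock_killable(interrupted);
    if (err)
        return err;   /* context ("couldn't lock metadata") is dropped */
    return 0;
}

/* Top of the chain: the caller only ever sees a bare -EINTR, with no
 * hint of which interruptible wait below gave up, or why. */
static int do_write(int interrupted)
{
    int err = update_metadata(interrupted);
    if (err)
        return err;
    return 0;
}
```

Each frame faithfully propagates the error, and precisely because of that, all the call-site context evaporates on the way up.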

@paulmckrcu @jann There are a few cases where I think it might make sense. For example, it is probably pretty annoying if the user tries to open a file on a frozen filesystem. If that thing stays frozen they just hung their task for an indefinite amount of time. So in that case I would even see the argument that if the fs is frozen you should generally be able to interrupt whatever you're doing, but that's just too ugly to maintain and realistically not a problem encountered all that often.
