social.kernel.org

Gabriele Svelto [moved]

gabrielesvelto@fosstodon.org

Memory errors in consumer devices such as PCs and phones are not something you hear much about, yet they are probably one of the most common ways these machines fail.

I'll use this thread to explain how this happens, how it affects you and what you can do about it. But I'll also talk about how the industry failed to address it and how we must force them to, for the sake of sustainability. 🧵 1/17

Gabriele Svelto [moved]

gabrielesvelto@fosstodon.org

Reply to @gabrielesvelto@fosstodon.org

First of all let's talk briefly about how memory works. What you have in your PC or phone is what we call dynamic random access memory. That is memory that stores bits by putting a minuscule amount of charge into vanishingly small capacitors (or not putting it in if we're storing a zero).

These capacitors continuously leak this charge, so it needs to be refreshed periodically - every few tens of milliseconds - which is why it's called "dynamic". 2/17

Gabriele Svelto [moved]

gabrielesvelto@fosstodon.org

Reply to @gabrielesvelto@fosstodon.org

This design is *extremely* analog in nature. When your machine needs to read some bits the capacitors holding them are connected to a bunch of wires. The very small voltage difference that happens in the wire is detected by the use of a circuit that turns it into a clear 0 or 1 value (this is called a sense amplifier). 3/17

Gabriele Svelto [moved]

gabrielesvelto@fosstodon.org

Reply to @gabrielesvelto@fosstodon.org

So how can this fail? In a huge number of ways. Circuits age with time and use. The ability of the individual capacitors to hold the charge goes down slowly over time, the transistors in the sense amplifiers degrade, points of contact oxidize, etc... Past a certain point this can make the whole process end up outside of the thresholds required to reliably read, write and retain the bits in memory. 4/17

Gabriele Svelto [moved]

gabrielesvelto@fosstodon.org

Reply to @gabrielesvelto@fosstodon.org

This can lead to different failures: a very common one is a stuck bit, which ends up being always read as 1 or 0, regardless of what was written into it. Another type is timing-dependent failures, which cause a bit to flip but only if it's not touched in due time by an access or a refresh. More catastrophic errors can affect entire lines - which is what happens when a sense amplifier starts to fail. 5/17
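
As a rough illustration of how a stuck bit shows up in a test, here is a minimal Python sketch. It runs against a toy in-process model of memory, not real hardware: `FaultyMemory`, `find_stuck_bits` and the `stuck_at` map are hypothetical names made up for this example. The idea is the one memory testers rely on: write complementary patterns, read them back, and flag any bit that refuses to change.

```python
class FaultyMemory:
    """Toy model of a memory array (hypothetical; no real hardware access)."""

    def __init__(self, size, stuck_at=None):
        self.cells = bytearray(size)
        # Maps (byte_index, bit_index) -> value the bit is stuck at (0 or 1).
        self.stuck_at = stuck_at or {}

    def write(self, index, value):
        self.cells[index] = value & 0xFF

    def read(self, index):
        value = self.cells[index]
        # Apply the simulated defects on every read.
        for (byte_index, bit), forced in self.stuck_at.items():
            if byte_index == index:
                if forced:
                    value |= 1 << bit
                else:
                    value &= ~(1 << bit) & 0xFF
        return value


def find_stuck_bits(memory, size):
    """Write complementary patterns and report bits that read back wrong."""
    faults = set()
    for pattern in (0x00, 0xFF, 0x55, 0xAA):
        for i in range(size):
            memory.write(i, pattern)
        for i in range(size):
            diff = memory.read(i) ^ pattern
            for bit in range(8):
                if diff & (1 << bit):
                    faults.add((i, bit))
    return sorted(faults)
```

A test like this only catches bits that are wrong every time. The timing-dependent failures described above need patterns that leave cells untouched for a while before reading them back, which is one reason thorough testing takes so long.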

Gabriele Svelto [moved]

gabrielesvelto@fosstodon.org

Reply to @gabrielesvelto@fosstodon.org

Either way, even a single bit error which happens once in a blue moon is catastrophic to a consumer machine. Sometimes it will cause a pixel to slightly change color, but sometimes it will affect an important computation and lead to a crash. Or worse: it'll cause some user data to be corrupted before it's written to disk, and when it is, the damage has become permanent. 6/17

Gabriele Svelto [moved]

gabrielesvelto@fosstodon.org

Reply to @gabrielesvelto@fosstodon.org

If your machine exhibits rare but hard-to-explain crashes, or if you're forced to reinstall programs - or even the operating system - because of mysterious failures, or experience random reboots or BSODs, then it's very likely that your memory is failing and you need to replace it. 7/17

Gabriele Svelto [moved]

gabrielesvelto@fosstodon.org

Reply to @gabrielesvelto@fosstodon.org

Diagnosing it is hard. Windows has a memory diagnostic tool which will catch the worst offenders and is easy to use: https://www.microsoft.com/en-us/surface/do-more-with-surface/how-to-use-windows-memory-diagnostic

It's not enough though: some errors can only be caught with more extensive testing. I recommend the open-source memtest86+ (https://memtest.org/) tool or the closed-source memtest86 one (https://www.memtest86.com/) 8/17

Gabriele Svelto [moved]

gabrielesvelto@fosstodon.org

Reply to @gabrielesvelto@fosstodon.org

Naturally what happens on PCs also happens on phones, network devices, printers, TVs, etc... but you can't test them. This is a disaster because these failures are common, and they become more and more common as the device ages. If we want to have repairable devices that last for a long time, the industry will have to change its practices, but more about this later. 9/17

Gabriele Svelto [moved]

gabrielesvelto@fosstodon.org

Reply to @gabrielesvelto@fosstodon.org

Now you might wonder: how often does this actually happen? The common wisdom on this topic is that hardware failures are so rare that software bugs will always dwarf them. As I found out, this is demonstrably false.

While investigating Firefox crashes I've come to the conclusion that several of the most common issues we were dealing with were likely caused by flaky hardware. This led me to come up with a simple heuristic to detect crashes potentially caused by bit-flips. 10/17
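
The thread doesn't spell out the heuristic, so the following Python sketch is only a guess at what a check of this kind could look like; it is not Mozilla's implementation. It assumes we know the faulting address and a set of addresses the program legitimately held, and asks whether the bad address is exactly one bit away from a good one. The function names and the `valid_addresses` input are hypothetical.

```python
def flipped_bit(observed, expected):
    """Return the index of the single differing bit, or None.

    None is returned both when the values are equal and when they
    differ in more than one bit.
    """
    diff = observed ^ expected
    # A non-zero value with exactly one bit set is a power of two.
    if diff != 0 and diff & (diff - 1) == 0:
        return diff.bit_length() - 1
    return None


def looks_like_bit_flip(crash_address, valid_addresses):
    """Return (valid_address, bit_index) if the crash address is one bit
    away from an address known to be valid, otherwise None."""
    for address in valid_addresses:
        bit = flipped_bit(crash_address, address)
        if bit is not None:
            return address, bit
    return None
```

A match is only a hint. Ordinary bugs can also produce addresses that happen to be one bit off, so a real pipeline would need further evidence before blaming the hardware.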

Gabriele Svelto [moved]

gabrielesvelto@fosstodon.org

Reply to @gabrielesvelto@fosstodon.org

Deploying this heuristic to Mozilla's crash reporting infrastructure has been eye opening: if I take the 10 most common crashes on Windows, 7 are out-of-memory conditions - that is, not bugs - and 3 are likely caused by bad memory.

You read that right: three out of the ten most common reasons why Firefox crashes on Windows are caused by memory that's gone bad. 11/17

Gabriele Svelto [moved]

gabrielesvelto@fosstodon.org

Reply to @gabrielesvelto@fosstodon.org

Now there are a few things worth mentioning. The first is that users with bad hardware will be over-represented in this category, because their machines crash far more often than others.

The second thing is that Firefox is exceptionally stable: we've driven down its crash rate by more than 70% in the last few years. But Firefox is also a 30 million-lines-of-code monster. There are bugs in there, but they're less common than hardware failures! 12/17

Gabriele Svelto [moved]

gabrielesvelto@fosstodon.org

Reply to @gabrielesvelto@fosstodon.org

Plotting these types of crashes against time yields interesting trends: the more machines age the more likely they are to encounter hardware-related failures. You might think that's obvious, and indeed it is, but until now the industry has looked the other way, based on the hand-wavy excuse that hardware failures were less common than bugs. 13/17

Gabriele Svelto [moved]

gabrielesvelto@fosstodon.org

Reply to @gabrielesvelto@fosstodon.org

So what needs to change? First of all, error detection and correction must become commonplace. You can already build a desktop machine with #ECC memory, but it's uncommon in laptops, even mobile workstations, and completely absent on phones and other consumer appliances. This will measurably lengthen the usable life of these devices. 14/17

Gabriele Svelto [moved]

gabrielesvelto@fosstodon.org

Reply to @gabrielesvelto@fosstodon.org

Note that detection is more important than correction. The user needs to know that there's something wrong without having to run a memory testing program. Think of the lights that turn on in cars if something's malfunctioning, or the error beeps that your washing machine makes when it thinks it's leaking water. These are extremely common, they need to be on computing devices too. 15/17
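
To make the difference between detection and correction concrete, here is a small Python sketch under simplifying assumptions. It uses a single even-parity bit for detection and the textbook Hamming(7,4) code for correction. Real ECC memory uses wider codes over 64-bit words, so treat this as an illustration of the principle, not of what any DIMM actually implements. All function names are made up for this example.

```python
def parity(bits):
    """Even parity: 0 if the number of set bits is even, 1 otherwise."""
    return sum(bits) % 2


def add_parity(bits):
    """Append one parity bit so the whole word has even parity."""
    return list(bits) + [parity(bits)]


def parity_ok(word):
    """Detects any single flipped bit, but cannot say which one flipped."""
    return parity(word) == 0


def hamming74_encode(data):
    """Encode 4 data bits into 7 bits laid out as p1 p2 d1 p4 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p4 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code):
    """Return (data_bits, syndrome). A non-zero syndrome is the 1-based
    position of the single flipped bit, which is corrected before the
    data bits are extracted."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome
```

The non-zero syndrome is the interesting part for this thread: even when the data is silently repaired, the hardware knows a cell misbehaved, and that is the signal that should reach the user.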

Gabriele Svelto [moved]

gabrielesvelto@fosstodon.org

Reply to @gabrielesvelto@fosstodon.org

Finally, hardware design must change to make devices repairable and prolong their useful life. Yes, I'm looking at non-ECC memories soldered on the motherboard or worse, on the same substrate as the CPU. 16/17

Gabriele Svelto [moved]

gabrielesvelto@fosstodon.org

Reply to @gabrielesvelto@fosstodon.org

To end the thread I'd like to thank my colleagues Alex Franchuk and @willcage who did the implementation work and my boss Gian-Carlo Pascutto who plotted crashes against machine age. I'd also like to point out that we've got preliminary data on the topic, but I fully intend to write a proper article with a detailed analysis of the data. 17/17

Jon (now at neuromatch.social)

jdp23@indieweb.social

Reply to @gabrielesvelto@fosstodon.org

@gabrielesvelto really interesting thread, thanks for writing it up!

Ondřej Pokorný

ondra@unextro.net

Reply to @gabrielesvelto@fosstodon.org

Edited 1 year ago

@gabrielesvelto Also, high energy particles from space affect even brand new RAM chips...
https://www.bbc.com/future/article/20221011-how-space-weather-causes-computer-errors

William D. Jones

cr1901@mastodon.social

Reply to @gabrielesvelto@fosstodon.org

@gabrielesvelto Really depressing that we've reached the physical limits of creating "memory we're confident will actually store its value reliably" :(.

We've gone from PARITY CHECK 1/2 to "memory works fine without detection or correction" to "oh now not even parity check is enough". In that sense, it's WORSE than 40 years ago :P.

Gabriele Svelto [moved]

gabrielesvelto@fosstodon.org

Reply to @cr1901@mastodon.social

@cr1901 yes, it is worse than 40 years ago! This is an area where we've actively regressed

@jbqueru@floss.social

jbqueru@fosstodon.org

Reply to @gabrielesvelto@fosstodon.org

@gabrielesvelto @cr1901 maybe that's why it sometimes felt like those old machines were rock-solid in spite of their limitations: hardware has become less reliable faster than software became more reliable.

Palmer Dabbelt

palmer

Reply to @gabrielesvelto@fosstodon.org

@gabrielesvelto IIRC the design point failure rate for consumer storage/memory was that it just had to be reliable enough that customers assumed it was Windows crashing and not the HW being broken.

Tommy Þ

tommythorn@chaos.social

Reply to @gabrielesvelto@fosstodon.org

Edited 1 year ago

@gabrielesvelto Working with existing non-ECC systems, couldn't some of this be caught if the OS ran a low-priority process scrubbing memory, e.g. writing a random but checksummed pattern to free pages and checking it later (similar to ZFS scrubbing)? Even better, important data structures should be checksummed (something I actually did in a database engine I wrote many decades ago).
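
The scrubbing idea can be sketched in a few lines of Python. This is a user-space toy under stated assumptions: a `bytearray` stands in for a free page, and `fill_page`, `check_page` and `PAGE_SIZE` are hypothetical names. A real scrubber would live in the kernel and work on physical pages. The pattern is regenerated from a seed, so only the seed and a CRC need to be remembered per page.

```python
import random
import zlib

PAGE_SIZE = 4096  # assumed page size for the example


def fill_page(page, seed):
    """Fill the page with a seeded pseudo-random pattern; return its CRC32."""
    rng = random.Random(seed)
    page[:] = bytes(rng.getrandbits(8) for _ in range(len(page)))
    return zlib.crc32(page)


def check_page(page, seed, expected_crc):
    """Return a list of (offset, flipped_bits_mask) for corrupted bytes.

    The cheap CRC comparison runs first. The pattern is only regenerated
    from the seed when the CRC does not match, to locate the damage.
    """
    if zlib.crc32(page) == expected_crc:
        return []
    rng = random.Random(seed)
    expected = bytes(rng.getrandbits(8) for _ in range(len(page)))
    return [
        (offset, page[offset] ^ expected[offset])
        for offset in range(len(page))
        if page[offset] != expected[offset]
    ]
```

The limitation is that this only exercises pages nobody is using. A flaky cell under live data stays invisible until that page is freed, which is why checksumming important data structures is suggested as a complement.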

PoroCYon

pcy@icosahedron.website

Reply to @gabrielesvelto@fosstodon.org

@gabrielesvelto FYI, this seems to be a nonnegligible cause of death of Nintendo 3DS units. some units either exhibit strange behavior, corruption in text characters (which turn out to be single bitflips), or just straight up refuse to boot

it's possible to demonstrate these are indeed DRAM* errors by using the boot9strap jailbreak (with ntrboot if not installed beforehand), as these run from SoC-internal SRAM instead of DRAM. booting the OS then typically fails, and it can also be used as a point to run a memtest

problem is that replacing the DRAM chips is *very* difficult. not only would it require BGA rework (because it's so small they couldn't not solder it on the mobo), Nintendo also used epoxy 'underfill' to glue the DRAM chip to the PCB to deter RAM probing attacks (as those were used against the DSi, the 3DS' predecessor), see the "white glue" here: https://giltesa.com/wp-content/uploads/2013/12/Nintendo_3DS_PCB-Top.jpg

*: DRAM is more often called FCRAM on the 3DS because that's the type of DRAM by fujitsu it uses

Wyatt (🏳️‍⚧️♀?)

wyatt8740@tech.lgbt

Reply to @pcy@icosahedron.website

Edited 1 year ago

@pcy @gabrielesvelto early-revision SNES consoles also fail regularly, and it seems most people point to the VRAM as the failure point. Although I'm not sure if that's been confirmed.
youtube video ID (of a video capture I made): e3P1c9eXKqo
