A few years ago I designed a way to detect bit-flips in Firefox crash reports, and last year we deployed an actual memory tester that runs on user machines after the browser crashes. Today I was looking at the data coming out of these tests and I'm now 100% positive that the heuristic is sound: a lot of the crashes we see come from users with bad memory or similarly flaky hardware. Here are a few numbers to give you an idea of how large the problem is. 🧵 1/5
In the last week we received ~470,000 crash reports. These don't represent all crashes, because it's an opt-in system; the real number of crashes will be several times larger. Still, ~25,000 of these were detected as having a potential bit-flip. That's one crash in every twenty potentially caused by bad or flaky memory, which is huge! And because the heuristic is conservative, we're underestimating the real number; it's probably at least twice as high. 2/5
In other words, up to 10% of all the crashes Firefox users see are not software bugs: they're caused by hardware defects! If I subtract crashes caused by resource exhaustion (such as out-of-memory crashes) this number goes up to around 15%. This is a bit skewed, because users with flaky hardware will crash more often than users with functioning machines, but even then it dwarfs all the previous estimates I've seen regarding this problem. 3/5
And to reinforce this estimate, I've looked at the numbers we got from the users who ran the memory tester after experiencing a crash: for every two crashes we think are caused by a bit-flip, the memory tester found one genuine hardware issue. Keep in mind that this is not an extensive test of all the machine's RAM; it only checks up to 1 GiB of memory and runs for no longer than 3 seconds... and it has found lots of real issues! 4/5
And for the record, I'm looking at this mostly on computers and phones, but this affects *every* device: routers, printers, you name it. That fancy ARM-based MacBook with RAM soldered on the CPU package? We've got plenty of crashes from those, and good luck replacing that RAM without super-specialized equipment and an extraordinarily talented technician doing the job. 5/5
@gabrielesvelto is it lots of different devices, each one experiencing rare crashes at random, or is there a small number of really shitty computers accounting for a large share of the crashes?
@gabrielesvelto and what is the ratio of people who ever get a (bit-flip) crash out of all those who opted into this telemetry?
@gabrielesvelto hopefully those MacBooks could run Linux with the badram option.
@guenther I can't answer that question directly because crash reports have been designed so that they can't be tracked down to a single user. I could crunch the data to find the ones that are likely coming from the same machine, but it would require a bit of effort and it would still only be a rough estimate.
@guenther generally speaking a single machine won't send a lot of crashes. It's very common that they only have one bad bit across their whole installed RAM. They'll hit it eventually, especially if it's in the lower address ranges, but not all of the time. And for it to cause a crash, some important data needs to end up there, like a pointer or an instruction.
@gabrielesvelto As a personal anecdote, I built a PC in 2017 with 16 GB of DDR4 RAM that I got from Amazon (Germany). I had to return it after extensive testing with Passmark's free version of memtest86: it had failing bits. The replacement did pass the heavy testing. If there was one thing I wanted that PC to be, it was stable and reliable.
A few years later I got a second 16 GB kit to expand it to 32 GB. I had to return that kit as well; it also had errors. The replacement again passed the extensive testing. This is still the PC I'm writing from now, in fact.
Manufacturers and their QA teams must be aware of their failure rates, but they likely don't act on them, to save costs and make higher profits. They still sell kits with some failures, because not many users subject their PCs/RAM to the torture of these long RAM tests (4 full passes or more, for sanity's sake, takes hours), and crashing here and there with normal usage is almost considered "normal" to some extent, unfortunately. From my experience, the "RAM Test" offered by Windows was an absolute joke: it never found anything on kits where memtest86 would find failures in about one of every two runs.
I remember watching a YouTuber testing a gaming build he had just put together: he ran Prime95 for only a few minutes, the computer didn't crash, and in his view that was good enough for a gaming PC. I happen to disagree, particularly because in that run, even though Prime95 didn't crash, it showed calculation-error warnings, which could well have been caused by RAM issues. Any calculation error from Prime95 is a serious hardware stability/reliability red flag, just like any finding from memtest86.
It is a failure of the industry that ECC RAM is still not standard, at least for PCs, laptops, and cellphones. Maybe it should be standard for all consumer electronics, in fact.
@raulinbonn yes, both hardware and big software vendors have handwaved this problem away for years by claiming that software bugs are more common. In my testing, hardware issues are common enough that they often drown out the software issues.
👾
@gabrielesvelto I did not know that bit flips refer to reproducible bad RAM issues... I thought they were random...
@adingbatponder people and research have usually focused on random bit-flips caused by high-energy radiation and similar phenomena. Actual RAM going bad is a poorly documented and researched problem, mostly because the industry doesn't care. This is a more extensive thread on the issue: https://fosstodon.org/@gabrielesvelto/112407741329145666
@gabrielesvelto How can I inspect this data on my local machine? Because I am suspecting I have a bit-flippy bit of memory :-/
@derickr check out the tools I linked to in this post: https://fosstodon.org/@gabrielesvelto/112407745077972912
@gabrielesvelto @adingbatponder At the end of that thread: " I'd also like to point out that we've got preliminary data on the topic, but I fully intend to write a proper article with a detailed analysis of the data. 17/17"
Was that article published, or is it approaching publication? I'd be very interested.
@raulinbonn @adingbatponder I never had the time to write it, it's on my TODO list for this year
@gabrielesvelto i have found every one of your discussions of this topic immensely fascinating and have been able to revise many assumptions i had about the cpu and memory system. i want to additionally commend you for both identifying that more invasive telemetry could have been useful and then making it unequivocal that it's always opt-in and still anonymized on top of that. i have had to push back very strongly on this sort of thing before and it takes my breath away to find someone else with extremely high standards for measurement work and user safety
@gabrielesvelto My mind was going to cheap low-end hardware, but now that you're throwing expensive Apple Silicon SoCs into the mix, it's a bit harder to believe that they suffer from bit-flips at the rates you are implying.
@stevenodb @gabrielesvelto low-end hardware might sometimes even be less likely to hit this because it's not even trying to be super fast. High-end hardware chasing the fastest speeds is pushing the limits of stability all the time.
@gabrielesvelto I work on a phone app with a very large install base and a decent crash rate. The crash reporter is just a sea of single-instance crashes. Crashes we see once, ever, and then never again.
During new version rollouts I've introduced the rule that "one crash is zero crashes". That is, don't even think about investigating a new crash in a new release until there are at least two of them.
I've always assumed that these are bit flips, so really good to see some evidence!
@WAHa_06x36 yes. Keep in mind that in our case I worked on some of this stuff because our signal-to-noise ratio in crash reports was getting awfully low, with low-volume valid crashes being swamped by bad hardware.
@gabrielesvelto so if you halved Firefox’s memory footprint you’ll reduce crashes by at least 7%?
@bnut not really, out-of-memory crashes are bizarre unintuitive beasts. For starters they only ever happen on Windows, never on macOS and very rarely on Linux. When they happen it's not because the user actually ran out of memory, it's because they ran out of commit-space. See a discussion on it and how we reduced OOM crashes by some ~80% a few years ago in this article: https://hacks.mozilla.org/2022/11/improving-firefox-stability-with-this-one-weird-trick/
@gabrielesvelto On Linux at least it is possible for the kernel to quarantine the physical pages containing bad bits. At a previous job we used this to remotely repair expensive appliances that would have required an onsite technician to swap out the whole unit.
It’s not widely used, and pinpointing the bad pages isn’t easy (or possible from userspace, afaik). But maybe now that DRAM is expensive again it could be improved.
@ericseppanen I would argue for widespread use of ECC memory. The price of doing it in hardware would be small, certainly smaller than the damage caused by bad memory. But even if one doesn't want to pay it there's always the possibility of using inline ECC in modern SoCs that have both an integrated memory controller and caches: https://software-dl.ti.com/jacinto7/esd/processor-sdk-rtos-jacinto7/latest/exports/docs/psdk_rtos/docs/user_guide/developer_notes_ddr_inline_ecc.html
@stevenodb high-end hardware pushes the limits of semiconductors both because of small feature size and high clocks increasing the chances of failure. In the case of DRAM the chance of malfunction increases with higher temperature too, as the ability of trench/stacked capacitors to retain charge degrades with it... and Apple puts its DRAM right next to the CPU, the single hottest place of the whole device.
@gabrielesvelto this makes me think
but of those the most fascinating outcomes to me are that
@datum yes, absolutely. Coincidentally, the bulk of Firefox code was compiled for size, not speed, by default, as smaller code proved faster in such a large codebase. Nowadays it's a complex PGO/LTO dance, but the focus on a small executable footprint has remained.
> And crashing here and there with normal usage is almost considered "normal" to some extent, unfortunately.
Are these people aware that a bit flip in some file system code could nuke somebody's hard drive?
@argv_minus_one @raulinbonn yes, the worst outcome of a bit-flip is when data that will be written to disk happens to overlap it, because the corruption then makes it all the way to the drive. And BTW this is one of the reasons why competent filesystems should always implement checksums for both data and metadata: it increases the chances of detecting these issues early, before they do permanent damage.
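To make the checksum point concrete, here's a toy sketch in Rust. The Fletcher-style sum below is just a compact stand-in for the stronger hashes real filesystems use (e.g. CRC32C in Btrfs, SHA-256 in ZFS): a single flipped bit makes the stored and recomputed checksums disagree, so the filesystem can refuse to trust the block instead of silently persisting garbage.

```rust
// Toy Fletcher-style checksum; real filesystems use CRC32C, xxHash, etc.
fn checksum(data: &[u8]) -> u32 {
    let (mut a, mut b) = (0u32, 0u32);
    for &byte in data {
        a = (a + byte as u32) % 65535;
        b = (b + a) % 65535;
    }
    (b << 16) | a
}

fn main() {
    let mut block = vec![0x42u8; 4096];
    let stored = checksum(&block); // computed when the block was first written
    block[1000] ^= 0x08;           // a single bit flips in RAM before the write
    // Recomputing at write (or read) time exposes the corruption.
    assert_ne!(checksum(&block), stored);
}
```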
@gabrielesvelto @adingbatponder I had a coworker whose PhD was analyzing bit errors, and who concluded that running without at least ECC RAM, particularly in a data-center setting, was madness. But it had serious repercussions for users doing data analysis on their (non-ECC) desktops for research.
@trouble @adingbatponder yes, at the datacenter level the amount of errors you get is enormous. SECDED ECC doesn't cut it there anymore so usually more robust detection/correction systems are used.
@gabrielesvelto very cool! is the logic open source somewhere you can link, or [even more lazily 😇] is the heuristic easy for you to summarize in broad strokes?
@natevw absolutely! The logic is very simple in principle and described here: https://bugzilla.mozilla.org/show_bug.cgi?id=1738651#c0
We've since tweaked it: we now rely on disassembling the executing instruction to more accurately calculate the real address to test for bit-flips. You can find the code here, in our crash analysis tooling:
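For anyone who doesn't want to click through, the core idea fits in a few lines of Rust. This is a hedged reconstruction of the heuristic as described in the bug, not the actual rust-minidump code; `mapped_ranges` is a made-up stand-in for the memory map recovered from the minidump.

```rust
// Sketch: a crash looks like a bit-flip if the faulting address is
// invalid, but flipping exactly one of its bits yields an address that
// *was* validly mapped in the crashed process.
fn looks_like_bit_flip(crash_addr: u64, mapped_ranges: &[(u64, u64)]) -> bool {
    let is_mapped =
        |addr: u64| mapped_ranges.iter().any(|&(start, end)| addr >= start && addr < end);
    if is_mapped(crash_addr) {
        return false; // the address was valid, no flip needed to explain the crash
    }
    // Try flipping each of the 64 bits in turn.
    (0..64).any(|bit| is_mapped(crash_addr ^ (1u64 << bit)))
}

fn main() {
    // A hypothetical module mapped at [0x10000, 0x20000).
    let ranges = [(0x10000u64, 0x20000u64)];
    // A pointer into it whose bit 40 got flipped becomes a wild address...
    assert!(looks_like_bit_flip(0x12345 ^ (1 << 40), &ranges));
    // ...while an address far from anything mapped is not flagged.
    assert!(!looks_like_bit_flip(0xdead_beef_0000, &ranges));
}
```

The conservatism mentioned earlier in the thread falls out naturally: flips that land in data rather than in a pointer or instruction never produce an invalid address, so they're never counted.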
@gabrielesvelto holy shit, so what, you're just writing a gig of data and reading it back? and that works well enough?!
@dysfun yes, believe me when I tell you that I was as surprised as you are. We try to do a few different tests (like writing different patterns before reading them back) but in a nutshell, that's it. Write, read back, check if it matches.
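A minimal sketch of that write/read-back/compare loop, for the curious. This is not Mozilla's actual tester (a real one also has to defeat CPU caching, which this ignores), just the shape of the idea:

```rust
// Fill a buffer with known patterns, read it back, and report the offset
// of the first mismatching byte, if any.
fn test_buffer(buf: &mut [u8]) -> Option<usize> {
    // Alternating patterns exercise both 0->1 and 1->0 transitions.
    for &pattern in &[0x00u8, 0xFF, 0xAA, 0x55] {
        for byte in buf.iter_mut() {
            *byte = pattern;
        }
        for (offset, &byte) in buf.iter().enumerate() {
            if byte != pattern {
                return Some(offset);
            }
        }
    }
    None
}

fn main() {
    // Healthy memory should pass every pattern.
    let mut buf = vec![0u8; 1 << 20]; // 1 MiB; the real test covers up to 1 GiB
    assert_eq!(test_buffer(&mut buf), None);
}
```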
@gabrielesvelto I'm curious, are you considering displaying this information to the user ?
Like, in the crash report window "it looks like your memory might be faulty, you may want to test it [link to a page explaining how to test memory]"
@toadjaune yes, that's something I'm planning to do. However it needs very careful UI/UX design to avoid looking like one of the many, many scams that pretend your machine is borked.
@gabrielesvelto A couple of questions. In a previous post you said "potentially" and "heuristic", but in this post you drop those qualifiers. That's two qualifiers, correct? Also, Firefox should only be using a relatively small amount of system RAM, correct? So if bad RAM was causing the Firefox problem, shouldn't the user statistically be seeing a LOT of bad-RAM issues system-wide?
@CliffsEsport we've had this data for years and it's always been consistent; we even verified it via actual memory testing, so I can safely claim that it is indeed real. Additionally, we know that we're undercounting these defects: we've designed the system to be conservative, so the real numbers are certainly higher, potentially much higher than what the heuristic tells us.
@gabrielesvelto How does these numbers look if we instead count by unique callstacks? It feels intutive that bad memory can crash anywhere but other crashes are restricted to bugs…?
@breakin that's a very good observation: yes, these crashes cluster under some specific bugs. I discovered the problem years ago when my colleagues working on the garbage collector were trying to diagnose crashes that appeared to be impossible, and indeed they were. Because the garbage collector sweeps large amounts of memory containing structured data, it was far more likely to encounter a problem than other code.
@breakin another example of code that frequently hits bad memory is code that manipulates very large collections such as hash tables. Traversing lots and lots of pointers means that you're more likely to find a broken one.
@gabrielesvelto it's real. we saw this at Twitter while chasing down 404s. Someone wrote a large analysis of strange username typos that couldn't be human error when this paper came out: https://web.archive.org/web/20180713212603/http://media.blackhat.com/bh-us-11/Dinaburg/BH_US_11_Dinaburg_Bitsquatting_WP.pdf
@sayrer sifting through any significant amount of data sent by machines without ECC memory reveals lots and lots of those issues: https://blog.mozilla.org/data/2022/04/13/this-week-in-glean-what-flips-your-bit/
@gabrielesvelto so will the next new feature in rust be some kind of memory checksumming? Almost like a virtual ECC
@gabrielesvelto might be hard to do as a compiled language feature, but any interpreter/VM could do it
Any compiler or interpreter could do this. After every store, do a clflush, load, compare, and panic if there's a mismatch.
Problem is this would explode executable size (that's a lot of extra instructions) and ruin performance (that's a lot of extra instructions, and also you're effectively disabling the entire CPU cache).
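The per-store check being described might look roughly like this sketch. The volatile accesses keep the compiler from eliding the round-trip; a real implementation would additionally need a cache-line flush (e.g. clflush) so the read actually reaches DRAM, which is a big part of the cost being discussed.

```rust
use std::ptr;

// Store a value, immediately read it back, and abort on mismatch.
fn checked_store(slot: &mut u64, value: u64) {
    let p: *mut u64 = slot;
    unsafe {
        ptr::write_volatile(p, value);
        let readback = ptr::read_volatile(p);
        assert_eq!(readback, value, "mismatch after store: possible bit-flip");
    }
}

fn main() {
    let mut x = 0u64;
    checked_store(&mut x, 0xdead_beef);
    assert_eq!(x, 0xdead_beef);
}
```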
@argv_minus_one @ShadSterling it can be done at the hardware level even in the absence of the extra physical memory required for implementing traditional SECDED ECC. It's called inline ECC and sacrifices a chunk of your memory plus a marginal amount of performance, but it's already available, especially in the embedded market.
@gabrielesvelto Linus Torvalds recently said that ECC memory is very important to him and that he doesn't understand why it's not used everywhere
@adrianmester yes, we've been beating this drum for decades. There are threads where we discussed this on realworldtech's forums that go back to the '10s
@gabrielesvelto curious how many were misconfigured mem profiles / unstable cpu oc / thermals etc. vs. bad hardware
@brotherpsyche impossible to say without more data. I can tell you I routinely stumble upon crashes that are likely caused by overclocking, but I can't be sure because we don't collect this information.
@CliffsEsport and to answer your second question, that's absolutely what happens on user systems. If they have bad RAM they're seeing issues all over the place, including data corruption.
@gabrielesvelto if you want some field data: I have a PC with 4 memory slots. When populated with 2x16GB cards overclocked to their XMP speed, all memory tests pass. When populated with 4x16GB cards overclocked to their XMP speed (all the exact same model), memory tests fail consistently on a very high-address bit. I had to slightly under-volt them below their XMP speed to get the tests to pass. I would only ever get crashes in games and other heavy-load processes.
@gabrielesvelto my guess is that a lot of people who enable XMP never test their memory and assume it should "just work." My guess is that it's my chipset which isn't good enough to handle all that memory bandwidth, as tests fail on the same address bit regardless of the permutation of RAM boards. So I am not surprised at all that there's plenty of weird memory bugs out in the field, especially for a popular program.
@brunoph if you see the same hardware bit even after moving the DIMMs around, then yes, the problem is likely either in the memory controller, or it arises from interference along the traces leading to the DIMM slots under a specific access pattern
@gabrielesvelto Someone on the Chrome team at Google told me years ago that they can diagnose a user’s machine as having bad RAM from crash logs.
@shac it's very easy if you collect data that identifies the user: users with bad RAM will have plenty of apparently random crashes. In our case we explicitly don't collect identifiable user information, so we need to analyze the crashes themselves. I can tell you with quite a bit of pride that we did this before Google, and better than how they do it.
Checksums will let you detect that there is a problem, but won't actually save your data. If a bit flip causes the file system driver to write to the wrong LBA or corrupt a key file system data structure or something, the damage will still be quite permanent unless you have a backup.
@argv_minus_one @raulinbonn yes, detection can only do so much. Even in our case there are indirect crashes that we simply cannot detect. For example in Rust code a bit-flip will often cause an invariant to be broken, so the code will fail cleanly but the origin of the crash will be lost as it happened before the point where we can detect it.
@gabrielesvelto I'm very interested in implementing this at work, but I'm struggling to find more information. Could you explain how you detect possible memory errors or link to an explanation or code?
@lydiafacts it's really quite simple (which is why it typically undercounts the errors), I've described it here:
https://bugzilla.mozilla.org/show_bug.cgi?id=1738651#c0
The code is here: https://github.com/rust-minidump/rust-minidump/blob/26cd5dd6eb0ddfb29bcb7ed2d8c75cd8e6368203/minidump-processor/src/processor.rs#L1463
This is all under a permissive license and easily embeddable, so by all means make use of these tools if you can. I went to great lengths to make sure this would be useful outside of Mozilla, so it could benefit the whole FOSS ecosystem.
@gabrielesvelto ECC DRAM would improve things, but people have been arguing this for decades and if it hasn't happened yet I doubt there's much you or I can do to change that.
On the other hand, if someone with an interest in systems programming wanted to build an always-on bad DRAM detector, I would encourage them to try.
It would be a nice gift to the world to prevent all those existing computers from going to the landfill over a few bad DRAM pages soldered to the mainboard. It might even enable a secondary market for mostly-good-but-slightly-bad DRAM modules to be bought at a discount.
Here's one possible design of an always-on bad-memory handler:
- A kernel API for allocating physical pages.
- A kernel API to quarantine physical pages.
- A userspace daemon that periodically allocates, tests, and releases (or quarantines) physical pages.
- A heuristic for determining if physical addresses are actually bad (and not the result of system flakiness or software bugs).
- A way to persistently store lists of bad pages, and tooling to maintain those lists.
- A way to load the bad pages list at boot (kernel params exist, but the list could be quite large...)
I'm sure a few of these exist (my knowledge is quite out of date), but to the best of my knowledge nobody has put them all together in a form that could be enabled by default and then mostly left to run on its own forever.
@ericseppanen our design could easily be reproduced at the kernel level, and it would work a lot better there. When a process crashes, scan the physical memory that was allocated to it with a memory tester; if something bad comes up, (permanently) unmap the physical page. The fact that it works well for us in userspace means that in the kernel it would do wonders.
@gabrielesvelto Oh wow, I didn't know it was so prevalent. I got bad ram and I patched it up with linux cmdline stuff. Maybe I should make some kind of GUI to make that process easier for people. I made a little C app that allocates as much as possible and then does some crude ram test with it, and I detected errors without the cmdline and no errors with. The application can't tell me the physical addresses, of course, but it's a good indicator of whether my bad ram was properly skipped.
@starsider yep, the kernel could do this on the user's behalf like we do in userspace. It would work a lot better there too.
@gabrielesvelto @guenther why would the lower address ranges be special? This is confusing to me. I can't imagine firefox runs on any OS without virtual memory (or ASLR), so it doesn't seem like that should correlate strongly with any physical aspect.
@gabrielesvelto By "linux cmdline" I mean the kernel arguments when booting it. Ideally there should be a tool that adds these automatically.
@starsider memtest86 & memtest86+ support emitting this information and you can feed it back into GRUB. It's not completely automated but it's the closest you can get to that
@vathpela @guenther I meant in the lower *physical* address ranges, because it's more likely to be used early on even on a lightly loaded machine. I once had a laptop with a bad bit at the very end of the physical range, I would hit it only when running Firefox OS builds which were massive (basically building Firefox + a good chunk of Android's base system at the same time)
@gabrielesvelto @guenther That still seems really weird to me - why would firefox be likely to get a low physical address? If anything is likely to have a higher chance of getting that memory, I would think it would be the kernel (which has genuine lowmem requirements sometime for e.g. dma bufs and such on some platforms), but a userland process seems odd.
@gabrielesvelto is there a way I can run that test and see the result during a Firefox crash for me? (Even though the most likely causes of my crashes are out-of-memory issues 🤪)
@letoams if the entire browser crashes (not just a tab) there will be a checkbox that's on by default which will run the test. Unfortunately we don't have a way to surface the result for users yet. That being said if you fear you might have bad RAM try testing it with memtest86+: https://www.memtest.org/
@vathpela @guenther oh it's not for Firefox specifically. Users with bad bits in lower address ranges will be more likely to encounter problems with *everything*, including the kernel. I also don't literally mean the *lowest* ranges. Say, if you have a bad bit in the first GiB of physical memory you'll see its effects far more often than if you have it in the last one on a 32 GiB machine
@gabrielesvelto Interesting, thanks! Although, sorry, I meant: if the memory footprint were smaller, wouldn't that reduce the chance of hitting a flipped bit? Although the program bits are likely only a few hundred megabytes, so the crash reduction wouldn't scale linearly. I imagine most of it is going to be images or video, where I guess a flip may cause bad display, not a crash.
@bnut indeed, most of the time bad bits will be completely silent, sometimes they'll show up as data corruption, and if you're very unlucky you'll get a crash, or worse, permanent data corruption because some data that was going to be written to disk landed right on top of them
@gabrielesvelto @vathpela @guenther Linux kernel has CONFIG_SHUFFLE_PAGE_ALLOCATOR to randomize which memory gets allocated first, which generally distros enable, but probably no one activates it by the boot param page_alloc.shuffle=y ;)
@raulinbonn @gabrielesvelto I can concur: I had RAM modules that would fail memtest brand new.
However, if you want some reliable data, I suggest searching for data on server RAM going bad.
Servers have ECC and multiple failover modes (failed addresses get remapped, spare modules, etc.). I'm pretty sure some sort of research exists on how fast RAM chips go bad.
@atis @raulinbonn there have been lots of studies about memory failure in servers but they're not very relevant to what users see because of the different conditions they happen in. Servers run in controlled environments, with cleaner power delivery, controlled temperature, lower clocks and shorter lifespans than client devices. So even the same physical DRAM chips will behave differently between a server and a client device.
@vbabka @gabrielesvelto @guenther all boot params are policy failures :)
@vbabka If it is not enabled by default, then it is not important.
@guenther @gabrielesvelto ah yes, bitflips georg, who lives in a plutonium mine and gets a thousand bit-flip related crashes every day, is an outlier adn should not have been counted
@gabrielesvelto This matches the kind of things I've heard from the Microslop people about their error data.
@gabrielesvelto it certainly seems like modern operating systems should be smart enough to keep track of what memory regions a program had control of during a crash and file them for opportunistic testing later…
@gabrielesvelto Awesome! But do you report visually to the user that they might have bad ram? That would be a good user experience thing to do. Otherwise they will just get angry at Firefox.
@oleksandr @vbabka @gabrielesvelto @guenther no seriously, basing things on boot params should just be considered a bug. It's always a bad choice, usually thought to be necessary because of some other bad choice or trade-off.
@gabrielesvelto silly question, since I am asking way beyond my own knowledge: is there any way to defensively program such that the software won’t crash when bad memory flips a bit? Like, any way to write the code to have a more graceful failure mode?
I realize this might be an astoundingly foolish question, and/or might require a defensive programming paradigm that not only diverges from norms but may be unfamiliar/inaccessible to standard coders.
@davidaugust You can keep 3 copies of everything in memory, and constantly check them against each other, and best two out of three wins... but that takes 3x as much RAM, and browsers are notoriously heavy on RAM use to begin with. Plus, that's usually a thing done by specialized hardware.
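The two-out-of-three voting described above is classic triple modular redundancy; here's an illustrative sketch of it in Rust. The bitwise majority recovers the original value as long as at most one copy is corrupted at any given bit position.

```rust
// Keep three copies of a value and majority-vote on every read.
struct Tmr(u64, u64, u64);

impl Tmr {
    fn new(v: u64) -> Self {
        Tmr(v, v, v)
    }
    // A result bit is 1 iff it's 1 in at least two of the three copies.
    fn read(&self) -> u64 {
        (self.0 & self.1) | (self.1 & self.2) | (self.0 & self.2)
    }
}

fn main() {
    let mut cell = Tmr::new(0xCAFE);
    cell.1 ^= 1 << 3; // a bit flips in one of the three copies
    assert_eq!(cell.read(), 0xCAFE); // the vote recovers the value
}
```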
@psymar that makes good sense, both the technique and the downsides.
@davidaugust @psymar It’s also theoretically possible to design software to be (somewhat) more resilient when memory corruption does happen. But this is not something anyone does. To the contrary, the trend is towards more and more sanity checks that terminate the program when they fail – everything from bounds checks and assertions to PAC and MTE.
Why? Security. Memory corruption can also be the result of software exploits, and in that context, a crash is a good thing as it stops the attack.
@davidaugust @psymar But, for example, there are a lot of known bugs in games for the Nintendo 64, GameCube, and Wii, that crash the game on hardware but cause no problems on emulator. Why? Because (for various reasons) the emulators for those consoles ignore certain error conditions that the hardware detects, such as null pointer accesses and floating point exceptions. Often the game runs fine afterward.
On consoles earlier than that, the hardware didn’t have that error checking either.
@gabrielesvelto Firefox can consume a large amount of memory. I assume that the more RAM programs use (which web browsers do), the more bit flip errors can be expected. Am I right?
@vathpela @oleksandr @gabrielesvelto @guenther it's meant for hardening and there's some performance trade-off, which is typical. But IMHO it's better if hardening options can be enabled just by boot parameters and not require a different distro kernel flavor.
@vbabka @oleksandr @gabrielesvelto @guenther you should be able to turn it on with a running kernel.
@vbabka @oleksandr @gabrielesvelto @guenther IMO it's never about boot time vs compile time, and always about being able to turn it on and transition in to it. Of course that's sometimes the hardest way and why we make bad trade-offs, but it also keeps us from being able to enable a lot of features we want on the boot path.
@vbabka @oleksandr @gabrielesvelto @guenther Obviously I have some bias here, but it's because people want things from booting that command line variability makes intractable.
@vbabka @oleksandr @gabrielesvelto @guenther and other OSes simply do not have this problem at all. It's optional.
@gabrielesvelto Ohh how long back do you have this data for? Is there any correlation with the solar maximum/minimum cycles?
@kyhwana only six months, because we systematically purge older data as it contains private identifiable information. IIRC there is sometimes a correlation, but it's not particularly significant. There is a very strong correlation with machine age, however. I also found other interesting correlations: https://mas.to/@gabrielesvelto/114813152373394985
@gabrielesvelto also, kinda unrelated question, but since you seem quite knowledgeable on this topic :
Consider you just got a passing memtest86+ run. How confident would you be that your memory is, in fact, fine?
@toadjaune I'd be fully confident that the memory works. memtest86+ has an extremely extensive battery of tests that include not only raw memory contents, but also interference patterns, address checks and more. Those can unearth all sorts of problems, from oxidized DIMM contacts to failing memory controllers. I had machines where it could run for 2 full hours and then reliably find an issue at the very tail end of the test, an issue that would otherwise be invisible.
@gabrielesvelto @raulinbonn I mean, it would give some sort of baseline.
As for timespan, I would disagree, a server running 24/7 for 10 years easily beats any office user (40 hours/week) or typical home user (even less)
@atis @raulinbonn in usage hours yes, a server will always beat a client device (well, maybe not an always-on one such as a phone) however there is an extremely strong correlation in our data between machine age and the number of observed failures. Anyway this is an older but still relevant study if you're interested: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35162.pdf
The rates that they reported are way, way lower than what I see.
@mhoye yes! A kernel would be the perfect place to do this: scan the physical memory associated with a process if it crashes. It could be done probabilistically, like we do, so as not to do it too much. If we're able to catch a large number of failures in user space with this little effort, then a kernel would be able to catch the vast majority of actual problems.
@davidaugust no, it's impossible. Errors could happen anywhere, including in the code of the program, in the data you're reading from disk or receive from the network (because it ends up in memory), in the structures used by the kernel, etc... The only solution is hardware error detection and correction.
@purpleidea we want to do that but the UI and UX design is a remarkable challenge. We don't want it to look like the many, many scams on websites that pop-over messages telling you your machine is borked.
@davidaugust @gabrielesvelto Servers normally use ECC memory (Error Correction Code) which detects such bit flips… but Intel decided that consumers don't need that.
@stuartl @gabrielesvelto oh wow. So no longer available?
@davidaugust @gabrielesvelto Never offered to consumers in the first place.
If you want ECC RAM, you buy a server.
@stuartl @davidaugust you can build desktop-class machines with ECC memory by picking the right CPU, the right motherboard and unbuffered ECC UDIMMs. I've been using desktop machines with ECC memory since 2012, the current one - my main workstation - has two 48 GiB unbuffered ECC DDR5 sticks on an ASUS PRIME B650-PLUS and a Ryzen 9700X. Before the RAMpocalypse began ECC UDIMMs were a bit more expensive than regular ones but not so much as to make them inaccessible.
@gabrielesvelto this definitely makes me feel better about my decision to use ECC memory for my desktop I built in 2024. I just wish there were more notebook options with ECC support.
@benpye yes, there are so few of them. And what's maddening is that the rise in soldered RAM means that machines with CPUs that support ECC will never actually use it, because you can't change the RAM (or even order it when new)
@gabrielesvelto can I run the bit-flip tester independently? 🤔
@jj no, but you can use one of the dedicated tools I pointed to in this post, which are a lot more effective than our simple tester: https://fosstodon.org/@gabrielesvelto/112407745077972912
Would there be any way of getting to that memory diagnostic information?
Because I've got an intel 13th gen laptop where firefox crashes constantly, and I'm trying to convince the vendor it's the main board.
@alienghic if you've got a Raptor Lake-based machine check if you've updated the BIOS and got the latest CPU microcode, but even then it could be your CPU acting up. See my old thread on the topic: https://mas.to/@gabrielesvelto/115939583202357863
@gabrielesvelto Is the firefox crash reporting stuff open source and available anywhere? I need to get something similar setup for @pidgin and obviously would love to be able to reuse existing work 😅
@grimmy @pidgin absolutely! We used to rely on Google's projects but reusing them was hard so I've been driving a large effort of replacing them with easy-to-integrate tools. Here's the relevant crates:
https://github.com/rust-minidump/rust-minidump
https://github.com/rust-minidump/minidump-writer
https://github.com/mozilla/dump_syms/
https://github.com/EmbarkStudios/crash-handling
The last one is for integrating stuff into your project. The code we use to integrate this stuff in Firefox is still too tightly bound to our machinery but I plan on moving it out soon.
@gabrielesvelto aren’t some bit flips caused by cosmic rays?
@me yes but they're probably a vanishingly small amount, unless we're talking about computers operating at very high altitudes or in space where those rays become a real problem
@gabrielesvelto It's good to see real numbers for this. I think I once made a guess of 5%-50%, so the magnitude of the numbers isn't surprising to me.
@dbaron yes, if you want to explore our data just check the crash reports that have the `possible_bit_flips_max_confidence` value set: https://crash-stats.mozilla.org/search/?possible_bit_flips_max_confidence=!__null__
As you'll notice the bulk of the crash signatures happen in code that touches a lot of memory such as the GC or large hash-tables. Unfortunately I don't have good numbers for Rust code because most of the crashes there will happen in a controlled way and I haven't figured out a way to detect them reliably.
@stark @raulinbonn I disagree, even simple SECDED ECC would significantly lengthen the lifetime of consumer electronics. It takes a while before a system develops multiple bit failures within the same chunk that cannot be corrected (unless it's a catastrophic failure such as an entire bit/word-line going bust)
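To make the SECDED point concrete, here's a toy sketch of single-error-correct / double-error-detect coding, using a Hamming(7,4) code plus an overall parity bit over one nibble. This is purely illustrative and assumes nothing about real DIMM internals: actual ECC modules apply the same Hamming-plus-parity idea over 64-bit words with 8 check bits (a 72,64 code).

```rust
#[derive(Debug, PartialEq)]
enum Decoded {
    Clean(u8),     // no error detected
    Corrected(u8), // single bit-flip, repaired
    DoubleError,   // two flips: detected but not correctable
}

fn encode(nibble: u8) -> u8 {
    let d = |i: u8| (nibble >> i) & 1;
    // Code word layout (1-indexed positions): p1 p2 d0 p4 d1 d2 d3
    let p1 = d(0) ^ d(1) ^ d(3);
    let p2 = d(0) ^ d(2) ^ d(3);
    let p4 = d(1) ^ d(2) ^ d(3);
    let word = p1
        | (p2 << 1)
        | (d(0) << 2)
        | (p4 << 3)
        | (d(1) << 4)
        | (d(2) << 5)
        | (d(3) << 6);
    // Bit 7 makes the overall parity of the byte even.
    word | (((word.count_ones() as u8) & 1) << 7)
}

fn decode(word: u8) -> Decoded {
    let b = |i: u8| (word >> i) & 1;
    // Each syndrome bit recomputes one parity over the positions it covers;
    // a non-zero syndrome is the 1-indexed position of a single flipped bit.
    let syndrome = (b(0) ^ b(2) ^ b(4) ^ b(6))
        | ((b(1) ^ b(2) ^ b(5) ^ b(6)) << 1)
        | ((b(3) ^ b(4) ^ b(5) ^ b(6)) << 2);
    let parity_ok = word.count_ones() % 2 == 0;
    // Data bits live at positions 3, 5, 6, 7 (bits 2, 4, 5, 6).
    let data = |w: u8| ((w >> 2) & 1) | (((w >> 4) & 0b111) << 1);
    match (syndrome, parity_ok) {
        (0, true) => Decoded::Clean(data(word)),
        (0, false) => Decoded::Corrected(data(word)), // flip hit the parity bit
        (pos, false) => Decoded::Corrected(data(word ^ (1 << (pos - 1)))),
        (_, true) => Decoded::DoubleError,
    }
}
```

Any single flip anywhere in the byte is repaired; any two flips change the parity back but leave a non-zero syndrome, so they're detected rather than silently miscorrected, which is exactly the property that keeps a slowly degrading module usable.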
@gabrielesvelto Do you have technical details about that? Is it a feature in breakpad or is Firefox nowadays using something new to generate the minidumps?
@ePirat we do it on the processing side, this has been in rust-minidump for years which IIRC you already use. Check the 'possible_bit_flips' entry in the schema for the output of `minidump-stackwalk`: https://github.com/rust-minidump/rust-minidump/blob/main/minidump-processor/json-schema.md
As for the memory tester it's part of the crash reporter client: https://searchfox.org/firefox-main/source/toolkit/crashreporter/client/app/src/memory_test.rs
It's still designed around Firefox but you could extract the whole crate (`crashreporter`) and reuse it. I plan on moving it out of our tree and making it reusable soonish.
Do you know why Firefox lost the browser wars?
Because they always blamed everyone else — always.
It was always a browser extension’s fault, a hardware fault, or the user’s fault. Meanwhile, Chromium works — that’s it — it works.
You lose, and worst of all, you don’t even own up to it.
@NetscapeNavigator @gabrielesvelto *totally* didn't have anything to do with the power/money of Google behind Chromium.
it's hella ironic a Netscape fanboy account is encouraging the death of browser diversity.
@NetscapeNavigator @gabrielesvelto also SUPER ironic considering Firefox can be kinda considered a successor to Navigator in many ways
@magicalgrrrl Firefox is not only a successor of Navigator, it's literally the same codebase that evolved into it. A lot of functions in our codebase are still prefixed with `NS_` or `ns` (for Netscape). You can still find traces of when the thing ran on OS/2 or Windows 95 if you search hard enough: https://searchfox.org/firefox-main/rev/22d04b52b0eb8d9fa11bf8ede5ccc0243a07c5ba/nsprpub/config/rules.mk#80
@roytam1 DDR5 has ECC only over transfers from the DRAM chips to the memory controller, unfortunately it doesn't have ECC covering the memory itself
@gabrielesvelto
Won't this misclassify issues which originally come from (some_struct_t*)(corrupted_address)->flag=value;
?
And a few cases of *bad_addr += 2^n ? (If the bad address happens to point at a pointer)
Or am I misunderstanding the check?
@natevw
@viraptor @natevw we're doing the check on the base address to avoid issues with offsets. We disassemble the crashing instruction, subtract the offset (whether it comes from a literal or another register) then do the computation on the base address. Off-by-one errors to aligned buffers can lead to false positives and that's something I intend to address, but they're not very common https://github.com/rust-minidump/rust-minidump/issues/960
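The comparison at the heart of it is tiny. A simplified sketch (not the actual rust-minidump code, which also does the disassembly, offset recovery and confidence scoring described above): once you have the base address in hand, you check whether it differs from a plausible valid address by exactly one bit.

```rust
/// Returns Some(bit_index) if `observed` could be `plausible` with a
/// single bit flipped, i.e. their XOR has exactly one bit set.
fn single_bit_flip(observed: u64, plausible: u64) -> Option<u32> {
    let diff = observed ^ plausible;
    (diff.count_ones() == 1).then(|| diff.trailing_zeros())
}
```

For example, a crash at an address that matches a mapped region except for one high bit (say bit 40) is a strong bit-flip candidate, while addresses that differ in several bits are not flagged.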
@lispi314 not really, but the XNU kernel has at least some checks for bit-flips that log errors when they're encountered within an executable: https://github.com/apple-oss-distributions/xnu/blob/f6217f891ac0bb64f3d375211650a4c1ff8ca1ea/bsd/kern/kern_exit.c#L1989-L2035
Any possibilities that these bit flips are row hammer type issues? 🙂
@simonzerafa some could be but it's hard to tell. Given there's a very strong correlation between the age of the affected machines and the errors I tend to think that's mostly plain flaky hardware.
@michal users with bad memory will experience general instability. Firefox is a good candidate for stumbling upon them and crashing because it generates large amounts of code (JIT compiling JavaScript) and manipulates large amounts of structured data. The garbage collector is the code most likely to hit them. Generally speaking a user with bad memory will experience random instability and data corruption as time goes by.
@gabrielesvelto I'd love to read a blog post about that analysis method you designed, I'm really curious
@pollyglot it's not at all complicated, check it out https://mas.to/@gabrielesvelto/116172856165097305
@gabrielesvelto cool effort! should go to research, maybe? so far, I’ve seen this treated at scale for infrastructure, but much less for client devices.
@promovicz yes, I'd love to do a proper article on this. Unfortunately I'm too busy working on this machinery with my team and keeping the lights on to find the time for proper research 
@gabrielesvelto can you talk about the heuristics used to detect bit flips?
@joxean it's very simple, check the links here: https://mas.to/@gabrielesvelto/116172856165097305
@gabrielesvelto will I as a user be notified that I maybe should do a memory check?
@Bindestriche the crash reporter client will do it automatically if the entire browser crashes, but it won't tell the user, we don't have proper user interface features for that yet
@gabrielesvelto surely it will have some false positive (as in flags a crash as being from a hardware bit flip when it isn't) rate? For example, a bug might accidentally add some large number of bytes to an address such that it goes out of range of the mapped regions, but in a way that a bit flip can remove 2^n from that address and move it back into a valid range? I think I don't have the intuition to say how often that would happen but some simple stochastic modelling could probably model how likely this all is
@pollyglot yes it can, but it proved to be very rare in practice. This is the only real case that happens with a certain frequency and I'll get around to fixing it soon enough: https://github.com/rust-minidump/rust-minidump/issues/960
Has there been any discussion of a Firefox Labs experiment to gather more information? 🙂
@gabrielesvelto Really cool insight! Now I understand why Linus Torvalds was so angry with Intel making ECC an enterprise feature.. I wonder whether your metric is correlated with solar events
@campfireman not really, but it's strongly correlated with the age of the machine
@campfireman I also found correlations to heat waves for other type of hardware issues: https://mas.to/@gabrielesvelto/114813152373394985
@gabrielesvelto
Were you able to find a way to fingerprint issues like e.g. Intel 13 and 14 gen partially killing itself?
I had an issue where every time the system went into idle state it would cause memory errors and applications to crash with SIGINT and SIGSEGV. That issue at the time was almost impossible to pinpoint for me and I ended up replacing all components until I finally replaced the CPU.
That was in part because load would make the system stable, so all memtests were fine...
@agowa338 we have different type of heuristics to detect CPU bugs too: https://github.com/rust-minidump/rust-minidump/blob/26cd5dd6eb0ddfb29bcb7ed2d8c75cd8e6368203/minidump-processor/src/processor.rs#L838
And an entire collection of bugs that we know happen only on Raptor Lake (we stopped counting at some point as our automatic system was filing so many of them): https://bugzilla.mozilla.org/show_bug.cgi?id=1975808
@gabrielesvelto
(Until I was able to get new hardware I at some point had a crypto miner running in the background as the load it caused made the system stable. And that alone made me pull out my hair as it didn't appear to make any sense)
@gabrielesvelto I don't have any numbers on this, but anecdotally, this seems to match with the experience from btrfs. The filesystem does a lot of consistency checking, and is very good at detecting corruption from bad hardware. We see a lot of problem reports on IRC where we can confidently identify it as bad hardware.
It's not always the RAM. Sometimes it's the power supply, or power regulation on the motherboard: the "bad bits" move around in those cases.
And then there's the disks...
You don't have a standalone version of all of that tooling that just outputs a text file with "check xyz: Passed/Failed"?
That would be quite useful for a lot of cases actually.
@agowa338 yes, you can build a tool called `minidump-stackwalk` from that codebase. It does the analysis (among other things) but you need a minidump taken from a crash report first to analyze (other applications generate those: Chrome, Steam, Unity, Windows Error Reporting, etc...). This tool works with all of them.
A Firefox crash report I assume? Or would any minidump from a crash of any application work?
Also do you know how many of the checks would work without one (or be able to take one itself)?
@agowa338 any crash report that uses minidump, lots of Windows applications do. If you're experiencing frequent instability on your system you could enable local dumping by WER and then analyze the minidumps it takes yourself:
https://learn.microsoft.com/en-us/windows/win32/wer/collecting-user-mode-dumps
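Per that Microsoft page, enabling local dumps comes down to a few registry values under the `LocalDumps` key. A minimal sketch as a .reg file (DumpType 1 requests minidumps, which is the format `minidump-stackwalk` consumes; dumps land in `%LOCALAPPDATA%\CrashDumps` by default):

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps]
"DumpCount"=dword:0000000a
"DumpType"=dword:00000001
```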
@vathpela How would that even work? Any memory allocated up until the point where you switch the setting will have to remain in place, negating the benefits of randomized allocation for everything that starts early. And both allocators would have to work with the same in-memory format, which may or may not be possible?
@guenther @vbabka @oleksandr @gabrielesvelto right, like I said there are always difficulties and trade-offs. You might have to flip the switch and then re-start tasks or kexec or other things, or something else (who knows, someone would have to design it).
scan the physical memory associated to a process if it crashes. It could be done probabilistically like we do so as not to do it too much.
Oh wow, I hope processes are not crashing so much that there's only enough resources to sometimes check their hot memory for hardware errors!
@datum @gabrielesvelto It doesn't need to be a frequent-crash thing. If it's a lazy stochastic approach, the idea is basically a self-healing system.
@gabrielesvelto @roytam1 It's the other way around: DDR5 has on-chip ECC, but no ECC lines between the RAM chips and the memory controller. That's why "real" ECC needs compatible sticks, motherboard *and* CPU.
@SamantazFox @roytam1 you're right, I mixed it up with CRC on transfers
@vathpela @vbabka @oleksandr @gabrielesvelto @guenther sounds like a stupid comment. A running kernel has already completed the majority of memory allocations it will ever need, so toggling such an option by then would have no effect. Unless you want such toggling to force a complete realloc of all kernel memory, which would be even more stupid.
@hyc And while you are solving this, make it possible to page out kernel memory :D.
@gabrielesvelto @pidgin dang, we've been actively trying to not pull rust into our already overly complicated builds, especially on windows :-/
@gabrielesvelto @rygorous I've been down this road! I worked on the installer for the original World of Warcraft. It decompressed and recompressed assets on the fly during installation. This was a serious problem for us! It turns out that running Bzip2 on a machine for an hour is a real stress-test.
@gabrielesvelto @rygorous It was surprising how much user pushback we would get, as well. "I've run this machine overclocked for a year and it's totally stable" etc.
@stilescrisis @gabrielesvelto Yeah it's the worst.
@rygorous @stilescrisis I've had crash reports with comments like "Firefox is crashing all the time, but yesterday Baldur's Gate 3 also crashed, maybe it's related?"
@gabrielesvelto off topic, and I’m very sorry, but I love Firefox and don’t know how else to reach out. I get a lot of crashes when I run a shared Firefox binary that I use from three accounts on macOS. I have configured them so only my main account should be updating, but even so it generally crashes on shutdown.
How can I report this?
@mauvedeity I'm literally in charge of Firefox stability so you couldn't have chosen a better contact. Does the crash reporter show up and did you submit the crashes? If yes navigate to about:crashes and check a few of the links there. Send them to me and I'll figure out which bug they belong to
@gabrielesvelto @rygorous @stilescrisis meanwhile, manufacturers: you need to enable the xmp overclock profile to set the ram to the 1 morbillion giggahurts that was on the box
@gabrielesvelto neat!
Could you track just the number of crashes/etc across longer time frames without any PII? I bet there could be lots of interesting patterns in there
@kyhwana yes, we can do that with crash telemetry though it has significant limitations as it's purged of anything that's even remotely PII
@hyc @vbabka @oleksandr @gabrielesvelto @guenther you have completely missed the point and decided to do some name calling. Classy.
@vathpela @vbabka @oleksandr @gabrielesvelto @guenther your point has no merit.
@hyc @vathpela @oleksandr @gabrielesvelto @guenther incorrect, it has mine.
@vbabka @vathpela @oleksandr @gabrielesvelto @guenther eh? Your earlier comment is in direct opposition to his.
@gabrielesvelto Like at a minimum have it date >> ~/.mozilla/firefox/memtest_fail or something, come on!
@purpleidea well, yes, we could definitely do that, or store it so that it can be reached from about:support
@gabrielesvelto How many of these could be transient bitflips? I remember a blog post about solar radiation causing problems... somewhere...
@nowwhat I can't be sure but I believe not many. The biggest correlation I have in this data is with machine age, which suggests that it might be mostly memory failing over time
@gabrielesvelto @agowa338 are the raptor lake crashes the known big/little cache /tlb bugs that got fixed later?
@erikarn no, only a handful seem to have been fixed. The bulk of those crashes are people suffering from the voltage issues and the permanent damage it may cause
@gaikokuking I'm thinking about how to do that. The biggest issue is that we don't want it to look like the many scams pretending there's something wrong with your machine
@gabrielesvelto is there a plan to show this result to the user? Might be useful.
@anselmschueler yes, we have to figure out how without looking like an internet scam (your machine is borked, click here to fix it!)
@storagenerd it is similar to actual test programs such as memtest86+ in that it attempts a few different tests to root out problems, but it's much, much simpler, only runs for 3 seconds at most and doesn't scan more than 1 GiB of memory. See the source here: https://searchfox.org/firefox-main/source/toolkit/crashreporter/client/app/src/memory_test.rs
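The shape of one pass is simple enough to sketch. This is a toy illustration of the write-pattern-then-verify idea, not the actual memory_test.rs (which runs several patterns under a time budget and includes address tests); the function name is mine:

```rust
/// Fill the buffer with `pattern`, read it back, and report every byte
/// that didn't hold the value: (offset, mask of flipped bits).
fn pattern_test(buf: &mut [u8], pattern: u8) -> Vec<(usize, u8)> {
    for byte in buf.iter_mut() {
        *byte = pattern;
    }
    buf.iter()
        .enumerate()
        .filter(|&(_, &b)| b != pattern)
        .map(|(i, &b)| (i, b ^ pattern))
        .collect()
}
```

Running it twice with complementary patterns (0x55 then 0xAA) exercises every bit in both directions, which is why even a 3-second test over a modest buffer can catch stuck or weak cells.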
@highvoltage and at that time you could still buy memory with parity! There was still an expectation that client systems would have the option of having at least a check that everything was fine
@gabrielesvelto Fun fact: Since Firefox v146 was released recently , I have experienced regular browser tab crashes which I hardly, if ever, had before v146. It has marginally improved in v147 and v148 but in v148.0.2 I just had two tab crashes in 10 minutes.
I have a self-built PC from 2024 with 32GB of RAM, which should be plenty and modern enough.
Interestingly, the LibreWolf fork, which is on v148.0.1 currently and which I have configured 100% the same, does not crash tabs. 1/x
@gabrielesvelto Of course N=1 but I would suspect the culprit isn't my system or its RAM, but Firefox itself.
The fact that a fork that's stripped of the AI bloat that Mozilla vehemently wants to shoe-horn into Firefox has no stability problems whatsoever, makes me suspect that there's something else causing the problems than 'bit flips' or any other reason you're trying to blame for Firefox's tab crash problems. 2/x
@ElBeeToots LibreWolf completely disables crash reporting, so in many cases the browser will just silently restart crashed processes. That being said if you send me the links to the crash reports we can figure out what's wrong with it.