A few years ago I designed a way to detect bit-flips in Firefox crash reports, and last year we deployed an actual memory tester that runs on user machines after the browser crashes. Today I was looking at the data coming out of these tests and I'm now 100% positive that the heuristic is sound: a lot of the crashes we see come from users with bad memory or similarly flaky hardware. Here are a few numbers to give you an idea of how large the problem is. 🧵 1/5
In the last week we received ~470,000 crash reports. These don't represent all crashes, because it's an opt-in system; the real number of crashes will be several times larger. Still, ~25,000 of these were detected as having a potential bit-flip. That's one crash in every twenty potentially caused by bad or flaky memory, which is huge! And because the heuristic is conservative, we're underestimating the real number; it's probably at least twice as high. 2/5
In other words, up to 10% of all the crashes Firefox users see are not software bugs: they're caused by hardware defects! If I subtract crashes caused by resource exhaustion (such as out-of-memory crashes) this number goes up to around 15%. This is a bit skewed, because users with flaky hardware will crash more often than users with functioning machines, but even then it dwarfs all the previous estimates I've seen regarding this problem. 3/5
And to reinforce this estimate, I've looked at the numbers we got from the users who ran the memory tester after experiencing a crash: for every two crashes we think are caused by a bit-flip, the memory tester found one genuine hardware issue. Keep in mind that this is not an extensive test of all the machine's RAM; it only checks up to 1 GiB of memory and runs for no longer than 3 seconds... and it has found lots of real issues! 4/5
And for the record, I'm looking at this mostly on computers and phones, but this affects *every* device: routers, printers, you name it. That fancy ARM-based MacBook with RAM soldered on the CPU package? We've got plenty of crashes from those, and good luck replacing that RAM without super-specialized equipment and an extraordinarily talented technician doing the job. 5/5
@gabrielesvelto is it lots of different devices, each one experiencing rare crashes at random, or is there a small number of really shitty computers accounting for a large share of the crashes?
@gabrielesvelto and what is the ratio of people who ever get a (bit-flip) crash out of all those who opted into this telemetry?
@gabrielesvelto hopefully those MacBooks could run Linux with the badram option.
@guenther I can't answer that question directly because crash reports have been designed so that they can't be tracked down to a single user. I could crunch the data to find the ones that are likely coming from the same machine, but it would require a bit of effort and it would still only be a rough estimate.
@guenther generally speaking a single machine won't send a lot of crashes. It's very common that they only have one bad bit across their whole installed RAM. They'll hit it eventually, especially if it's in the lower address ranges, but not all of the time. And for it to cause a crash, some important data needs to end up there, like a pointer or an instruction.
@gabrielesvelto As a personal anecdote, I built a PC in 2017 with 16 GB of DDR4 RAM that I got from Amazon (Germany). I had to return it after extensive testing with Passmark's free version of memtest86: it had failing bits. The replacement did pass the heavy testing. If there was one thing I wanted that PC to be, it was stable and reliable.
A few years later I got a second 16 GB kit to expand it to 32 GB. I had to return that kit as well; it also had errors. The replacement again passed the extensive testing. This is still the PC I'm writing from now, in fact.
Manufacturers and their QA teams must be aware of their failure rates, but they likely don't act on them, to save costs and make higher profits. They still sell kits with some failures, because not many users subject their PCs/RAM to the torture of these long RAM tests (4 full passes or more, for sanity's sake, takes hours), and crashing here and there with normal usage is almost considered "normal" to some extent, unfortunately. From my experience, the "RAM Test" offered by Windows was an absolute joke: it never found anything on kits where memtest86 would find failures in about one of every two runs.
I remember watching a YouTuber testing a gaming build he had just put together: he ran Prime95 for only a few minutes, the computer didn't crash, and in his view that was good enough for a gaming PC. I happen to disagree, particularly because in that run, even though Prime95 didn't crash, it showed calculation-error warnings, which could well have been caused by RAM issues. Any calculation error from Prime95 is a serious hardware stability/reliability red flag, just like any finding from memtest86.
It is a failure of the industry that ECC RAM is still not standard, at least for PCs, laptops, and cellphones. Maybe it should be standard for all consumer electronics, in fact.
@raulinbonn yes, both hardware and big software vendors have handwaved this problem away for years by claiming that software bugs are more common. In my testing, hardware issues are common enough that they often drown out the software issues.
👾
@gabrielesvelto I did not know that bit flips refer to reproducible bad RAM issues... I thought they were random...
@adingbatponder people and research have usually focused on random bit-flips caused by high-energy radiation and similar phenomena. Actual RAM going bad is a poorly documented and researched problem, mostly because the industry doesn't care. This is a more extensive thread on the issue: https://fosstodon.org/@gabrielesvelto/112407741329145666
@gabrielesvelto How can I inspect this data on my local machine? Because I am suspecting I have a bit-flippy bit of memory :-/
@derickr check out the tools I linked to in this post: https://fosstodon.org/@gabrielesvelto/112407745077972912
@gabrielesvelto @adingbatponder At the end of that thread: " I'd also like to point out that we've got preliminary data on the topic, but I fully intend to write a proper article with a detailed analysis of the data. 17/17"
Was that article published, or is it approaching publication? I'd be very interested.
@raulinbonn @adingbatponder I never had the time to write it, it's on my TODO list for this year
@gabrielesvelto i have found every one of your discussions of this topic immensely fascinating and have been able to revise many assumptions i had about the cpu and memory system. i want to additionally commend you for both identifying that more invasive telemetry could have been useful and then making it unequivocal that it's always opt-in and still anonymized on top of that. i have had to push back very strongly on this sort of thing before and it takes my breath away to find someone else with extremely high standards for measurement work and user safety
@gabrielesvelto My mind was going to cheap low-end hardware, but now that you're throwing expensive Apple Silicon SoCs into the mix, it's a bit harder to believe that they suffer from bit-flips at the rates you are implying.
@stevenodb @gabrielesvelto low-end hardware might sometimes even be less likely to hit this because it's not even trying to be super fast. High-end hardware chasing the fastest speeds is pushing the limits of stability all the time.
@gabrielesvelto I work on a phone app with a very large install base and a decent crash rate. The crash reporter is just a sea of single-instance crashes. Crashes we see once, ever, and then never again.
During new version rollouts I've introduced the rule that "one crash is zero crashes". That is, don't even think about investigating a new crash in a new release until there are at least two of them.
I've always assumed that these are bit flips, so really good to see some evidence!
@WAHa_06x36 yes. Keep in mind that in our case I worked on some of this stuff because our signal-to-noise ratio in crash reports was getting awfully low, with low-volume valid crashes being swamped by bad hardware.
@gabrielesvelto so if you halved Firefox’s memory footprint you’ll reduce crashes by at least 7%?
@bnut not really, out-of-memory crashes are bizarre unintuitive beasts. For starters they only ever happen on Windows, never on macOS and very rarely on Linux. When they happen it's not because the user actually ran out of memory, it's because they ran out of commit-space. See a discussion on it and how we reduced OOM crashes by some ~80% a few years ago in this article: https://hacks.mozilla.org/2022/11/improving-firefox-stability-with-this-one-weird-trick/
@gabrielesvelto On Linux at least it is possible for the kernel to quarantine the physical pages containing bad bits. At a previous job we used this to remotely repair expensive appliances that would have required an onsite technician to swap out the whole unit.
It’s not widely used, and pinpointing the bad pages isn’t easy (or possible from userspace, afaik). But maybe now that DRAM is expensive again it could be improved.
@ericseppanen I would argue for widespread use of ECC memory. The price of doing it in hardware would be small, certainly smaller than the damage caused by bad memory. But even if one doesn't want to pay it there's always the possibility of using inline ECC in modern SoCs that have both an integrated memory controller and caches: https://software-dl.ti.com/jacinto7/esd/processor-sdk-rtos-jacinto7/latest/exports/docs/psdk_rtos/docs/user_guide/developer_notes_ddr_inline_ecc.html
@stevenodb high-end hardware pushes the limits of semiconductors both because of small feature size and high clocks increasing the chances of failure. In the case of DRAM the chance of malfunction increases with higher temperature too, as the ability of trench/stacked capacitors to retain charge degrades with it... and Apple puts its DRAM right next to the CPU, the single hottest place of the whole device.
@gabrielesvelto this makes me think
but of those the most fascinating outcomes to me are that
@datum yes, absolutely. Coincidentally, the bulk of Firefox code was compiled for size, not speed, by default, as smaller code proved faster in such a large codebase. Nowadays it's a complex PGO/LTO dance, but the focus on a small executable footprint has remained.
> And crashing here and there with normal usage is almost considered "normal" to some extent, unfortunately.
Are these people aware that a bit flip in some file system code could nuke somebody's hard drive?
@argv_minus_one @raulinbonn yes, the worst outcome of a bit-flip is when data that will be written to disk happens to overlap it, because the corruption then makes it all the way to the drive. And BTW this is one of the reasons why competent filesystems should always implement checksums for both data and metadata: it increases the chances of detecting these issues early, before they do permanent damage.
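To make the checksum point concrete, here's a toy sketch in Rust. The Fletcher-style sum below is just a compact stand-in for the stronger hashes real filesystems use (e.g. CRC32C in Btrfs, SHA-256 in ZFS): a single flipped bit makes the stored and recomputed checksums disagree, so the filesystem can refuse to trust the block instead of silently persisting garbage.

```rust
// Toy Fletcher-style checksum; real filesystems use CRC32C, xxHash, etc.
fn checksum(data: &[u8]) -> u32 {
    let (mut a, mut b) = (0u32, 0u32);
    for &byte in data {
        a = (a + byte as u32) % 65535;
        b = (b + a) % 65535;
    }
    (b << 16) | a
}

fn main() {
    let mut block = vec![0x42u8; 4096];
    let stored = checksum(&block); // computed when the block was first written
    block[1000] ^= 0x08;           // a single bit flips in RAM before the write
    // Recomputing at write (or read) time exposes the corruption.
    assert_ne!(checksum(&block), stored);
}
```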
@gabrielesvelto @adingbatponder I had a coworker whose PhD was analyzing bit errors, and who concluded that running without at least ECC RAM, particularly in a data-center setting, was madness. But it had serious repercussions for users doing data analysis on their (non-ECC) desktops for research.
@trouble @adingbatponder yes, at the datacenter level the amount of errors you get is enormous. SECDED ECC doesn't cut it there anymore so usually more robust detection/correction systems are used.
@gabrielesvelto very cool! is the logic open source somewhere you can link, or [even more lazily 😇] is the heuristic easy for you to summarize in broad strokes?
@natevw absolutely! The logic is very simple in principle and described here: https://bugzilla.mozilla.org/show_bug.cgi?id=1738651#c0
We've since tweaked it: we now rely on disassembling the executing instruction to more accurately calculate the real address to test for bit-flips. You can find the code here, in our crash analysis tooling:
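For anyone who doesn't want to click through, the core idea fits in a few lines of Rust. This is a hedged reconstruction of the heuristic as described in the bug, not the actual rust-minidump code; `mapped_ranges` is a made-up stand-in for the memory map recovered from the minidump.

```rust
// Sketch: a crash looks like a bit-flip if the faulting address is
// invalid, but flipping exactly one of its bits yields an address that
// *was* validly mapped in the crashed process.
fn looks_like_bit_flip(crash_addr: u64, mapped_ranges: &[(u64, u64)]) -> bool {
    let is_mapped =
        |addr: u64| mapped_ranges.iter().any(|&(start, end)| addr >= start && addr < end);
    if is_mapped(crash_addr) {
        return false; // the address was valid, no flip needed to explain the crash
    }
    // Try flipping each of the 64 bits in turn.
    (0..64).any(|bit| is_mapped(crash_addr ^ (1u64 << bit)))
}

fn main() {
    // A hypothetical module mapped at [0x10000, 0x20000).
    let ranges = [(0x10000u64, 0x20000u64)];
    // A pointer into it whose bit 40 got flipped becomes a wild address...
    assert!(looks_like_bit_flip(0x12345 ^ (1 << 40), &ranges));
    // ...while an address far from anything mapped is not flagged.
    assert!(!looks_like_bit_flip(0xdead_beef_0000, &ranges));
}
```

The conservatism mentioned earlier in the thread falls out naturally: flips that land in data rather than in a pointer or instruction never produce an invalid address, so they're never counted.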
@gabrielesvelto holy shit, so what, you're just writing a gig of data and reading it back? and that works well enough?!
@dysfun yes, believe me when I tell you that I was as surprised as you are. We try to do a few different tests (like writing different patterns before reading them back) but in a nutshell, that's it. Write, read back, check if it matches.
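A minimal sketch of that write/read-back/compare loop, for the curious. This is not Mozilla's actual tester (a real one also has to defeat CPU caching, which this ignores), just the shape of the idea:

```rust
// Fill a buffer with known patterns, read it back, and report the offset
// of the first mismatching byte, if any.
fn test_buffer(buf: &mut [u8]) -> Option<usize> {
    // Alternating patterns exercise both 0->1 and 1->0 transitions.
    for &pattern in &[0x00u8, 0xFF, 0xAA, 0x55] {
        for byte in buf.iter_mut() {
            *byte = pattern;
        }
        for (offset, &byte) in buf.iter().enumerate() {
            if byte != pattern {
                return Some(offset);
            }
        }
    }
    None
}

fn main() {
    // Healthy memory should pass every pattern.
    let mut buf = vec![0u8; 1 << 20]; // 1 MiB; the real test covers up to 1 GiB
    assert_eq!(test_buffer(&mut buf), None);
}
```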
@gabrielesvelto I'm curious, are you considering displaying this information to the user ?
Like, in the crash report window "it looks like your memory might be faulty, you may want to test it [link to a page explaining how to test memory]"
@toadjaune yes, that's something I'm planning to do. However it needs very careful UI/UX design to avoid looking like one of the many, many scams that pretend your machine is borked.
@gabrielesvelto A couple of questions. In a previous post you said "potentially" and "heuristic", but in this post you drop those qualifiers. That's two qualifiers, correct? Also, Firefox should only be using a relatively small amount of system RAM, correct? So if bad RAM was causing the Firefox problem, shouldn't the user statistically be seeing a LOT of bad-RAM issues system-wide?
@CliffsEsport we've had this data for years and it's always been consistent; we even verified it via actual memory testing, so I can safely claim that it is indeed real. Additionally, we know that we're undercounting these defects: we've designed the system to be conservative, so the real numbers are certainly higher, potentially much higher than what the heuristic tells us.
@gabrielesvelto How does these numbers look if we instead count by unique callstacks? It feels intutive that bad memory can crash anywhere but other crashes are restricted to bugs…?
@breakin that's a very good observation: yes, these crashes cluster under some specific bugs. I discovered the problem years ago when my colleagues working on the garbage collector were trying to diagnose crashes that appeared to be impossible, and indeed they were. Because the garbage collector sweeps large amounts of memory containing structured data, it was far more likely to encounter a problem than other code.
@breakin another example of code that frequently hits bad memory is code that manipulates very large collections such as hash tables. Traversing lots and lots of pointers means that you're more likely to find a broken one.
@gabrielesvelto it's real. we saw this at Twitter while chasing down 404s. Someone wrote a large analysis of strange username typos that couldn't be human error when this paper came out: https://web.archive.org/web/20180713212603/http://media.blackhat.com/bh-us-11/Dinaburg/BH_US_11_Dinaburg_Bitsquatting_WP.pdf
@sayrer sifting through any significant amount of data sent by machines without ECC memory reveals lots and lots of those issues: https://blog.mozilla.org/data/2022/04/13/this-week-in-glean-what-flips-your-bit/
@gabrielesvelto so will the next new feature in rust be some kind of memory checksumming? Almost like a virtual ECC
@gabrielesvelto might be hard to do as a compiled language feature, but any interpreter/VM could do it
Any compiler or interpreter could do this. After every store, do a clflush, load, compare, and panic if there's a mismatch.
Problem is this would explode executable size (that's a lot of extra instructions) and ruin performance (that's a lot of extra instructions, and also you're effectively disabling the entire CPU cache).
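The per-store check being described might look roughly like this sketch. The volatile accesses keep the compiler from eliding the round-trip; a real implementation would additionally need a cache-line flush (e.g. clflush) so the read actually reaches DRAM, which is a big part of the cost being discussed.

```rust
use std::ptr;

// Store a value, immediately read it back, and abort on mismatch.
fn checked_store(slot: &mut u64, value: u64) {
    let p: *mut u64 = slot;
    unsafe {
        ptr::write_volatile(p, value);
        let readback = ptr::read_volatile(p);
        assert_eq!(readback, value, "mismatch after store: possible bit-flip");
    }
}

fn main() {
    let mut x = 0u64;
    checked_store(&mut x, 0xdead_beef);
    assert_eq!(x, 0xdead_beef);
}
```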
@argv_minus_one @ShadSterling it can be done at the hardware level even in the absence of the extra physical memory required for implementing traditional SECDED ECC. It's called inline ECC and sacrifices a chunk of your memory plus a marginal amount of performance, but it's already available, especially in the embedded market.
@gabrielesvelto Linus Torvalds recently said that ECC memory is very important to him and that he doesn't understand why it's not used everywhere
@adrianmester yes, we've been beating this drum for decades. There are threads where we discussed this on realworldtech's forums that go back to the '10s
@gabrielesvelto curious how many were misconfigured mem profiles / unstable cpu oc / thermals etc. vs. bad hardware
@brotherpsyche impossible to say without more data. I can tell you I routinely stumble upon crashes that are likely caused by overclocking, but I can't be sure because we don't collect this information.
@CliffsEsport and to answer your second question, that's absolutely what happens on user systems. If they have bad RAM they're seeing issues all over the place, including data corruption.
@gabrielesvelto if you want some field data: I have a PC with 4 memory slots. When populated with 2x16GB cards overclocked to their XMP speed, all memory tests pass. When populated with 4x16GB cards overclocked to their XMP speed (all the exact same model), memory tests fail consistently on a very high-address bit. I had to slightly under-volt them below their XMP speed to get the tests to pass. I would only ever get crashes in games and other heavy-load processes.
@gabrielesvelto my guess is that a lot of people who enable XMP never test their memory and assume it should "just work." My guess is that it's my chipset which isn't good enough to handle all that memory bandwidth, as tests fail on the same address bit regardless of the permutation of RAM boards. So I am not surprised at all that there's plenty of weird memory bugs out in the field, especially for a popular program.
@brunoph if you see the same hardware bit even after moving the DIMMs around, then yes, the problem is likely either in the memory controller, or it arises from interference along the traces leading to the DIMM slots under a specific access pattern
@gabrielesvelto Someone on the Chrome team at Google told me years ago that they can diagnose a user’s machine as having bad RAM from crash logs.
@shac it's very easy if you collect data that identifies the user: users with bad RAM will have plenty of apparently random crashes. In our case we explicitly don't collect identifiable user information, so we need to analyze the crashes themselves. I can tell you with quite a bit of pride that we did this before Google, and better than how they do it.
Checksums will let you detect that there is a problem, but won't actually save your data. If a bit flip causes the file system driver to write to the wrong LBA or corrupt a key file system data structure or something, the damage will still be quite permanent unless you have a backup.
@argv_minus_one @raulinbonn yes, detection can only do so much. Even in our case there are indirect crashes that we simply cannot detect. For example in Rust code a bit-flip will often cause an invariant to be broken, so the code will fail cleanly but the origin of the crash will be lost as it happened before the point where we can detect it.
@gabrielesvelto I'm very interested in implementing this at work, but I'm struggling to find more information. Could you explain how you detect possible memory errors or link to an explanation or code?
@lydiafacts it's really quite simple (which is why it typically undercounts the errors), I've described it here:
https://bugzilla.mozilla.org/show_bug.cgi?id=1738651#c0
The code is here: https://github.com/rust-minidump/rust-minidump/blob/26cd5dd6eb0ddfb29bcb7ed2d8c75cd8e6368203/minidump-processor/src/processor.rs#L1463
This is all under a permissive license and easily embeddable, so by all means make use of these tools if you can. I went to great lengths to make sure this would be useful outside of Mozilla, so it could benefit the whole FOSS ecosystem.
@gabrielesvelto ECC DRAM would improve things, but people have been arguing this for decades and if it hasn't happened yet I doubt there's much you or I can do to change that.
On the other hand, if someone with an interest in systems programming wanted to build an always-on bad DRAM detector, I would encourage them to try.
It would be a nice gift to the world to prevent all those existing computers from going to the landfill over a few bad DRAM pages soldered to the mainboard. It might even enable a secondary market for mostly-good-but-slightly-bad DRAM modules to be bought at a discount.
Here's one possible design of an always-on bad-memory handler:
- A kernel API for allocating physical pages.
- A kernel API to quarantine physical pages.
- A userspace daemon that periodically allocates, tests, and releases (or quarantines) physical pages.
- A heuristic for determining if physical addresses are actually bad (and not the result of system flakiness or software bugs).
- A way to persistently store lists of bad pages, and tooling to maintain those lists.
- A way to load the bad pages list at boot (kernel params exist, but the list could be quite large...)
I'm sure a few of these exist (my knowledge is quite out of date), but to the best of my knowledge nobody has put them all together in a form that could be enabled by default and then mostly left to run on its own forever.
@ericseppanen our design could easily be reproduced at the kernel level, and it would work a lot better there. When a process crashes, scan the physical memory that was allocated to it with a memory tester; if something bad comes up, (permanently) unmap the physical page. The fact that it works well for us in userspace means that in the kernel it would do wonders.
@gabrielesvelto Oh wow, I didn't know it was so prevalent. I got bad ram and I patched it up with linux cmdline stuff. Maybe I should make some kind of GUI to make that process easier for people. I made a little C app that allocates as much as possible and then does some crude ram test with it, and I detected errors without the cmdline and no errors with. The application can't tell me the physical addresses, of course, but it's a good indicator of whether my bad ram was properly skipped.
@starsider yep, the kernel could do this on the user's behalf like we do in userspace. It would work a lot better there too.
@gabrielesvelto @guenther why would the lower address ranges be special? This is confusing to me. I can't imagine firefox runs on any OS without virtual memory (or ASLR), so it doesn't seem like that should correlate strongly with any physical aspect.
@gabrielesvelto By "linux cmdline" I mean the kernel arguments when booting it. Ideally there should be a tool that adds these automatically.
@starsider memtest86 & memtest86+ support emitting this information and you can feed it back into GRUB. It's not completely automated but it's the closest you can get to that
@vathpela @guenther I meant in the lower *physical* address ranges, because it's more likely to be used early on even on a lightly loaded machine. I once had a laptop with a bad bit at the very end of the physical range, I would hit it only when running Firefox OS builds which were massive (basically building Firefox + a good chunk of Android's base system at the same time)
@gabrielesvelto @guenther That still seems really weird to me - why would firefox be likely to get a low physical address? If anything is likely to have a higher chance of getting that memory, I would think it would be the kernel (which has genuine lowmem requirements sometime for e.g. dma bufs and such on some platforms), but a userland process seems odd.
@gabrielesvelto is there a way I can run that test and see the result during a Firefox crash for me? (Even though the most likely causes of my crashes are out-of-memory issues 🤪)
@letoams if the entire browser crashes (not just a tab) there will be a checkbox that's on by default which will run the test. Unfortunately we don't have a way to surface the result for users yet. That being said if you fear you might have bad RAM try testing it with memtest86+: https://www.memtest.org/
@vathpela @guenther oh it's not for Firefox specifically. Users with bad bits in lower address ranges will be more likely to encounter problems with *everything*, including the kernel. I also don't literally mean the *lowest* ranges. Say, if you have a bad bit in the first GiB of physical memory you'll see its effects far more often than if you have it in the last one on a 32 GiB machine
@gabrielesvelto Interesting, thanks! Although, sorry, I meant: if the memory footprint were smaller, wouldn't that reduce the chance of hitting a flipped bit? Although the program bits are likely only a few hundred megabytes, so the crash reduction wouldn't scale linearly. I imagine most of it is going to be images or video, where I guess a flip may cause bad display, not a crash.
@bnut indeed, most of the time bad bits will be completely silent, sometimes they'll show up as data corruption, and if you're very unlucky you'll get a crash, or worse, permanent data corruption because some data that was going to be written to disk landed right on top of them
@gabrielesvelto @vathpela @guenther Linux kernel has CONFIG_SHUFFLE_PAGE_ALLOCATOR to randomize which memory gets allocated first, which generally distros enable, but probably no one activates it by the boot param page_alloc.shuffle=y ;)
@raulinbonn @gabrielesvelto I can concur: I had RAM modules that would fail memtest brand new.
However, if you want some reliable data, I suggest searching for data on server RAM going bad.
Servers have ECC and multiple failover modes (failed addresses get remapped, spare modules, etc.). I'm pretty sure some sort of research exists on how fast RAM chips go bad.
@atis @raulinbonn there have been lots of studies about memory failure in servers but they're not very relevant to what users see because of the different conditions they happen in. Servers run in controlled environments, with cleaner power delivery, controlled temperature, lower clocks and shorter lifespans than client devices. So even the same physical DRAM chips will behave differently between a server and a client device.
@vbabka @gabrielesvelto @guenther all boot params are policy failures :)
@vbabka If it is not enabled by default, then it is not important.
@guenther @gabrielesvelto ah yes, bitflips georg, who lives in a plutonium mine and gets a thousand bit-flip related crashes every day, is an outlier adn should not have been counted
@gabrielesvelto This matches the kind of things I've heard from the Microslop people about their error data.
@gabrielesvelto it certainly seems like modern operating systems should be smart enough to keep track of what memory regions a program had control of during a crash and file them for opportunistic testing later…
@gabrielesvelto Awesome! But do you report visually to the user that they might have bad ram? That would be a good user experience thing to do. Otherwise they will just get angry at Firefox.
@oleksandr @vbabka @gabrielesvelto @guenther no seriously, basing things on boot params should just be considered a bug. It's always a bad choice, usually thought to be necessary because of some other bad choice or trade-off.
@gabrielesvelto silly question, since I am asking way beyond my own knowledge: is there any way to defensively program such that the software won’t crash when bad memory flips a bit? Like, any way to write the code to have a more graceful failure mode?
I realize this might be an astoundingly foolish question, and/or might require a defensive programming paradigm that not only diverges from norms but may be unfamiliar/inaccessible to standard coders.
@davidaugust You can keep 3 copies of everything in memory, and constantly check them against each other, and best two out of three wins... but that takes 3x as much RAM, and browsers are notoriously heavy on RAM use to begin with. Plus, that's usually a thing done by specialized hardware.
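The two-out-of-three voting described above is classic triple modular redundancy; here's an illustrative sketch of it in Rust. The bitwise majority recovers the original value as long as at most one copy is corrupted at any given bit position.

```rust
// Keep three copies of a value and majority-vote on every read.
struct Tmr(u64, u64, u64);

impl Tmr {
    fn new(v: u64) -> Self {
        Tmr(v, v, v)
    }
    // A result bit is 1 iff it's 1 in at least two of the three copies.
    fn read(&self) -> u64 {
        (self.0 & self.1) | (self.1 & self.2) | (self.0 & self.2)
    }
}

fn main() {
    let mut cell = Tmr::new(0xCAFE);
    cell.1 ^= 1 << 3; // a bit flips in one of the three copies
    assert_eq!(cell.read(), 0xCAFE); // the vote recovers the value
}
```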
@psymar that makes good sense, both the technique and the downsides.
@davidaugust @psymar It’s also theoretically possible to design software to be (somewhat) more resilient when memory corruption does happen. But this is not something anyone does. To the contrary, the trend is towards more and more sanity checks that terminate the program when they fail – everything from bounds checks and assertions to PAC and MTE.
Why? Security. Memory corruption can also be the result of software exploits, and in that context, a crash is a good thing as it stops the attack.
@davidaugust @psymar But, for example, there are a lot of known bugs in games for the Nintendo 64, GameCube, and Wii, that crash the game on hardware but cause no problems on emulator. Why? Because (for various reasons) the emulators for those consoles ignore certain error conditions that the hardware detects, such as null pointer accesses and floating point exceptions. Often the game runs fine afterward.
On consoles earlier than that, the hardware didn’t have that error checking either.
@gabrielesvelto Firefox can consume a large amount of memory. I assume that the more RAM programs use (which web browsers do), the more bit flip errors can be expected. Am I right?
@vathpela @oleksandr @gabrielesvelto @guenther it's meant for hardening and there's some performance trade-off, which is typical. But IMHO it's better if hardening options can be enabled just by boot parameters and not require a different distro kernel flavor.
@vbabka @oleksandr @gabrielesvelto @guenther you should be able to turn it on with a running kernel.
@vbabka @oleksandr @gabrielesvelto @guenther IMO it's never about boot time vs compile time, and always about being able to turn it on and transition in to it. Of course that's sometimes the hardest way and why we make bad trade-offs, but it also keeps us from being able to enable a lot of features we want on the boot path.
@vbabka @oleksandr @gabrielesvelto @guenther Obviously I have some bias here, but it's because people want things from booting that command line variability makes intractable.
@vbabka @oleksandr @gabrielesvelto @guenther and other OSes simply do not have this problem at all. It's optional.
@gabrielesvelto Ohh how long back do you have this data for? Is there any correlation with the solar maximum/minimum cycles?
@kyhwana only six months, because we systematically purge older data as it contains private identifiable information. IIRC there is sometimes a correlation, but it's not particularly significant. There is a very strong correlation with machine age, however. I also found other interesting correlations: https://mas.to/@gabrielesvelto/114813152373394985
@gabrielesvelto also, kinda unrelated question, but since you seem quite knowledgeable on this topic :
Consider you just got a passing memtest86+ run. How confident would you be that your memory is, in fact, fine?
@toadjaune I'd be fully confident that the memory works. memtest86+ has an extremely extensive battery of tests that include not only raw memory contents, but also interference patterns, address checks and more. Those can unearth all sorts of problems, from oxidized DIMM contacts to failing memory controllers. I had machines where it could run for 2 full hours and then reliably find an issue at the very tail end of the test, an issue that would otherwise be invisible.
@gabrielesvelto @raulinbonn I mean, it would give some sort of baseline.
As for timespan, I would disagree, a server running 24/7 for 10 years easily beats any office user (40 hours/week) or typical home user (even less)
@atis @raulinbonn in usage hours yes, a server will always beat a client device (well, maybe not an always-on one such as a phone) however there is an extremely strong correlation in our data between machine age and the number of observed failures. Anyway this is an older but still relevant study if you're interested: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35162.pdf
The rates that they reported are way, way lower than what I see.
@mhoye yes! A kernel would be the perfect place to do this: scan the physical memory associated with a process if it crashes. It could be done probabilistically, like we do, so as not to do it too much. If we're able to catch a large number of failures in user space with this little effort, then a kernel would be able to catch the vast majority of actual problems.
@davidaugust no, it's impossible. Errors could happen anywhere, including in the code of the program, in the data you're reading from disk or receive from the network (because it ends up in memory), in the structures used by the kernel, etc... The only solution is hardware error detection and correction.
@purpleidea we want to do that but the UI and UX design is a remarkable challenge. We don't want it to look like the many, many scams on websites that pop-over messages telling you your machine is borked.
@davidaugust @gabrielesvelto Servers normally use ECC memory (Error Correction Code) which detects such bit flips… but Intel decided that consumers don't need that.
@stuartl @gabrielesvelto oh wow. So no longer available?
@davidaugust @gabrielesvelto Never offered to consumers in the first place.
If you want ECC RAM, you buy a server.
@stuartl @davidaugust you can build desktop-class machines with ECC memory by picking the right CPU, the right motherboard and unbuffered ECC UDIMMs. I've been using desktop machines with ECC memory since 2012, the current one - my main workstation - has two 48 GiB unbuffered ECC DDR5 sticks on an ASUS PRIME B650-PLUS and a Ryzen 9700X. Before the RAMpocalypse began ECC UDIMMs were a bit more expensive than regular ones but not so much as to make them inaccessible.
@gabrielesvelto this definitely makes me feel better about my decision to use ECC memory for my desktop I built in 2024. I just wish there were more notebook options with ECC support.
@benpye yes, there are so few of them. And what's maddening is that the rise in soldered RAM means that machines with CPUs that support ECC will never actually use it, because you can't change the RAM (or even order it when new)
@gabrielesvelto can I run the bit-flip tester independently? 🤔
@jj no, but you can use one of the dedicated tools I pointed to in this post, which are a lot more effective than our simple tester: https://fosstodon.org/@gabrielesvelto/112407745077972912
Would there be any way of getting to that memory diagnostic information?
Because I've got an intel 13th gen laptop where firefox crashes constantly, and I'm trying to convince the vendor it's the main board.
@alienghic if you've got a Raptor Lake-based machine check if you've updated the BIOS and got the latest CPU microcode, but even then it could be your CPU acting up. See my old thread on the topic: https://mas.to/@gabrielesvelto/115939583202357863
@gabrielesvelto Is the firefox crash reporting stuff open source and available anywhere? I need to get something similar setup for @pidgin and obviously would love to be able to reuse existing work 😅
@grimmy @pidgin absolutely! We used to rely on Google's projects but reusing them was hard so I've been driving a large effort of replacing them with easy-to-integrate tools. Here's the relevant crates:
https://github.com/rust-minidump/rust-minidump
https://github.com/rust-minidump/minidump-writer
https://github.com/mozilla/dump_syms/
https://github.com/EmbarkStudios/crash-handling
The last one is for integrating stuff into your project. The code we use to integrate this stuff in Firefox is still too tightly bound to our machinery but I plan on moving it out soon.
@gabrielesvelto aren’t some bit flips caused by cosmic rays?
@me yes but they're probably a vanishingly small amount, unless we're talking about computers operating at very high altitudes or in space where those rays become a real problem
@gabrielesvelto It's good to see real numbers for this. I think I once made a guess of 5%-50%, so the magnitude of the numbers isn't surprising to me.
@dbaron yes, if you want to explore our data just check the crash reports that have the `possible_bit_flips_max_confidence` value set: https://crash-stats.mozilla.org/search/?possible_bit_flips_max_confidence=!__null__
As you'll notice the bulk of the crash signatures happen in code that touches a lot of memory such as the GC or large hash-tables. Unfortunately I don't have good numbers for Rust code because most of the crashes there will happen in a controlled way and I haven't figured out a way to detect them reliably.
@stark @raulinbonn I disagree, even simple SECDED ECC would significantly lengthen the lifetime of consumer electronics. It takes a while before a system develops multiple bit failures within the same chunk that cannot be corrected (unless it's a catastrophic failure such as an entire bit/word-line going bust)
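To make the SECDED point concrete, here's a toy sketch of single-error-correct / double-error-detect coding, using a Hamming(7,4) code plus an overall parity bit over one nibble. This is purely illustrative and assumes nothing about real DIMM internals: actual ECC modules apply the same Hamming-plus-parity idea over 64-bit words with 8 check bits (a 72,64 code).

```rust
#[derive(Debug, PartialEq)]
enum Decoded {
    Clean(u8),     // no error detected
    Corrected(u8), // single bit-flip, repaired
    DoubleError,   // two flips: detected but not correctable
}

fn encode(nibble: u8) -> u8 {
    let d = |i: u8| (nibble >> i) & 1;
    // Code word layout (1-indexed positions): p1 p2 d0 p4 d1 d2 d3
    let p1 = d(0) ^ d(1) ^ d(3);
    let p2 = d(0) ^ d(2) ^ d(3);
    let p4 = d(1) ^ d(2) ^ d(3);
    let word = p1
        | (p2 << 1)
        | (d(0) << 2)
        | (p4 << 3)
        | (d(1) << 4)
        | (d(2) << 5)
        | (d(3) << 6);
    // Bit 7 makes the overall parity of the byte even.
    word | (((word.count_ones() as u8) & 1) << 7)
}

fn decode(word: u8) -> Decoded {
    let b = |i: u8| (word >> i) & 1;
    // Each syndrome bit recomputes one parity over the positions it covers;
    // a non-zero syndrome is the 1-indexed position of a single flipped bit.
    let syndrome = (b(0) ^ b(2) ^ b(4) ^ b(6))
        | ((b(1) ^ b(2) ^ b(5) ^ b(6)) << 1)
        | ((b(3) ^ b(4) ^ b(5) ^ b(6)) << 2);
    let parity_ok = word.count_ones() % 2 == 0;
    // Data bits live at positions 3, 5, 6, 7 (bits 2, 4, 5, 6).
    let data = |w: u8| ((w >> 2) & 1) | (((w >> 4) & 0b111) << 1);
    match (syndrome, parity_ok) {
        (0, true) => Decoded::Clean(data(word)),
        (0, false) => Decoded::Corrected(data(word)), // flip hit the parity bit
        (pos, false) => Decoded::Corrected(data(word ^ (1 << (pos - 1)))),
        (_, true) => Decoded::DoubleError,
    }
}
```

Any single flip anywhere in the byte is repaired; any two flips change the parity back but leave a non-zero syndrome, so they're detected rather than silently miscorrected, which is exactly the property that keeps a slowly degrading module usable.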
@gabrielesvelto Do you have technical details about that? Is it a feature in breakpad or is Firefox nowadays using something new to generate the minidumps?
@ePirat we do it on the processing side, this has been in rust-minidump for years which IIRC you already use. Check the 'possible_bit_flips' entry in the schema for the output of `minidump-stackwalk`: https://github.com/rust-minidump/rust-minidump/blob/main/minidump-processor/json-schema.md
As for the memory tester it's part of the crash reporter client: https://searchfox.org/firefox-main/source/toolkit/crashreporter/client/app/src/memory_test.rs
It's still designed around Firefox but you could extract the whole crate (`crashreporter`) and reuse it. I plan on moving it out of our tree and making it reusable soonish.
Do you know why Firefox lost the browser wars?
Because they always blamed everyone else — always.
It was always a browser extension’s fault, a hardware fault, or the user’s fault. Meanwhile, Chromium works — that’s it — it works.
You lose, and worst of all, you don’t even own up to it.
@NetscapeNavigator @gabrielesvelto *totally* didn't have anything to do with the power/money of Google behind Chromium.
it's hella ironic a Netscape fanboy account is encouraging the death of browser diversity.
@NetscapeNavigator @gabrielesvelto also SUPER ironic considering Firefox can be kinda considered a successor to Navigator in many ways
@magicalgrrrl Firefox is not only a successor of Navigator, it's literally the same codebase that evolved into it. A lot of functions in our codebase are still prefixed with `NS_` or `ns` (for Netscape). You can still find traces of when the thing ran on OS/2 or Windows 95 if you search hard enough: https://searchfox.org/firefox-main/rev/22d04b52b0eb8d9fa11bf8ede5ccc0243a07c5ba/nsprpub/config/rules.mk#80
@roytam1 DDR5 has ECC only over transfers from the DRAM chips to the memory controller, unfortunately it doesn't have ECC covering the memory itself
@gabrielesvelto
Won't this misclassify issues which originally come from (some_struct_t*)(corrupted_address)->flag=value;
?
And a few cases of *bad_addr += 2^n ? (If the bad address happens to point at a pointer)
Or am I misunderstanding the check?
@natevw
@viraptor @natevw we're doing the check on the base address to avoid issues with offsets. We disassemble the crashing instruction, subtract the offset (whether it comes from a literal or another register) then do the computation on the base address. Off-by-one errors to aligned buffers can lead to false positives and that's something I intend to address, but they're not very common https://github.com/rust-minidump/rust-minidump/issues/960
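The comparison at the heart of it is tiny. A simplified sketch (not the actual rust-minidump code, which also does the disassembly, offset recovery and confidence scoring described above): once you have the base address in hand, you check whether it differs from a plausible valid address by exactly one bit.

```rust
/// Returns Some(bit_index) if `observed` could be `plausible` with a
/// single bit flipped, i.e. their XOR has exactly one bit set.
fn single_bit_flip(observed: u64, plausible: u64) -> Option<u32> {
    let diff = observed ^ plausible;
    (diff.count_ones() == 1).then(|| diff.trailing_zeros())
}
```

For example, a crash at an address that matches a mapped region except for one high bit (say bit 40) is a strong bit-flip candidate, while addresses that differ in several bits are not flagged.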
@lispi314 not really, but the XNU kernel has at least some checks for bit-flips that log errors when they're encountered within an executable: https://github.com/apple-oss-distributions/xnu/blob/f6217f891ac0bb64f3d375211650a4c1ff8ca1ea/bsd/kern/kern_exit.c#L1989-L2035
Any possibilities that these bit flips are row hammer type issues? 🙂
@simonzerafa some could be but it's hard to tell. Given there's a very strong correlation between the age of the affected machines and the errors I tend to think that's mostly plain flaky hardware.
@michal users with bad memory will experience general instability. Firefox is a good candidate for stumbling upon them and crashing because it generates large amounts of code (JIT compiling JavaScript) and manipulates large amounts of structured data. The garbage collector is the code most likely to hit them. Generally speaking a user with bad memory will experience random instability and data corruption as time goes by.
@gabrielesvelto I'd love to read a blog post about that analysis method you designed, I'm really curious
@pollyglot it's not at all complicated, check it out https://mas.to/@gabrielesvelto/116172856165097305
@gabrielesvelto cool effort! should go to research, maybe? so far, I’ve seen this treated at scale for infrastructure, but much less for client devices.
@promovicz yes, I'd love to do a proper article on this. Unfortunately I'm too busy working on this machinery with my team and keeping the lights on to find the time for proper research 
@gabrielesvelto can you talk about the heuristics used to detect bit flips?
@joxean it's very simple, check the links here: https://mas.to/@gabrielesvelto/116172856165097305
@gabrielesvelto will I as a user be notified that I maybe should do a memory check?
@Bindestriche the crash reporter client will do it automatically if the entire browser crashes, but it won't tell the user, we don't have proper user interface features for that yet
@gabrielesvelto surely it will have some false positive (as in flags a crash as being from a hardware bit flip when it isn't) rate? For example, a bug might accidentally add some large number of bytes to an address such that it goes out of range of the mapped regions, but in a way that a bit flip can remove 2^n from that address and move it back into a valid range? I think I don't have the intuition to say how often that would happen but some simple stochastic modelling could probably model how likely this all is
@pollyglot yes it can, but it proved to be very rare in practice. This is the only real case that happens with a certain frequency and I'll get around to fixing it soon enough: https://github.com/rust-minidump/rust-minidump/issues/960
Has there been any discussion of a Firefox Labs experiment to gather more information? 🙂
@gabrielesvelto Really cool insight! Now I understand why Linus Torvalds was so angry with Intel making ECC an enterprise feature.. I wonder whether your metric is correlated with solar events
@campfireman not really, but it's strongly correlated with the age of the machine
@campfireman I also found correlations to heat waves for other type of hardware issues: https://mas.to/@gabrielesvelto/114813152373394985
@gabrielesvelto
Were you able to find a way to fingerprint issues like e.g. Intel 13 and 14 gen partially killing itself?
I had an issue where every time the system went into idle state it would cause memory errors and applications to crash with SIGINT and SIGSEGV. That issue at the time was almost impossible to pinpoint for me and I ended up replacing all components until I finally replaced the CPU.
That was in part because load would make the system stable, so all memtests were fine...
@agowa338 we have different type of heuristics to detect CPU bugs too: https://github.com/rust-minidump/rust-minidump/blob/26cd5dd6eb0ddfb29bcb7ed2d8c75cd8e6368203/minidump-processor/src/processor.rs#L838
And an entire collection of bugs that we know happen only on Raptor Lake (we stopped counting at some point as our automatic system was filing so many of them): https://bugzilla.mozilla.org/show_bug.cgi?id=1975808
@gabrielesvelto
(Until I was able to get new hardware I at some point had a crypto miner running in the background as the load it caused made the system stable. And that alone made me pull out my hair as it didn't appear to make any sense)
@gabrielesvelto I don't have any numbers on this, but anecdotally, this seems to match with the experience from btrfs. The filesystem does a lot of consistency checking, and is very good at detecting corruption from bad hardware. We see a lot of problem reports on IRC where we can confidently identify it as bad hardware.
It's not always the RAM. Sometimes it's the power supply, or power regulation on the motherboard: the "bad bits" move around in those cases.
And then there's the disks...
You don't have a standalone version of all of that tooling that just outputs a text file with "check xyz: Passed/Failed"?
That would be quite useful for a lot of cases actually.
@agowa338 yes, you can build a tool called `minidump-stackwalk` from that codebase. It does the analysis (among other things) but you need a minidump taken from a crash report first to analyze (other applications generate those: Chrome, Steam, Unity, Windows Error Reporting, etc...). This tool works with all of them.
A Firefox crash report I assume? Or would any minidump from a crash of any application work?
Also do you know how many of the checks would work without one (or be able to take one itself)?
@agowa338 any crash report that uses minidump, lots of Windows applications do. If you're experiencing frequent instability on your system you could enable local dumping by WER and then analyze the minidumps it takes yourself:
https://learn.microsoft.com/en-us/windows/win32/wer/collecting-user-mode-dumps
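Per that Microsoft page, enabling local dumps comes down to a few registry values under the `LocalDumps` key. A minimal sketch as a .reg file (DumpType 1 requests minidumps, which is the format `minidump-stackwalk` consumes; dumps land in `%LOCALAPPDATA%\CrashDumps` by default):

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps]
"DumpCount"=dword:0000000a
"DumpType"=dword:00000001
```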
@vathpela How would that even work? Any memory allocated up until the point where you switch the setting will have to remain in place, negating the benefits of randomized allocation for everything that starts early. And both allocators would have to work with the same in-memory format, which may or may not be possible?
@guenther @vbabka @oleksandr @gabrielesvelto right, like I said there are always difficulties and trade-offs. You might have to flip the switch and then re-start tasks or kexec or other things, or something else (who knows, someone would have to design it).
scan the physical memory associated to a process if it crashes. It could be done probabilistically like we do so as not to do it too much.
Oh wow, I hope processes are not crashing so much that there's only enough resources to sometimes check their hot memory for hardware errors!
@datum @gabrielesvelto It doesn't need to be a frequent-crash thing. If it's a lazy stochastic approach, the idea is basically a self-healing system.
@gabrielesvelto @roytam1 It's the other way around: DDR5 has on-chip ECC, but no ECC lines between the RAM chips and the memory controller. That's why "real" ECC needs compatible sticks, motherboard *and* CPU.
@SamantazFox @roytam1 you're right, I mixed it up with CRC on transfers
@vathpela @vbabka @oleksandr @gabrielesvelto @guenther sounds like a stupid comment. A running kernel has already completed the majority of memory allocations it will ever need, so toggling such an option by then would have no effect. Unless you want such toggling to force a complete realloc of all kernel memory, which would be even more stupid.
@hyc And while you are solving this, make it possible to page out kernel memory :D.
@gabrielesvelto @pidgin dang, we've been actively trying to not pull rust into our already overly complicated builds, especially on windows :-/
@gabrielesvelto @rygorous I've been down this road! I worked on the installer for the original World of Warcraft. It decompressed and recompressed assets on the fly during installation. This was a serious problem for us! It turns out that running Bzip2 on a machine for an hour is a real stress-test.
@gabrielesvelto @rygorous It was surprising how much user pushback we would get, as well. "I've run this machine overclocked for a year and it's totally stable" etc.
@stilescrisis @gabrielesvelto Yeah it's the worst.
@rygorous @stilescrisis I've had crash reports with comments like "Firefox is crashing all the time, but yesterday Baldur's Gate 3 also crashed, maybe it's related?"
@gabrielesvelto off topic, and I’m very sorry, but I love Firefox and don’t know how else to reach out. I get a lot of crashes when I run a shared Firefox binary that I use from three accounts on macOS. I have configured them so only my main account should be updating, but even so it generally crashes on shutdown.
How can I report this?
@mauvedeity I'm literally in charge of Firefox stability so you couldn't have chosen a better contact. Does the crash reporter show up and did you submit the crashes? If yes navigate to about:crashes and check a few of the links there. Send them to me and I'll figure out which bug they belong to
@gabrielesvelto @rygorous @stilescrisis meanwhile, manufacturers: you need to enable the xmp overclock profile to set the ram to the 1 morbillion giggahurts that was on the box
@gabrielesvelto neat!
Could you track just the number of crashes/etc across longer time frames without any PII? I bet there could be lots of interesting patterns in there
@kyhwana yes, we can do that with crash telemetry though it has significant limitations as it's purged of anything that's even remotely PII
@hyc @vbabka @oleksandr @gabrielesvelto @guenther you have completely missed the point and decided to do some name calling. Classy.
@vathpela @vbabka @oleksandr @gabrielesvelto @guenther your point has no merit.
@hyc @vathpela @oleksandr @gabrielesvelto @guenther incorrect, it has mine.
@vbabka @vathpela @oleksandr @gabrielesvelto @guenther eh? Your earlier comment is in direct opposition to his.
@gabrielesvelto Like at a minimum have it date >> ~/.mozilla/firefox/memtest_fail or something, come on!
@purpleidea well, yes, we could definitely do that, or store it so that it can be reached from about:support
@gabrielesvelto How many of these could be transient bitflips? I remember a blog post about solar radiation causing problems... somewhere...
@nowwhat I can't be sure but I believe not many. The biggest correlation I have in this data is with machine age, which suggests that it might be mostly memory failing over time
@gabrielesvelto @agowa338 are the raptor lake crashes the known big/little cache /tlb bugs that got fixed later?
@erikarn no, only a handful seem to have been fixed. The bulk of those crashes are people suffering from the voltage issues and the permanent damage it may cause
@gaikokuking I'm thinking about how to do that. The biggest issue is that we don't want it to look like the many scams pretending there's something wrong with your machine
@gabrielesvelto is there a plan to show this result to the user? Might be useful.
@anselmschueler yes, we have to figure out how without looking like an internet scam (your machine is borked, click here to fix it!)
@storagenerd it is similar to actual test programs such as memtest86+ in that it attempts a few different tests to root out problems, but it's much, much simpler, only runs for 3 seconds at most and doesn't scan more than 1 GiB of memory. See the source here: https://searchfox.org/firefox-main/source/toolkit/crashreporter/client/app/src/memory_test.rs
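The shape of one pass is simple enough to sketch. This is a toy illustration of the write-pattern-then-verify idea, not the actual memory_test.rs (which runs several patterns under a time budget and includes address tests); the function name is mine:

```rust
/// Fill the buffer with `pattern`, read it back, and report every byte
/// that didn't hold the value: (offset, mask of flipped bits).
fn pattern_test(buf: &mut [u8], pattern: u8) -> Vec<(usize, u8)> {
    for byte in buf.iter_mut() {
        *byte = pattern;
    }
    buf.iter()
        .enumerate()
        .filter(|&(_, &b)| b != pattern)
        .map(|(i, &b)| (i, b ^ pattern))
        .collect()
}
```

Running it twice with complementary patterns (0x55 then 0xAA) exercises every bit in both directions, which is why even a 3-second test over a modest buffer can catch stuck or weak cells.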
@highvoltage and at that time you could still buy memory with parity! There was still an expectation that client systems would have the option of having at least a check that everything was fine
@gabrielesvelto Fun fact: Since Firefox v146 was released recently , I have experienced regular browser tab crashes which I hardly, if ever, had before v146. It has marginally improved in v147 and v148 but in v148.0.2 I just had two tab crashes in 10 minutes.
I have a self-built PC from 2024 with 32GB of RAM, which should be plenty and modern enough.
Interestingly, the LibreWolf fork, which is on v148.0.1 currently and which I have configured 100% the same, does not crash tabs. 1/x
@gabrielesvelto Of course N=1 but I would suspect the culprit isn't my system or its RAM, but Firefox itself.
The fact that a fork that's stripped of the AI bloat that Mozilla vehemently wants to shoe-horn into Firefox has no stability problems whatsoever, makes me suspect that there's something else causing the problems than 'bit flips' or any other reason you're trying to blame for Firefox's tab crash problems. 2/x
@ElBeeToots LibreWolf completely disables crash reporting, so in many cases the browser will just silently restart crashed processes. That being said if you send me the links to the crash reports we can figure out what's wrong with it.