Once upon a time, a supercharged particle named Ray began life commonly enough by being ejected from the south polar region of a black hole. Ray buzzed through space for thousands of millennia, unknowingly headed straight for your house. Ray managed to punch through our Van Allen radiation belts, glided through our atmosphere, passed through the roof of your house, weaved through the drywall and insulation and support beams, penetrated your computer case and the chip resin on one of your RAM boards, and finally bonked against a silicon NAND gate in just the right way. Ray caused a voltage glitch that made the stored electrical charge read as a 1 instead of a 0 to your computer.
Your computer wasn't paying attention to that particular NAND gate on that refresh cycle that Sunday afternoon, and you went about your business with no idea Ray ever existed.
Programmers spend almost all their time planning for disaster.
Disasters come in all sizes, and even simple interface elements that appear to do nearly nothing are burdened with scads of error checking. Say you need the user to enter a price in a text box. The first problem you might account for is length: you need to limit the number of characters the user is allowed to type. Then you need to prevent them from entering characters that aren’t numbers (except for ‘-’ and ‘.’, unless the user is in Europe or Québec, in which case you need to allow ‘,’ instead of ‘.’, unless you want to display thousands separators), then you probably have to check that the price is reasonable, then you need to be sure that when the user hits TAB they get to the next field they expect, etc, etc, etc.
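Those checks pile up fast even in code form. Here’s a minimal sketch of that price validation — the function name, length cap, and “reasonable price” range are all made up for illustration, and a real app would hook this into the text box’s input events rather than validate after the fact:

```python
import re

def parse_price(text, decimal_sep=".", max_len=12):
    """Validate a price string, returning it as a float or raising ValueError.

    decimal_sep lets a caller pass "," for locales that use it.
    """
    # Length check: cap how much the user can type.
    if len(text) > max_len:
        raise ValueError("too many characters")
    # Character check: optional leading '-', digits, at most one separator.
    pattern = r"^-?\d+(?:" + re.escape(decimal_sep) + r"\d+)?$"
    if not re.match(pattern, text):
        raise ValueError("not a valid number")
    value = float(text.replace(decimal_sep, "."))
    # Sanity check: an arbitrary "reasonable price" window.
    if not -1_000_000 < value < 1_000_000:
        raise ValueError("price out of range")
    return value
```

And that still ignores TAB order, thousands separators, paste events, and everything else the paragraph above hints at.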
When you design a logic system, you first discuss all the things that can go wrong, because in terms of design, the pathways that handle things going right are so few that you barely even have to consider them.
So, naturally, the job of a programmer is filled with negativity. You focus on failure. You plan for it, you code for it, and you engineer for many more failure cases than success ones. If you’re doing your job properly, you’ve worked out the correct handling for enough error cases that what you’re building actually functions in a way that is useful to someone.
Several years ago I worked on an archival filesystem project which was mostly about dealing with physical errors in the real world. It occurred to me just how much of the computer sitting in front of me and the software running on it is about errors. All the design and invention on the shoulders of giants is 99% about shit being wack.
This isn’t much different from life in general. The chances of things going wrong in life are so much greater than the chances of things going right – just ask anyone who ever woke up one morning.
This is why an accomplishment like the successful crane-lowering of Curiosity onto the surface of Mars completely blows me away. There were a tiny handful of ways that thing could’ve gone right and an infinite number of ways it could’ve gone wrong. Something like that only goes right because the minds behind it were good. Very, very, very good, sciencey, mathy, obsessive good. The engineering and error handling spirals from large problems like making sure rockets fire at the right time all the way down to the possibility of an error occurring in a microscopic semiconductor circuit on one of Curiosity’s many chips.
I was at the Texas Burlesque Festival in Austin a few years ago, and while the glamorous and mostly naked surrounded us in a buzzing lobby at intermission, I was lost in a conversation with a guy who turned out to be a satellite chip designer. He described what a colossal PAIN IN THE ASS it is to protect computer memory once it's outside earth's precious protective fields. The error correction has to be so robust that performance is abysmal, and you actually want to keep the amount of memory as low as possible just to reduce the chances of errors from particles like our friend Ray.
Wired put up a great article years ago about the computer memory that runs in the BAE RAD750 computers that help run Curiosity. It’s all about supercharged interstellar particles and how they can affect computer memory.
The chances of data loss are never zero, but under certain circumstances the error tolerance for a computer has to be zero. Nuclear reactors, the particle accelerator at CERN, and airborne guidance systems are all things we need to run perfectly. This is why error-correcting codes (ECC) exist. Semiconductor memory with ECC runs slower than non-ECC memory, but the extra time it takes to operate is spent verifying that the data stored is exactly what was supposed to be stored. This performance hit and the fact that particle-induced bit flipping is such a rare occurrence on earth are why you don’t find ECC RAM in the typical home computer.
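The textbook way to see how ECC pulls this off is a Hamming code: a few extra parity bits spread through the data let you not just detect a flipped bit but pinpoint and repair it. Here’s a toy Hamming(7,4) sketch — real ECC RAM uses wider codes implemented in hardware, so this is purely illustrative:

```python
def hamming74_encode(d):
    """Encode 4 data bits (a list of 0/1) into a 7-bit Hamming codeword.

    Parity bits occupy positions 1, 2, 4 (1-indexed); data fills 3, 5, 6, 7.
    """
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(codeword):
    """Return (repaired codeword, 1-indexed position of the flipped bit, or 0).

    The three parity checks combine into a syndrome that spells out
    exactly which position (if any) got hit.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # checks positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # checks positions 2, 3, 6, 7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]  # checks positions 4, 5, 6, 7
    pos = s1 + 2 * s2 + 4 * s3
    if pos:
        c[pos - 1] ^= 1  # flip the bad bit back
    return c, pos
```

Encode four bits, let a stray Ray flip any single one of the seven, and the syndrome points straight at the damage — which is why the hardware can quietly fix these hits without the running program ever noticing.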
Two of the more interesting factors that increase the likelihood of particle-induced error are the size of the chip circuitry, and the altitude it is running at.
The Wired article quotes research scientist Al Geist at Oak Ridge National Laboratory who was studying bit-flipping on their Jaguar supercomputer. He found that the ECC on that machine’s 362 TB of memory was being triggered more than 300 times per second. Geist says “The computer continued to run fine despite this continuous stream of errors because these bit flips were all corrected.” I found this fascinating not only from the point of view of a computer nerd, but also from a “holy shit are there really that many cosmic rays bouncing around down here?” point of view. To my relief, it turns out those ECC triggers are far more likely to be caused by chip degradation due to aging and not a gaggle of Rays bouncing around.
Altitude is an obvious factor. The further beyond earth’s protection you are, the more cosmic radiation is present around you. Even if you’re just flying in a plane, you’re 100 times more likely to experience a bit-flip on your laptop than you would be at home, according to the article. Curiosity’s RAM is way over-engineered (and running full ECC) so that it could survive launch, the trip, Mars atmospheric entry, the landing process, and then operate for years with several times more supercharged particles flying around than you’d find in your living room.
Accounting for cosmic rays in kernel-space filesystem code was an awfully interesting exercise to be part of, and it was hard for a sci-fi geek like me not to stop and wonder how many little Rays were in the room with me and marvel at their inevitability.