I built a whole remote software update mechanism for a control binary that ran on 25k+ servers across multiple data centers.
Rest assured that after the first time I messed it up (which required ssh into each box individually), I wrote a lot of unit and integration tests to make sure that it never failed to deploy again. One of the integration tests ensured that the app started up and could always go through the internal auto update process. This ran in CI and would fail the build if it didn't pass.
While I fully understand that this is hard to get right 100% of the time, a mess up of this level by a car manufacturer is pretty amazing to me.
Rivian is an embedded use case, though, which is not at all like a fleet of servers.
Having worked for companies that produce network devices - including devices that are unreachable for example for 6 months of the year - and on software installation and upgrade, I am baffled how this bricking is possible. For one thing, you generally use some kind of confirmed boot mechanism - you upgrade a standby partition, set an ephemeral boot value that causes device to boot the alternate image, and reboot - only when the image is declared "up" does that get persisted (and then the alternate is upgraded, in order to prevent rollback in the event of a media error). You use watchdogs that are tied to actual forward progress (and not just some demon that the kernel schedules and bangs on the watchdog even if the rest of the system is hung) and if they fail, the WD reboots you. (This is one of the reasons that event driven programming is somewhat preferred - actually processing events from a single dispatch thread makes it easier to reason about the system.)
On top of that, you make sure that the core system is an immutable filesystem so that you can validate the _offline_ alternate image before rebooting (write-and-read-back-uncached) and periodically scrub the alternate image (same).
Like.. this is all embedded 101, stuff people have been widely doing since the mid 1990s and I think I can find examples going back to the 70s. Sometimes you get a little more sophisticated (allow sub-packages or overlays and use a manifest to check the ensemble instead of just a single image), but it's very standard.
Assuming Rivian does know embedded 101, my guess is that the infotainment system is running Android and the watchdog reported all green once the system services all came online and that it doesn't actually check whether the application layer is really working because, as you know, that would require the watchdog to run a full regression suite before giving the okay, which isn’t practical. Since the update swapped the system to an internal dev cert, they cant push an immediate update to change the boot args because the management plane daemon won’t connect to the C&C server, or it can but the blob they push wouldn’t pass signature validation, or the TEE won’t unlock the device keys because the roots changed. Whatever the case, someone has to go blow a fuse and re-flash the thing, or at least rewrite the boot args via serial. Just a guess.
If it is the most likely “management plane TLS certs” issue, I bet the watchdog won’t confirm the new boot args until the command dispatch daemon gets a pong from the C&C server moving forward (:
Hey now, preventing SEVs doesn't lead to impact. If we all collectively let this become a raging dumpster fire we can all heroically fix it and greatly exceed expectations for the half.
Did you just use standard Yocto or similar tools to build such images? Are there standard daemons for managing hardware watchdogs (besides systemd since that's too simple as you say)? I think there's a lot of niche knowledge in the embedded space and many programmers are used to cloud systems and at most target. The most embedded experience most programmers have is likely iOS/Android development where all of the actual embedded concerns are handled for you. Even Google (soft)bricked a bunch of phones with the latest Android 14 update [1].
IMO there's not a lot of regular OSS for building embedded systems that comes with A/B partitioning, watchdogs, secure and verified boot - it's all custom at every org and tailored for individual products.
As somebody currently working at an automaker on software systems, the amazing thing to me is that a mess up of this level doesn’t happen weekly. It’s rough out here.
Thank you. At least you're honest about it, the other day someone was trying real hard to convince me that software developers at automakers are made of magic fairy dust.
Relatively crappy pay, complex toolchains, long build times, layer upon layer of (really bad) legacy code, badly specified (if they're specified) protocols between subsystems, subsystems that are completely opaque (no source code provided), homegrown OS's or older RTOS's, subset-of-C to keep it safe(r), tricky debugging environments and if you're really unlucky anemic hardware.
I hope I didn't miss anything but I wouldn't be surprised if I did.
Yeah, I think you missed something. The "software architects, heavy enterprisey tooling, and minions" approach to development where some of the architects could be good developers, but they don't develop, and the minions are often not that good and also not given any autonomy, so they are in a state of learned helplessness and just do what they're told without much thinking or initiative. It results in over-abstracted, over-complicated, slow, unreliable, and sometimes just stupid code.
Automotive varies widely between "basically modern Linux systems with proper updates" and the most janky, home-grown update systems imaginable, sometimes even within the same components and teams.
Yah, I know from friends at ford and vw that there's still vxworks and qnx, but even there, good grief, a-b with confirmed boot is about as basic as you can get.
I confess I've seen incredible sloppiness about when a confirmation is done (too early, including in the initial init stages which is way too soon) and watchdogs (spawn off a process that has a while loop stroking the wd - just absolutely pointless).
I've heard all of the above, often "stroking". I never used those because I like systems where you have a random challenge code to respond to. Then the software has to not be acting as wonky to react correctly.
Indeed, a good RTOS from 10-20 years ago works just as good now as it did back then. The only things that change are dev environments and the driver support.
> This ran in CI and would fail the build if it didn't pass.
I don't mean to be pedantic, but since we're talking about what should happen instead, this is insufficient. It works until the day you realize you made some kind of manual change to your CI infra, or that CI has some non-standard configuration that makes it work for you but not some significant fraction of the fleet.
People should do what you described in CI, but as well as that, you need phased rollout, where e.g. the build can only be rolled out to the next percentage point of randomly selected users in a specific segment (e.g. each hardware revision and country as independent segments) after meeting a ratio of successful check-ins, in the field, from the new build by production customers in that segment. That's the actual metric for proceeding with the rollout: actual customers are successfully checking in from the new version of the software.
Except, that's actually not sufficient either. What if the new build is good, but it contains an update to the updater which bricks the updater? Now you're getting successful check-ins from the new version in the field, but none of those customers will ever successfully auto-update again. So, test the new updater's ability to go forwards successfully, too.
A good way to handle the who-updates-the-updater issue is to use a triple partition updater. A updates B, and then B updates C, then C updates A. If anything about the new version prevents it from properly updating its neighbor, that neighbor won't be able to close the loop, and you'll fall back to A. This simplifies the FSBL, because it just boots the three partitions in a loop, no failure detection required. You don't need to triplicate the full application either, just the minimum system needed to perform an update, and then have the "application" in it's own partition to be called by the updater.
> It works until the day you realize you made some kind of manual change to your CI infra, or that CI has some non-standard configuration that makes it work for you but not some significant fraction of the fleet.
Nah, my CI process was solid. This was proven in the field over the course of years.
> I don't mean to be pedantic... you need phased rollout
You don't need to be pedantic, but better to ask the question rather than assume that was all that I did. =) You have to realize that what I built, worked flawlessly. It wasn't easy either, took a lot of trial and error.
I did have a CIDR based rollout. I could specify down to the individual box that it would run a specific version. Or I could write "latest" to always keep certain boxes running on the latest build. This was another part of my testing, but ended up not being fully necessary because I had enough automated testing in CI that "latest" always worked.
> but it contains an update to the updater which bricks the updater?
This happened, so I wrote a lot of test code to make sure that would never happen again. My CI would catch that since I was E2E testing that it could actually run the upgrade process.
Once I implemented all of this, I never had a single failure and would routinely, several times a day, deploy to the entire cluster, over the course of a couple years.
It was all eventually consistent as I could also control the "check for update" frequency as well.
I think there's a minor confusion here, where you think the purpose of my response involves doubting whether your system was successful. I understand it was successful. My response is to the sense in which your comment can be interpreted as advice to other people on what they should build.
I think the fact that you were able to survive with CI-only doesn't mean that we should encourage others to skip implementing a phased rollout based on verified customer successes, including testing of their new updaters before the first time they accidentally brick all the updaters, rather than afterwards. That's what I was hoping to help avoid, through my comment.
Interesting. At any point in time, I had errors from hardware, software and networking. Even the racks would be getting overwhelmed at certain times. Simply being able to ssh into every host wasn't guaranteed. I'm not sure how you did it.
+1 to this, we have a 0.1% hardware failure rate every time we do a rolling restart (40-50k nodes). Some just never come back, in the best case, but actively misbehave in the worst. If the node is unresponsive we remove it from the cluster and fix it async.
If the daemon was running, it would ping a central server on a schedule and report its status, the response from the server was if there was a new version available (with the binary in the response), or not. This combined ping/update service really cut down on the overall traffic, and failures.
If the machine had crashed, it would start up, start my daemon, and that daemon would start the ping/update process all over again.
A large portion of the machines were iPXE booted... so, just reboot was one option and it would all start from scratch again.
Yes, some of the boxes had flaky power supplies or would fail an ssd, and that would cause a technician to go out and manually fix things.
I found it was critical to think of everything as eventually consistent because my hardware was boxes with 12 GPUs and they were flaky and would crash the whole box randomly. I got used to boxes rebooting hundreds of times. My process would also auto-tune the GPU for stability too, changing clock/power settings until the individual cards would become stable and stop the crashing.
The only time I had problems was when the daemon was dead. I had a dashboard where I could see which machines hadn't reported their status. It was easy to pick those off by hand.
I used to wear your shoes in IE6/7 ages (no longer, I gave up during the "framework of the week" race and went all-backend), and it wasn't simple at all. Browser compatibility with all their rendering nuances, individual system oddities and all sort of fragile stuff.
And fortunately, no one bats an eye at a slightly broken site, but everyone hates even a slightly broken vehicle.
It's simple because we tolerate certain limitations in the web platform.
If you had a hard requirement that a page load could never take more than 100ms, regardless of network conditions, you'd have quite a challenge on your hands.
I'm not really a frontend dev any more but I was for a long time. I can assure you that the only reason you think your code works is because no one tells you it's broken. If you use an error logging or telemetry service (Sentry, Rollbar, New Relic, etc) you will be aware that errors happen in frontend code all the time. It's just that most of the time bugs don't crash the app, and the user doesn't know what to expect so they see a broken feature and think it's meant to be like that.
> While I fully understand that this is hard to get right 100% of the time, a mess up of this level by a car manufacturer is pretty amazing to me.
I feel like it's going to happen to someone that makes network devices eventually. I'm always scared to update my (several hundred) UniFi devices. Their update process isn't foolproof and they push auto-updates via the UI pretty hard.
Several years ago they caused some people's devices to disconnect from the management controller when they enabled 'https' communication. Prior to that, if you were pointing devices at 'https://example.com:8080...' they would ignore the 'https' part and do an 'http' request to port '8080'. Then they pushed their 'https' update which expected an 'https' connection and didn't fall back to the old behavior for anyone that was mistakenly using 'https' in their URL initially. Some people on their forums complained about having to manually SSH to every device to fix the issue.
It was caused by an end-user mistake, but they knew it was a potential issue. AFAIK, their attitude on it hasn't changed and a lot and at the time their response was that they knew it would break some people, but that it wouldn't be that many (lol).
IMO, the issue with those systems is that basic communication back to the update / config server is part of the total package which is too complex (ie: a full Debian install). I'd rather see something like Mender (mender.io) where the core communications / updates come from a hardened system with watchdog, recovery, rollback logic.
Think of how crazy it is to have something like pfSense doing package based updates rather than slice based updates. At least with boot environments they could add some watchdog and rollback type logic, but it'll still be part of the total system instead of something like a hardened slice based setup where the most critical logic is isolated from everything else and treated like a princess.
Do you have any insight on package vs slice based systems for updates? Did you isolate update logic from the rest of the system or am I out of touch with that opinion?
When possible, I used a fail back mechanism. If the update failed to fully come up, then the watchdog timer would catch it, the bootloader would notice the incomplete boot, and attempt to boot from the previous known working image in that case.
out of morbid curiosity.... how long did it take to ssh into and fix all of those servers? I imagine even automating a fix (if possible) would still take a good amount of time.
The way I built my app was that I could install it cleanly via a curl | bash.
So, I just had a simple shell script that iterated through the list of IP addresses (from the DHCP leases), ran curl | bash and that cleaned up the mess pretty quickly.
As a non-developer, the whole situation with a bad software update to the Voyager spacecraft really puts things into perspective as far as how bad remote updates can be.
It’s also a testament to the way that the system was designed that they were able to get it back online.
Rest assured that after the first time I messed it up (which required ssh into each box individually), I wrote a lot of unit and integration tests to make sure that it never failed to deploy again. One of the integration tests ensured that the app started up and could always go through the internal auto update process. This ran in CI and would fail the build if it didn't pass.
While I fully understand that this is hard to get right 100% of the time, a mess up of this level by a car manufacturer is pretty amazing to me.