I built a whole remote software update mechanism for a control binary that ran o...

foobiekr · on Nov 15, 2023

Rivian is an embedded use case, though, which is not at all like a fleet of servers.

Having worked for companies that produce network devices - including devices that are unreachable for example for 6 months of the year - and on software installation and upgrade, I am baffled how this bricking is possible. For one thing, you generally use some kind of confirmed boot mechanism - you upgrade a standby partition, set an ephemeral boot value that causes the device to boot the alternate image, and reboot - only when the image is declared "up" does that get persisted (and then the alternate is upgraded, in order to prevent rollback in the event of a media error). You use watchdogs that are tied to actual forward progress (and not just some daemon that the kernel schedules and bangs on the watchdog even if the rest of the system is hung) and if they fail, the WD reboots you. (This is one of the reasons that event driven programming is somewhat preferred - actually processing events from a single dispatch thread makes it easier to reason about the system.)
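As a rough illustration of the confirmed boot flow described above, here is a minimal Python simulation. The `BootState` fields and function names are hypothetical, not taken from any particular bootloader; the point is only that the trial value is consumed on read, so an unconfirmed image falls back by itself on the next reboot.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BootState:
    """Hypothetical bootloader variables (names are illustrative only)."""
    persistent_slot: str = "A"        # slot booted by default
    trial_slot: Optional[str] = None  # one-shot override, cleared when read


def other(slot: str) -> str:
    return "B" if slot == "A" else "A"


def stage_update(state: BootState) -> None:
    """The new image was written to the standby slot; arm a one-shot trial boot."""
    state.trial_slot = other(state.persistent_slot)


def select_slot(state: BootState) -> str:
    """Bootloader step: consume the one-shot value so a failed trial falls back."""
    slot = state.trial_slot or state.persistent_slot
    state.trial_slot = None
    return slot


def confirm(state: BootState, running_slot: str) -> None:
    """Called only once the running system has declared itself 'up'."""
    state.persistent_slot = running_slot
```

If the trial image hangs and the watchdog resets the device before `confirm` runs, the next `select_slot` call returns the old slot again.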

On top of that, you make sure that the core system is an immutable filesystem so that you can validate the _offline_ alternate image before rebooting (write-and-read-back-uncached) and periodically scrub the alternate image (same).
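A minimal sketch of validating or scrubbing the offline image, assuming the expected SHA-256 digest comes from a signed manifest (that manifest, and the function name, are assumptions for illustration). Note that plain Python file reads go through the page cache; a real uncached read-back would need something like `O_DIRECT` or a raw flash read, which this sketch does not attempt.

```python
import hashlib


def image_matches(path: str, expected_sha256: str, chunk_size: int = 1 << 20) -> bool:
    """Re-read the image in chunks and compare its digest to the expected one."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256
```

The same function can serve both for the check before reboot and for the periodic scrub, since the immutable image should hash identically every time.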

Like.. this is all embedded 101, stuff people have been widely doing since the mid 1990s and I think I can find examples going back to the 70s. Sometimes you get a little more sophisticated (allow sub-packages or overlays and use a manifest to check the ensemble instead of just a single image), but it's very standard.

dcow · on Nov 15, 2023

Assuming Rivian does know embedded 101, my guess is that the infotainment system is running Android and the watchdog reported all green once the system services all came online and that it doesn't actually check whether the application layer is really working because, as you know, that would require the watchdog to run a full regression suite before giving the okay, which isn’t practical. Since the update swapped the system to an internal dev cert, they can’t push an immediate update to change the boot args because the management plane daemon won’t connect to the C&C server, or it can but the blob they push wouldn’t pass signature validation, or the TEE won’t unlock the device keys because the roots changed. Whatever the case, someone has to go blow a fuse and re-flash the thing, or at least rewrite the boot args via serial. Just a guess.

If it is the most likely “management plane TLS certs” issue, I bet the watchdog won’t confirm the new boot args until the command dispatch daemon gets a pong from the C&C server moving forward (:

ikiris · on Nov 15, 2023

That sounds out of scope for the MVP. We can worry about redundancies later after we ship.

roland35 · on Nov 15, 2023

Hey now, preventing SEVs doesn't lead to impact. If we all collectively let this become a raging dumpster fire we can all heroically fix it and greatly exceed expectations for the half.

bluGill · on Nov 15, 2023

You too have noticed most great employee rewards go to someone who, if they had done their job well, wouldn't have been noticed

ikiris · on Nov 15, 2023

This guy FANGs

KingMachiavelli · on Nov 15, 2023

Did you just use standard Yocto or similar tools to build such images? Are there standard daemons for managing hardware watchdogs (besides systemd since that's too simple as you say)? I think there's a lot of niche knowledge in the embedded space and many programmers are used to cloud systems and at most target. The most embedded experience most programmers have is likely iOS/Android development where all of the actual embedded concerns are handled for you. Even Google (soft)bricked a bunch of phones with the latest Android 14 update [1].

IMO there's not a lot of regular OSS for building embedded systems that comes with A/B partitioning, watchdogs, secure and verified boot - it's all custom at every org and tailored for individual products.

[1] https://arstechnica.com/gadgets/2023/11/android-14-patches-r...

MarkSweep · on Nov 15, 2023

I quit my job before I got to deploy this, but RAUC looked like it would handle this for Yocto:

https://github.com/rauc/rauc https://github.com/rauc/meta-rauc

For microcontrollers, Memfault had a good article:

https://interrupt.memfault.com/blog/device-firmware-update-c...

neuralRiot · on Nov 15, 2023

> including devices that are unreachable for example for 6 months of the year

That made me think, imagine NASA bricking up the voyager with a SW update.

aaronbeekay · on Nov 14, 2023

As somebody currently working at an automaker on software systems, the amazing thing to me is that a mess up of this level doesn’t happen weekly. It’s rough out here.

jacquesm · on Nov 15, 2023

Thank you. At least you're honest about it, the other day someone was trying real hard to convince me that software developers at automakers are made of magic fairy dust.

kalleboo · on Nov 15, 2023

I'm amazed anyone would argue that after the Toyota firmware analysis.

jacquesm · on Nov 15, 2023

Check out the thread a couple of days ago:

https://news.ycombinator.com/item?id=38244149

bozhark · on Nov 14, 2023

What's the priority then, telemetry data? Why is it rough out there?

jacquesm · on Nov 15, 2023

Relatively crappy pay, complex toolchains, long build times, layer upon layer of (really bad) legacy code, badly specified (if they're specified) protocols between subsystems, subsystems that are completely opaque (no source code provided), homegrown OS's or older RTOS's, subset-of-C to keep it safe(r), tricky debugging environments and if you're really unlucky anemic hardware.

I hope I didn't miss anything but I wouldn't be surprised if I did.

ahartmetz · on Nov 15, 2023

Yeah, I think you missed something. The "software architects, heavy enterprisey tooling, and minions" approach to development where some of the architects could be good developers, but they don't develop, and the minions are often not that good and also not given any autonomy, so they are in a state of learned helplessness and just do what they're told without much thinking or initiative. It results in over-abstracted, over-complicated, slow, unreliable, and sometimes just stupid code.

jacquesm · on Nov 15, 2023

Fair enough, yes. That's hopefully not all of them though but I don't doubt that many of the older companies work like that.

ahartmetz · on Nov 15, 2023

Most car companies are, in fact, quite old. Their big suppliers (who are often even worse at software, if you can believe that) are also quite old.

reactordev · on Nov 14, 2023

Probably due to fires, failures, and fatigue.

firtoz · on Nov 15, 2023

Games have AAA, autos have FFF

foobiekr · on Nov 14, 2023

do you guys not have confirmed boot and swizzling to fallback images?

AlotOfReading · on Nov 15, 2023

Automotive varies widely between "basically modern Linux systems with proper updates" and the most janky, home-grown update systems imaginable, sometimes even within the same components and teams.

foobiekr · on Nov 15, 2023

Yah, I know from friends at ford and vw that there's still vxworks and qnx, but even there, good grief, a-b with confirmed boot is about as basic as you can get.

I confess I've seen incredible sloppiness about when a confirmation is done (too early, including in the initial init stages which is way too soon) and watchdogs (spawn off a process that has a while loop stroking the wd - just absolutely pointless).
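For contrast with the pointless while-loop petter, here is a minimal sketch of a petter gated on forward progress. The class and method names are hypothetical, and `pet` stands in for whatever actually services the hardware watchdog.

```python
from typing import Callable


class ProgressGatedPetter:
    """Services the watchdog only if the dispatch loop has processed events."""

    def __init__(self, pet: Callable[[], None]) -> None:
        self._pet = pet          # callable that services the hardware watchdog
        self._progress = 0       # bumped by the dispatch loop per event handled
        self._last_seen = 0      # progress value observed at the previous tick

    def event_processed(self) -> None:
        """Called by the single dispatch thread after handling each event."""
        self._progress += 1

    def tick(self) -> bool:
        """Called periodically; pets only if progress was made since the last tick."""
        if self._progress != self._last_seen:
            self._last_seen = self._progress
            self._pet()
            return True
        return False
```

One caveat of this sketch: a healthy but idle system makes no progress and would be reset, so in practice a periodic heartbeat event would probably need to be routed through the same dispatch loop.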

cozzyd · on Nov 15, 2023

I've seen kicking and petting the watchdog, but this is my first time seeing stroking

LoganDark · on Nov 15, 2023

Sometimes the watchdog needs to have fun too, you know.

WWLink · on Nov 15, 2023

I've heard all of the above, often "stroking". I never used those because I like systems where you have a random challenge code to respond to. Then the software can't be acting too wonky if it is going to react correctly.
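A challenge-response watchdog of the kind mentioned here might look roughly like the sketch below. The XOR transform and all names are made up for illustration; real parts define their own challenge format and timing window, which this sketch leaves out.

```python
import secrets
from typing import Callable, Optional


def expected_response(challenge: int) -> int:
    """Hypothetical transform the software must compute; real devices differ."""
    return challenge ^ 0xA5A5A5A5


class ChallengeWatchdog:
    """Accepts a service request only if it answers the current challenge."""

    def __init__(self, transform: Callable[[int], int]) -> None:
        self._transform = transform
        self._pending: Optional[int] = None

    def issue(self) -> int:
        """Generate a fresh random 32-bit challenge."""
        self._pending = secrets.randbits(32)
        return self._pending

    def answer(self, response: int) -> bool:
        """Return True for a correct answer; each challenge can be used once."""
        ok = self._pending is not None and response == self._transform(self._pending)
        self._pending = None
        return ok
```

A blind loop that writes a constant value can't satisfy this, because the right answer changes with every challenge.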

ahartmetz · on Nov 15, 2023

From experience, QNX is actually very nice. I wouldn't say "still using QNX" like it's some crap that nobody would want.

mips_r4300i · on Nov 15, 2023

Indeed, a good RTOS from 10-20 years ago works just as good now as it did back then. The only things that change are dev environments and the driver support.

cjbprime · on Nov 15, 2023

> This ran in CI and would fail the build if it didn't pass.

I don't mean to be pedantic, but since we're talking about what should happen instead, this is insufficient. It works until the day you realize you made some kind of manual change to your CI infra, or that CI has some non-standard configuration that makes it work for you but not some significant fraction of the fleet.

People should do what you described in CI, but as well as that, you need phased rollout, where e.g. the build can only be rolled out to the next percentage point of randomly selected users in a specific segment (e.g. each hardware revision and country as independent segments) after meeting a ratio of successful check-ins, in the field, from the new build by production customers in that segment. That's the actual metric for proceeding with the rollout: actual customers are successfully checking in from the new version of the software.
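The gating rule above can be sketched as follows. The thresholds (`min_ratio`, `min_sample`), the hash-bucket selection, and all names are assumptions for illustration; each segment (say, a hardware revision in one country) would carry its own stats and its own percentage.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class SegmentStats:
    offered: int      # devices in this segment that were offered the new build
    checked_in: int   # of those, devices that checked in from the new build


def in_rollout(device_id: str, percent: int) -> bool:
    """Stable pseudo-random selection: hash the device id into a 0-99 bucket."""
    bucket = int.from_bytes(hashlib.sha256(device_id.encode()).digest()[:4], "big") % 100
    return bucket < percent


def next_rollout_percent(current: int, stats: SegmentStats,
                         min_ratio: float = 0.99, min_sample: int = 100) -> int:
    """Advance by one percentage point only when field check-ins justify it."""
    if stats.offered < min_sample:
        return current  # not enough evidence from the field yet
    if stats.checked_in / stats.offered < min_ratio:
        return current  # too many devices went quiet after updating
    return min(current + 1, 100)
```

Because the bucket is derived from the device id, raising the percentage only ever adds devices and never drops ones that already updated.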

Except, that's actually not sufficient either. What if the new build is good, but it contains an update to the updater which bricks the updater? Now you're getting successful check-ins from the new version in the field, but none of those customers will ever successfully auto-update again. So, test the new updater's ability to go forwards successfully, too.

quailfarmer · on Nov 15, 2023

A good way to handle the who-updates-the-updater issue is to use a triple partition updater. A updates B, and then B updates C, then C updates A. If anything about the new version prevents it from properly updating its neighbor, that neighbor won't be able to close the loop, and you'll fall back to A. This simplifies the FSBL, because it just boots the three partitions in a loop, no failure detection required. You don't need to triplicate the full application either, just the minimum system needed to perform an update, and then have the "application" in its own partition to be called by the updater.
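Here is a toy simulation of one reading of this scheme, in which each booted slot tries to install the target image on the next slot. The model and names are assumptions; in particular, which slot ends up as the survivor depends on details the comment doesn't spell out. What it does show is that an image with a broken updater cannot overwrite all three slots.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Image:
    version: str
    updater_works: bool


def run_boot_loop(slots: Sequence[Image], target: Image,
                  boots: int, start: int = 0) -> List[Image]:
    """Boot slots round-robin; each booted slot tries to update its neighbor."""
    slots = list(slots)
    for step in range(boots):
        i = (start + step) % len(slots)
        if slots[i].updater_works:
            # Only a slot with a working updater can overwrite the next slot.
            slots[(i + 1) % len(slots)] = target
    return slots
```

Starting from three good `v1` slots with a broken `v2` as the target, the slot after the first `v2` copy is never overwritten, so a working updater always remains to install a later fix.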

latchkey · on Nov 15, 2023

> It works until the day you realize you made some kind of manual change to your CI infra, or that CI has some non-standard configuration that makes it work for you but not some significant fraction of the fleet.

Nah, my CI process was solid. This was proven in the field over the course of years.

> I don't mean to be pedantic... you need phased rollout

You don't need to be pedantic, but better to ask the question rather than assume that was all that I did. =) You have to realize that what I built, worked flawlessly. It wasn't easy either, took a lot of trial and error.

I did have a CIDR based rollout. I could specify down to the individual box that it would run a specific version. Or I could write "latest" to always keep certain boxes running on the latest build. This was another part of my testing, but ended up not being fully necessary because I had enough automated testing in CI that "latest" always worked.
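A CIDR-based version pin like this could be resolved roughly as below. The rule format, the most-specific-match policy, and the function name are guesses for illustration, not a description of the actual system.

```python
import ipaddress
from typing import Dict


def resolve_version(ip: str, rules: Dict[str, str], latest: str, default: str) -> str:
    """Pick the version from the most specific CIDR rule containing `ip`."""
    addr = ipaddress.ip_address(ip)
    best = None  # (network, version) of the longest matching prefix so far
    for cidr, version in rules.items():
        net = ipaddress.ip_network(cidr)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, version)
    version = best[1] if best else default
    # "latest" is a floating pin that follows the newest build.
    return latest if version == "latest" else version
```

A `/32` rule pins an individual box, while a wider rule marked `"latest"` keeps a group of canary boxes on the newest build.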

> but it contains an update to the updater which bricks the updater?

This happened, so I wrote a lot of test code to make sure that would never happen again. My CI would catch that since I was E2E testing that it could actually run the upgrade process.

Once I implemented all of this, I never had a single failure and would routinely, several times a day, deploy to the entire cluster, over the course of a couple years.

It was all eventually consistent as I could also control the "check for update" frequency as well.

cjbprime · on Nov 15, 2023

I think there's a minor confusion here, where you think the purpose of my response involves doubting whether your system was successful. I understand it was successful. My response is to the sense in which your comment can be interpreted as advice to other people on what they should build.

I think the fact that you were able to survive with CI-only doesn't mean that we should encourage others to skip implementing a phased rollout based on verified customer successes, including testing of their new updaters before the first time they accidentally brick all the updaters, rather than afterwards. That's what I was hoping to help avoid, through my comment.

jacquesm · on Nov 15, 2023

And you need to verify the vehicle is not in motion.

psychlops · on Nov 14, 2023

Having worked on 25K machines, I can assure you that it never deployed to every single machine and failed to do so in interesting ways all the time.

latchkey · on Nov 15, 2023

It always deployed. It was eventually consistent. Any failure would automatically be resolved after a period of time.

psychlops · on Nov 15, 2023

Interesting. At any point in time, I had errors from hardware, software and networking. Even the racks would be getting overwhelmed at certain times. Simply being able to ssh into every host wasn't guaranteed. I'm not sure how you did it.

kuchenbecker · on Nov 15, 2023

+1 to this, we have a 0.1% hardware failure rate every time we do a rolling restart (40-50k nodes). Some just never come back, in the best case, but actively misbehave in the worst. If the node is unresponsive we remove it from the cluster and fix it async.

latchkey · on Nov 16, 2023

If the daemon was running, it would ping a central server on a schedule and report its status; the server's response said whether or not a new version was available (with the binary included if so). This combined ping/update service really cut down on the overall traffic, and failures.

If the machine had crashed, it would start up, start my daemon, and that daemon would start the ping/update process all over again.

A large portion of the machines were iPXE booted... so, just reboot was one option and it would all start from scratch again.

Yes, some of the boxes had flaky power supplies or would fail an ssd, and that would cause a technician to go out and manually fix things.

I found it was critical to think of everything as eventually consistent because my hardware was boxes with 12 GPUs and they were flaky and would crash the whole box randomly. I got used to boxes rebooting hundreds of times. My process would also auto-tune the GPU for stability too, changing clock/power settings until the individual cards would become stable and stop the crashing.

The only time I had problems was when the daemon was dead. I had a dashboard where I could see which machines hadn't reported their status. It was easy to pick those off by hand.
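The combined ping/update exchange and the "hasn't reported" dashboard could be sketched server-side like this. All names, the in-memory `last_seen` dict, and the response shape are assumptions; transport, authentication, and signing of the binary are left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PingResponse:
    update_available: bool
    version: Optional[str] = None
    binary: Optional[bytes] = None   # shipped in the same response as the answer


def handle_ping(host: str, reported_version: str, wanted_version: str,
                binaries: Dict[str, bytes], last_seen: Dict[str, float],
                now: float) -> PingResponse:
    """One request both records the host's status and answers the update check."""
    last_seen[host] = now
    if reported_version == wanted_version:
        return PingResponse(update_available=False)
    return PingResponse(True, wanted_version, binaries[wanted_version])


def stale_hosts(last_seen: Dict[str, float], now: float, max_age: float) -> List[str]:
    """Hosts whose daemon has not reported within `max_age` seconds."""
    return sorted(h for h, t in last_seen.items() if now - t > max_age)
```

Since the client simply retries on its next scheduled ping, a lost response or a crashed box just delays convergence, which is what makes the whole thing eventually consistent.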

postalrat · on Nov 15, 2023

As a frontend web developer I'm constantly deploying software to many thousands of machines. And you know what? It's pretty damn simple.

drdaeman · on Nov 15, 2023

I used to wear your shoes in IE6/7 ages (no longer, I gave up during the "framework of the week" race and went all-backend), and it wasn't simple at all. Browser compatibility with all their rendering nuances, individual system oddities and all sort of fragile stuff.

And fortunately, no one bats an eye at a slightly broken site, but everyone hates even a slightly broken vehicle.

jrumbut · on Nov 15, 2023

It's simple because we tolerate certain limitations in the web platform.

If you had a hard requirement that a page load could never take more than 100ms, regardless of network conditions, you'd have quite a challenge on your hands.

onion2k · on Nov 15, 2023

The laws of physics are definitely very challenging. If you've got a solution please write a blog post.

jrumbut · on Nov 15, 2023

No blog post required! You just install the whole app on a dedicated piece of hardware on site.

But then deployment becomes more challenging ;)

postalrat · on Nov 15, 2023

Those deployments are called PWAs.

onion2k · on Nov 15, 2023

I'm not really a frontend dev any more but I was for a long time. I can assure you that the only reason you think your code works is because no one tells you it's broken. If you use an error logging or telemetry service (Sentry, Rollbar, New Relic, etc) you will be aware that errors happen in frontend code all the time. It's just that most of the time bugs don't crash the app, and the user doesn't know what to expect so they see a broken feature and think it's meant to be like that.

uw_rob · on Nov 15, 2023

I don't think it's fair to consider the updaters for either Chrome or the OS to be simple.

donmcronald · on Nov 14, 2023

> While I fully understand that this is hard to get right 100% of the time, a mess up of this level by a car manufacturer is pretty amazing to me.

I feel like it's going to happen to someone that makes network devices eventually. I'm always scared to update my (several hundred) UniFi devices. Their update process isn't foolproof and they push auto-updates via the UI pretty hard.

Several years ago they caused some people's devices to disconnect from the management controller when they enabled 'https' communication. Prior to that, if you were pointing devices at 'https://example.com:8080...' they would ignore the 'https' part and do an 'http' request to port '8080'. Then they pushed their 'https' update which expected an 'https' connection and didn't fall back to the old behavior for anyone that was mistakenly using 'https' in their URL initially. Some people on their forums complained about having to manually SSH to every device to fix the issue.

It was caused by an end-user mistake, but they knew it was a potential issue. AFAIK, their attitude on it hasn't changed a lot, and at the time their response was that they knew it would break some people, but that it wouldn't be that many (lol).

IMO, the issue with those systems is that basic communication back to the update / config server is part of the total package which is too complex (ie: a full Debian install). I'd rather see something like Mender (mender.io) where the core communications / updates come from a hardened system with watchdog, recovery, rollback logic.

Think of how crazy it is to have something like pfSense doing package based updates rather than slice based updates. At least with boot environments they could add some watchdog and rollback type logic, but it'll still be part of the total system instead of something like a hardened slice based setup where the most critical logic is isolated from everything else and treated like a princess.

Do you have any insight on package vs slice based systems for updates? Did you isolate update logic from the rest of the system or am I out of touch with that opinion?

vGPU · on Nov 15, 2023

Reminds me of my (far less critical) update process for home assistant. Every time something breaks. Currently my hvac automations are going haywire.

akira2501 · on Nov 15, 2023

When possible, I used a fail back mechanism. If the update failed to fully come up, then the watchdog timer would catch it, the bootloader would notice the incomplete boot, and attempt to boot from the previous known working image in that case.

code_runner · on Nov 14, 2023

out of morbid curiosity.... how long did it take to ssh into and fix all of those servers? I imagine even automating a fix (if possible) would still take a good amount of time.

latchkey · on Nov 14, 2023

GNU parallel and sshpass are your friends.

The way I built my app was that I could install it cleanly via a curl | bash.

So, I just had a simple shell script that iterated through the list of IP addresses (from the DHCP leases), ran curl | bash and that cleaned up the mess pretty quickly.

jdechko · on Nov 15, 2023

As a non-developer, the whole situation with a bad software update to the Voyager spacecraft really puts things into perspective as far as how bad remote updates can be.

It’s also a testament to the way that the system was designed that they were able to get it back online.

sixtram · on Nov 15, 2023

you ssh-d into 25K servers one by one? I mean, manually?

latchkey · on Nov 15, 2023

https://news.ycombinator.com/item?id=38270986

ugh123 · on Nov 15, 2023

Please tell me you scripted that ssh across your 25k servers!

latchkey · on Nov 15, 2023

https://news.ycombinator.com/item?id=38270986

One thing my little control process did on the box was to always set the password to be the same... user/1.

None of these boxes needed inbound connections, so it wasn't a big deal to do that.