How Meta patches Linux at hyperscale (thenewstack.io)
135 points by elorant on Dec 2, 2023 | 105 comments


Ksplice is the original live patching technology that got bought by Oracle and was later extended to user space programs while I worked there. It's a really neat technology that isn't made obsolete by the move to cloud, since you still don't want to have to restart the whole fleet at scale.


What's the live app patcher called?


I have objections to the name, but it goes by Ksplice for User Space.

https://blogs.oracle.com/virtualization/post/ksplice-zero-do...


How was Oracle to work at?


Post-acquisition, we moved offices, used their HR department, and certain product decisions were made for us, but we largely did our own thing, for better or worse, so I don't know how much I can speak to how it was to work at Oracle. We were under Wim Coekaerts, who is a big open source guy. Oracle's reputation as a cutthroat legal entity is well deserved, but working on that side of the company it felt unfair, because Oracle's open source contributions with VirtualBox, the UEK, and others are lost in the grar over MySQL.

I think the biggest "big company" blunder while I was there was a public blog post by someone high up at the company decrying open source as bad/wrong while, at the same time, Oracle was doing all this other open source stuff.


> but working on that side of the company it felt unfair, because Oracle's open source contributions with VirtualBox, the UEK, and others are lost in the grar over MySQL.

They are also remembered for closing OpenSolaris and shaking people down over the VirtualBox extension pack ( https://www.theregister.com/2019/10/04/oracle_virtualbox_mer... ).


> grar

Useful word.


Funny given that they have been selling open source software for ages...

https://docs.oracle.com/cd/B10463_01/web.904/b10320/apjsvsup...

I believe this doc is circa 2000, but it's been going on for longer than that.


Not OP, but I loved Oracle in the '90s/early 2000s. After the Sun acquisition it became cutthroat and I left. IBM acquiring Red Hat seems very similar.


I wish they had mentioned how long a full deployment takes Meta using this method; that seems like an important detail to omit.

> So, if you’d rather not have downtime with your servers, data centers, and clouds, follow Meta’s example and use live patching. You’ll be glad you did.

Maybe if you're working at Meta's scale it makes sense... But I think most well designed services and applications should be able to get by just fine with a full reboot of any single server. I can't really fathom the complexity of managing millions of servers though.


> Maybe if you're working at Meta's scale it makes sense... But I think most well designed services and applications should be able to get by just fine with a full reboot of any single server.

I feel like this should be the opposite... I don't work at Meta scale, but I do work for a CDN with 10s of thousands of servers, and everything we do is based on the idea that some machines will always be going down, because some hardware is going to fail every day just from probability. You have to design everything for failure.

Given that, it shouldn't be hard to take servers out of production for patching and updates.

In other words, a hyperscaler is going to have less incentive to minimize down time than smaller shops.


I don't follow. A reboot is downtime. Of course your architecture must allow for downtime when it happens, but it's lost money either way: your hardware is not doing any useful work while rebooting. So the more computers you have, the more money is lost. At small scale that's not significant, but at large scale it can become significant, so there's more incentive to reduce downtime.


A reboot, a software deployment (kernel upgrade), server replacement, etc. are all the same process. That simplifies things dramatically. You can micro-optimize the 30s it takes to reboot a server, or you can simplify a runbook to have one process for any “deployment”. Different scenarios require different things but for most “web scale” things that need to be overprovisioned anyway, I’d take the simpler process.


These servers don't take 30s to reboot. Some servers take many minutes. It's a lot.


Worse, some just don't come back without manual intervention. Power supplies don't last forever and might run fine while the machine is on, but after a reboot... boom, gone.


I'd prefer kexec to kpatch, then


Spin new servers up before you take the old ones down. Effectively zero loss of time for that service.


Sounds like something to fix rather than to paper over?


Isn't it more significant at smaller scale? That is, if you have fewer computers running to serve requests, the downtime of a single system will be more pronounced (as opposed to rebooting one machine out of 20 in a rack).


If it isn’t an emergency patch, we do all our maintenance at low traffic times (e.g. the middle of the night local time for the data center). Your capacity planning is based on peak traffic, so you can afford to have more machines out during low traffic times.


Yes, you need to overprovision the servers a little bit.

But you get a much simpler process.

Process ain't free either.


> You have to design everything for failure. Given that, it shouldn't be hard to take servers out of production for patching and updates.

There's a big, big difference, especially at the scale of hundreds of thousands / millions of servers, between designing such that your architecture can suffer 1% of servers being offline and 10% of servers being offline. If you have 1 million servers, even if you could take 1% of them offline at once (i.e. 10,000 servers), if it takes 5 minutes to reboot, you then need to wait 5 minutes * 100 one-percent-buckets = 500 minutes, or 8.3 hours to do a full patch. When you have critical security updates (like Heartbleed) you simply cannot have unpatched servers exposed to the Internet for that much time. And that's not including the amount of time it takes to actually send reboot/patch commands to 10,000 servers.

The larger the bucket, the more likely a bad patch is noticed by the public, and the more likely that an ordinary traffic spike (which your extra capacity is ordinarily there for) will overwhelm your servers (since your extra capacity is being used to handle patch rollout and rebooting). Sure, you can plan and add even more capacity to compensate, which makes it take even longer to roll out a patch, and now Finance is knocking at your door wondering if you really need all these servers and whether you could decommission some of them to save money.

It's a fundamentally difficult problem at hyperscaler scale.
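
To put rough numbers on the above, here's a quick back-of-the-envelope sketch; the 1% bucket, 5-minute reboot, and 1M fleet are the illustrative assumptions from this comment, not Meta's actual figures:

```python
# Back-of-the-envelope rollout time under an "at most 1% down" constraint.
# All inputs are illustrative assumptions from the comment above.
fleet_size = 1_000_000        # total servers
max_offline_fraction = 0.01   # at most 1% offline at once
reboot_minutes = 5            # reboot time per server

bucket_size = int(fleet_size * max_offline_fraction)  # 10,000 servers
num_buckets = fleet_size // bucket_size               # 100 waves
total_minutes = num_buckets * reboot_minutes          # 500 minutes

print(f"{num_buckets} waves x {reboot_minutes} min = "
      f"{total_minutes} min (~{total_minutes / 60:.1f} hours)")
# 100 waves x 5 min = 500 min (~8.3 hours)
```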


I think AWS does pretty well with that philosophy at hyperscale.

AWS has millions of servers in a single AZ.


If you've seen any of Werner Vogels' talks, you will notice that Amazon and AWS feel the same way. Any scalable service should be able to withstand the loss of some of its components and keep operating.


Edit: ignore the below numbers, I got hours and minutes confused.

If each server spends 7.5 minutes each month rebooting, that is 1% of all your computers wasted. If you have 10,000 servers, that's worth 100 servers. If you have 1 million servers, that's worth 10,000 servers. If each server costs $10,000, that's $100 million of compute capacity. You can see how that amount of lost compute capacity can start to justify spending engineering time on driving down the amount of time servers spend rebooting.


There’s 1,440 minutes per day. In a month with 30 days that’s 43,200 minutes per month.

7.5min/43,200min = 0.00017361

Where are you getting 1% of all your computers wasted per month???


Sorry I got hours and minutes confused and overstated the benefits.

So the correct numbers would be 7.5 minutes of downtime divided by 43,200 minutes times 1,000,000 servers. That’s 173 servers wasted. That is probably still enough servers wasted to devote some engineering time to increasing utilization.
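
For anyone following along, a quick sketch of the corrected arithmetic (same assumption of 7.5 minutes of reboot per server per 30-day month):

```python
# Capacity lost to reboots, per the corrected numbers above.
reboot_minutes_per_month = 7.5
month_minutes = 30 * 24 * 60          # 43,200 minutes in a 30-day month
fleet_size = 1_000_000

lost_fraction = reboot_minutes_per_month / month_minutes  # ~0.00017
wasted_servers = int(lost_fraction * fleet_size)

print(f"lost fraction: {lost_fraction:.6f} -> ~{wasted_servers} servers")
# lost fraction: 0.000174 -> ~173 servers
```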


Capacity is measured by peak usage, but most data centers experience a daily traffic cycle with low times at night. We just patch/upgrade our servers at the local low time, when traffic is much lower and a lot of your machines are idle.


https://youtube.com/watch?v=ILTqn1EYIXQ is the original talk, which says it takes 4 days to deploy a KLP (kernel live patch) to the whole fleet.


There's more in-depth information on that subject here:

https://www.usenix.org/conference/osdi23/presentation/grubic


[flagged]


Ads are just what pays the bills. The cluster is also used for all sorts of useful things, like mass communication (especially during emergencies), support groups, and providing information to help manage a pandemic.

Software engineers aren't nuclear physicists or machinists. I'm not sure how you want to use them to build nuclear reactors.


People become software engineers instead of choosing a different profession because there is so much money in ads.


And lawyers only choose that profession to ambulance chase, and doctors only choose that profession to do boob jobs. People get into software development for more than just money, y'know?


It's a very reductive take. Meta is helping a lot of people get and stay in touch for free (my family is all over the place, and we use WhatsApp extensively to communicate). There's clearly a demand for this, and doing it for the price of a couple of ads doesn't seem sad to me.


Did Meta create WhatsApp? Or did they just buy it to sell ads and track its users?


No, the users are the product. It's not free.


You’re not paying them out of your own pocket. I’ll happily use a better word for that if you have one.


Ass, gas, or grass; nobody rides for free. In this case you're paying with your ass. You're turning tricks for Zuck.

Meta pimps your mind out to anybody who wants to take a crack at you, to manipulate your political views or extract money out of you. In return, they give you purses and pay for you to get your nails did.


It’s not free.


Ads act as a recommendation engine, helping people find solutions to their problems sometimes before they ever realize there is a solution.


Ads are intended to convince people to spend money. In a perfect world that might mean actually being useful solutions, but I have zero faith that that's even a passing concern in reality.


Yeah, and casinos offer people some good clean fun. Definitely no incentives here to "suggest" things to people that they don't need or benefit from having, right?


I don't know what ads you are getting, but most of what I see is either trash dropshipping products, gambling products (often bad mobile games), or other worthless stuff.

There are some platforms with relatively "high" quality ads, but Facebook is not one of them.


lol, come on now. It's to make Facebook a bunch of money, not to tell the future.

I use Facebook's products and it's never helped me or anyone I know in the way you describe.


There is also KernelCare from TuxCare (https://tuxcare.com/enterprise-live-patching-services/kernel...) for kernel live patching. It supports most Linux distros.


Roughly 13 million servers at present, with $20B spent on new datacenter gear in 2023 and another $20B in 2024.

Running CentOS Stream 8 currently, going to 9 soonish.


Been using Kpatch with tons of my VMs, works quite well


"Draining and un-draining hosts is hard."

I'd stop right there and fix that, because that's a bullshit reason. Cycling hosts in and out of service is easy unless you're not doing things properly.

The Linux kernel is simply not designed to be live patched, and it's a total hack to try to do it: it will never work 100% of the time, will always be a source of uncertainty, and will always be expensive in terms of engineering work. Disaster will always be looming.

By contrast, fixing their system for taking hosts in and out of service, so that it's extremely robust and reliable would likely pay big dividends in reliability.

My guess would be that this approach is papering over organizational dysfunction. One team can patch all the kernels but one team can't make all the hosts support proper cycling in and out of service. And no one cares to fix it because there's no real incentive to do so. Only cool hacks and new projects are properly rewarded.


> fixing their system for taking hosts in and out of service, so that it's extremely robust and reliable would likely pay big dividends in reliability.

Facebook hosts can be robustly cycled. Of course. They've been doing this stuff for years. They've figured it out. That's not the issue.

Scaling up brings about new problems. This article specifically mentions the 45 day rolling restart issue. That's not an issue when you have 1000s of servers. It's one that shows up a couple of orders of magnitude later.

So you either solve the problem with a hack like kernel patching or you work to reduce restart times (drain + shutdown + OS restart + process initialization across every service). Get those restart times down 50% (good luck accomplishing that) and congrats, you're down to maybe a 25 day rolling restart, which is still quite a problem.


What exactly do you think makes cycling millions of hosts take 45 days? That's a couple hundred hosts every five minutes across a large number of datacenters?

I wouldn't expect it to be halved by optimization. I'd expect it to be an order of magnitude faster and take more like 4.5 days.

I wouldn't be surprised (but I would be impressed) if they tried and got it down to a full cycle requiring one working day. That's around 1% of hosts cycling every five minutes.


It's certainly weird. I've worked at ~million-host scale where uptime never exceeded a week by default (with the odd carve-out for problem-child software supplied by vendors).

I'm betting it's more that teaching developers to write software that tolerates draining properly (or can even communicate that it's draining) is too difficult for them, so they work around it.


> It's certainly weird. I've worked at ~million-host scale where uptime never exceeded a week by default

How many individual teams had software running on your hosts? How many of those hosts were stateful, fragmented across hundreds or thousands of service groups with their own fault tolerances and warm-up times unknown to the infra team? Adding complexity (rolling reboots) to already complex systems is almost never a good idea; at some point, there will be an issue caused by hosts rebooted in the wrong order, or by too many hosts of a certain type two dependency levels down being simultaneously offline.


I appreciate your attempt to invalidate my experience but your points are irrelevant.

> hosts rebooted in the wrong order

Order doesn't matter. Host groups set a threshold for unavailability. Hosts are not rolled unless availability targets are maintainable. Usually this just means the oldest host at any time will get rolled.

Facebook obviously hot patches kernel updates to work around a social issue. If instead you are able to prescribe a set of behaviors that teams must comply with, you can easily do things like reboot the fleet monthly without impacting availability, regardless of statefulness or fault tolerances.

If I shoot a random host in your pool and it matters to you, then you haven't achieved fault tolerance. I'm obviously not proposing shooting an unfair number of your hosts.


> I appreciate your attempt to invalidate my experience but your points are irrelevant

That was not my intention. I genuinely would have appreciated answers to my questions, as it would be useful to compare the complexity of your setup versus Facebook's. As an extreme case: a million homogeneous, stateless hosts are far less complex to manage than a million heterogeneous, stateful ones, and very little translates from the former to the latter, in my experience.

> Facebook obviously hot patches kernel updates to work around a social issue.

Which I think is reasonable when you have tens of thousands of SDEs.

> I'm obviously not proposing shooting an unfair number of your hosts.

I agree with you, but I'll go on to say that "not shooting an unfair number of hosts" is a hard problem to solve at scale, unless you're willing to make it simple and have humans deal with it by continually draining/undraining services, which costs a lot of money without increasing the top line, likely far more money than it cost to have a handful of engineers write kernel splicing. So beyond possibly being a social issue, it may be a cost/host-utilization issue as well.


These are fair points, but I'd add that continual draining amortizes to $0 as the fleet grows.

Even if you can splice, there are benefits to limiting uptime, with maintenance reaping the majority of them.


The amount of time needed to restart a fleet (without sacrificing availability) is correlated with excess server capacity. Excess server capacity is not free.


1 million hosts over 45 days = 15 min per host.

That's a very realistic/optimistic number (especially as you do want to wait for all services to be running and marked as healthy).

"Oh, but you can batch this": sure, but you don't want too big a batch, one that makes your service slow, and you don't want to risk shooting yourself in the foot, like rebooting your whole control plane and then finding out it doesn't work like that.

(The 45 days is probably an estimate as well; I'm not sure they actually do it server by server.)


That's still ridiculously slow. I'd expect them to have hundreds of microservices. Each one of those should be able to handle a random restart at any point in time, so they should absolutely be able to restart hundreds of servers concurrently without major disruptions. Hell, at Facebook scale a whole datacenter going down should not cause service disruptions.


This does assume that nothing is getting broken along the way.

Taking 45 days is probably more about caution and resolving issues systematically rather than pushing a big button and hoping you don’t cause issues.

I'd expect them to have thousands of microservices, and you only have to find a way to break one to cause big issues.


Regular random crashes should be exercised regardless at Facebook scale. Not being resilient to that would be very unprofessional.


That's not 15 minutes per host; that's 15 hosts per minute.
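
For reference, a quick sketch of the rate implied by the figures upthread (1M hosts, 45 days):

```python
# 1,000,000 hosts cycled over a 45-day rolling window.
hosts = 1_000_000
window_minutes = 45 * 24 * 60   # 64,800 minutes

print(f"{hosts / window_minutes:.1f} hosts per minute")       # ~15.4
print(f"{window_minutes / hosts * 60:.2f} seconds per host")  # ~3.89
```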


Why do you think it's easy?

There are a lot of systems where you can easily take down some hosts, but taking down more than N% at a time causes issues. If your fleet is large enough, then you are limited by the largest set of hosts where you can only take N% down at a time. Now, you could say: keep the sets of hosts small, or make N% large. But that can cause other issues, as you typically lose efficiency or zonal outage protection.

A solution to this could be VM live migration or something similar. But this breaks down for storage systems, where you can't just migrate the disks virtually since they're physical disks, and for places that don't use VMs.
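
To illustrate how the strictest "at most N% down" group gates the whole rollout, here's a minimal sketch with made-up group sizes and limits (nothing here comes from the article):

```python
# Minimum number of restart waves when each service group caps the
# fraction of its hosts that may be down at once. Hypothetical numbers.
import math

groups = [
    # (hosts in group, max fraction down at once)
    (500_000, 0.05),   # stateless web tier: tolerant
    (100_000, 0.01),   # storage tier: stricter
    (5_000,   0.002),  # quorum-based tier: very strict
]

# A group needs at least ceil(1 / max_fraction) waves to cycle fully,
# so the fleet-wide rollout is gated by the strictest group.
waves = max(math.ceil(1 / frac) for _, frac in groups)
minutes_per_wave = 15  # assumed drain + reboot + warm-up per wave

print(f"{waves} waves -> "
      f"{waves * minutes_per_wave / (60 * 24):.1f} days minimum")
# 500 waves -> 5.2 days minimum
```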


It may be easy to cycle hosts in and out, but it can also be time consuming, apparently. The article mentions that it takes 45 days to patch all hosts, and also points out that this is too long for security updates.

Nothing will work 100% of the time. If their patching mechanism is thoroughly tested and battle hardened, I think the risk would be acceptable. Once you do the initial kpatch security upgrade, you could even schedule the machine for servicing so that it's not relying on the live patch, limiting your exposure to bugs.


At Google we did pretty much the same thing. We aimed to be able to roll a kernel in 30 days, but various edge cases always made it drag out at the end unless you really spent a lot of human time on it. So we used ksplice for the really critical stuff (where the patch was easy, which was not always the case).

A reasonable compromise in the real world.


At Red Hat, we just maintain kpatch and hire kernel engineers to do the same. Turnaround is about a week for 40 variants of the kernel.


It's not a bullshit reason. You can put all the lipstick you want on the pig and put software all around it to make it all sorts of easy, but at the end of the day, having to reboot is a stop-the-(machine's)-world situation. Not having to do that is just better. Even if it doesn't work 100% of the time, that's still better than having to reboot the whole fleet. 45 days to reboot the whole fleet!

Throwing FUD around and saying disaster is looming because it's scary computer magic (out of MIT) was a scare tactic Red Hat used to throw at Oracle/Ksplice until they developed their own (kpatch); then suddenly their sales team had to backtrack and say that actually hot patching is good and can be trusted. I'm not saying it's not risky or dangerous, since it operates in kernel space, but that's why they pay really smart people to be careful when doing it, and not the digital equivalent of a plumber who can't do more than glue libraries together.

A better understanding of the underlying technology so it's less magic might assuage your fear of it, but thinking Facebook is so dysfunctional that they haven't already made it easier to reboot is to misunderstand the problem at hand.


The work that makes cycling hosts in and out easy is itself hard.

I agree that it’s the right thing to do but it’s hard.


> My guess would be that this approach is papering over organizational dysfunction.

So what? At large enough scale, organizational problems are harder than technical ones: if you can fix the former with the latter, that's still a win.


But it's not "fixing" the problem, it's papering over the problem by piling tech debt on top of tech debt.

Framing this like it's a good thing is my only objection.


Yeah, especially with containerisation and orchestration/Kubernetes. I get that perhaps not everything is viable to containerise, but in 2023 this feels archaic, like a lot of (potentially unnecessary) engineering work.


Red Hat provides kpatches for 6 months, that's all. If you are running a year-old kernel, no kpatches are provided for that kernel. So you definitely need to recycle hosts every six months.


> So, if you’d rather not have downtime with your servers, data centers, and clouds, follow Meta’s example and use live patching. You’ll be glad you did.

Most orgs don’t need and won’t benefit from emulating Meta for the sake of emulating Meta.


This sort of criticism gets repeated all the time ("Google designs for Google scale, but you're not Google, so don't use kubernetes!"), and sometimes it's fair, but this doesn't really make sense to me on this particular article.

If the infrastructure exists within your org's distribution of choice to do this, it's basically all upside. On AL2023, you just do:

`sudo dnf install -y kpatch-dnf kpatch-runtime`

`sudo dnf kernel-livepatch -y auto`

`sudo systemctl enable --now kpatch.service`

Super simple, one less thing to worry about.
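
If you do go this route, it's worth verifying that patches actually load. A small sketch below; it reads the kernel's standard livepatch sysfs interface, though `kpatch list` gives the same view through the kpatch tooling:

```python
# List loaded kernel live patches via the kernel's livepatch sysfs ABI.
# Run on the patched host; sketch only, double-check paths on your distro.
from pathlib import Path

lp_root = Path("/sys/kernel/livepatch")
if not lp_root.is_dir():
    print("no live patches loaded (or livepatch not supported)")
else:
    for patch in sorted(lp_root.iterdir()):
        enabled = (patch / "enabled").read_text().strip() == "1"
        print(f"{patch.name}: {'enabled' if enabled else 'disabled'}")
```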


> Super simple, one less thing to worry about.

We got bitten by kpatch a few times before deciding to abandon it around 2017.

From kpatch github (2023):

WARNING: Use with caution! Kernel crashes, spontaneous reboots, and data loss may occur!


> Most orgs don’t need and won’t benefit from emulating Meta for the sake of emulating Meta.

SO. Much. This.

Worked at more than a few places where the stack had almost as many layers as it had engineers "because this is how we did it at FAANG..."

Right, and those places also had a few orders of magnitude more engineers on staff to support it all. We do one hundredth of the things FAANG does and we have fewer than 50 _total_ people in the company; the simpler the stack, the better.


The danger in your comment is that people who are primarily "working" because they want to play on someone else's dollar with some new tech, frivolous from the perspective of what their business actually needs, are going to be insulted by this. Good luck.


Likewise at Meta there's a bunch of cargo-culting from Google.


Could you expand please?


Never heard of this "hyperscale" concept before. How is this any different from... scaling?


In general every order of magnitude brings new challenges. Companies running over a million servers have a lot of problems that smaller ones don't. Also they just have more room to amortize R&D.


The "H" in ADHD stands for "hyperactivity". This implies _significantly_ more activity than the norm to a point where they really stand out.

In sci-fi a spaceship travelling at "hyperspeed" was perceived to be so much faster than anything known to man that it would be difficult to comprehend.

The performance and cost of a hypercar compared to the average family car is sometimes difficult to understand too. The average person would have to work (potentially) hundreds of years to afford a €10M hypercar.

"Hyperscale" is so large that even us working in tech have difficulty grasping it because we have nothing tangible to compare it to. A million servers is bind boggling to me even with the "cattle not pets" mindset


Supermarkets and supercars aren't enough; we need hypermarkets and hypercars to be current. So now we can't just scale (I guess that's for trucks); to be current, you need to hyperscale.



Cybertruck and cyberscale?

(I'm joking)


give it a couple more years, we'll have ultrascale


And then the wiiscale?



> In computing, hyperscale is the ability of an architecture to scale appropriately as increased demand is added to the system.

So, yeah: just scaling. I agree; I've never heard the word "hyperscale" before and don't think we need that extra intensifier for a well-understood idea.


The word "hyperscale", AFAIK, was coined by Wall Street people as a collective noun for FB, Google, MS, et al., in the context of their in-house data center operations.


1x, 10x, 1000x are all "scales", yet problems may be a little bit different at each of them


I'm just quoting the Wikipedia definition of hyperscale. Nowhere in there does it say anything about 1000x, probably because that is an ill-defined concept.

1000x what? Today's computers are 1000x the ones from the '90s; should we call them all hypercomputers? Pretty much any startup can boot 20,000 nodes on AWS; are they all hyperstartups hyperscaling?

Seems like marketing nonsense.


I'm guessing you're mad about the term but not the concept? And that you do agree that scaling from 1 to 10 is not the same animal as scaling from 10 to 100? So why, then, call both animals "plain and simple scaling"?


> And that you do agree that scaling from 1 to 10 is not the same animal as scaling from 10 to 100? So why, then, call both animals "plain and simple scaling"?

So you think we need four different words for scaling 1-10, 10-100, 100-1000, ...?

Cut it out, you know it's just marketing hype. Not everything needs its own word.


It makes sense to have different words for totally different orders of scaling, yes.


>1000x what? Today's computers are 1000x the ones from the '90s; should we call them all hypercomputers? Pretty much any startup can boot 20,000 nodes on AWS; are they all hyperstartups hyperscaling?

Why compare to the past capabilities, wtf?

>Pretty much any startup can boot 20,000 nodes on AWS; are they all hyperstartups hyperscaling?

Now think how many nodes Google, Microsoft, FB, etc. can run in their tens of datacenters.


I dislike the term 'hyperscale', as it lacks concrete meaning. We need a metric akin to kilobytes or megabytes for storage. Terms like 'kiloscale' and 'megascale' could better indicate the scaling range of a service. For instance, scaling a service to a thousand instances and back to zero, or rolling out patches to a thousand instances, could be termed 'kiloscale'.


That's why I brought up the question, and I'm baffled why so many people have been downvoting it. To me, its use seems very specific, yet nothing I read about it told me anything concrete. From what I can tell, it's meant to broadly mean the extreme end of scaling. The threshold is still unknown to me, but I'll just take it that Meta is hyperscale.


Universe scale, I guess.


Size of the entire universe, man.


Multiverse.


Meta already does that.


Yeah, and this isn't a problem that interacts with scale. If you can patch 100 servers with an automated method, you can patch 1,000,000 of them.


This is absolutely a problem that interacts with scale. With 1M servers you're almost certainly dealing with hundreds of service owners, and some of those are going to need additional features you don't have to worry about with 100 servers. Some examples are databases with graceful failover, long-running AI model training jobs, or distributed databases like etcd where you have to be mindful of how many nodes can be down at a time.

It's not 10,000x harder to patch 10,000x more machines, but it's not 1x either. Easily 10-20x harder, if not more.


And at a certain scale patches will come out faster than you can deploy them!


What you're describing has nothing to do with the number of servers and everything to do with the number of services.

That's not scale, that's organizational sprawl.



