
I don't follow. A reboot is downtime. Of course your architecture must tolerate downtime when it happens, but it's lost money either way: your hardware isn't doing any useful work while rebooting. So the more computers you have, the more money is lost. At small scale that's insignificant, but at large scale it can become significant, so there's more incentive to reduce downtime.


A reboot, a software deployment (kernel upgrade), server replacement, etc. are all the same process. That simplifies things dramatically. You can micro-optimize the 30s it takes to reboot a server, or you can simplify a runbook to have one process for any “deployment”. Different scenarios require different things but for most “web scale” things that need to be overprovisioned anyway, I’d take the simpler process.


These servers don't take 30s to reboot. Some servers take many minutes. It's a lot.


Worse, some just don't come back without manual intervention. Power supplies don't last forever and might run fine while the machine is on, but after a reboot... boom, gone.


I'd prefer kexec to kpatch, then
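(For context: kpatch applies fixes to a running kernel without rebooting, while kexec boots into a new kernel directly from the running one, skipping firmware/POST, which is usually the slowest part of a server reboot. A rough sketch of the kexec path, with illustrative kernel/initrd paths, assuming kexec-tools is installed:)

```shell
# Stage the new kernel in memory (paths are illustrative; point these
# at the actual vmlinuz/initrd on your system).
kexec -l /boot/vmlinuz-new --initrd=/boot/initrd.img-new --reuse-cmdline

# Jump straight into the staged kernel, bypassing firmware init.
# On systemd machines, `systemctl kexec` does the same after a clean
# service shutdown.
kexec -e
```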


Spin new servers up before you take the old ones down. Effectively zero loss of time for that service.
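A toy sketch of that ordering, where `spawn` and `retire` are hypothetical stand-ins for your real provisioning/health-check and drain/stop steps:

```python
def rolling_replace(fleet, spawn, retire):
    """Replace every server in `fleet`, always bringing the new
    instance up (and healthy) before draining the old one, so the
    live count never dips below the original fleet size."""
    live = list(fleet)
    for old in fleet:
        new = spawn()    # provision and health-check the replacement first
        live.append(new)
        retire(old)      # only then drain and stop the old instance
        live.remove(old)
    return live
```

The point is purely the ordering: capacity is briefly n+1, never n-1, so the reboot (or replacement) costs no serving time.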


Sounds like something to fix rather than to paper over?


Isn't it more significant at smaller scale? That is, if you have fewer computers serving requests, the downtime of a single system will be more pronounced (as opposed to rebooting one machine out of 20 in a rack).


If it isn’t an emergency patch, we do all our maintenance at low traffic times (e.g. the middle of the night local time for the data center). Your capacity planning is based on peak traffic, so you can afford to have more machines out during low traffic times.


Yes, you need to overprovision the servers a little bit.

But you get a much simpler process.

Process ain't free either.



