
Terrance here from Google Cloud Support.

There are only 3 things I can say about this situation. 1) These issues are currently unrelated. 2) We learn a lot from these situations. 3) A lot of these types of issues can be mitigated by running in more than 1 region.

I really can't promise that today's situations will never happen again. There are a lot of moving pieces in our system, and sometimes there are things outside of Google's control.



“You should be using more than 1 region” could also be “you should be using more than one provider”, no?


To somewhat echo BurritoElPastor's comment, running a system/app that can be run in multiple clouds is orders of magnitude more difficult than just running a system/app that can be run in multiple regions.

And, not to be snarky, but many of the other responses that are along the lines of "It's not really that difficult to run in multiple clouds" - let's just say I have trouble believing these commenters have real world experience actually doing this. I'm not saying it's impossible, but it is extremely difficult for any system of reasonable complexity with a dev team of, say, 10 or more people.

And, if you can stomach the cost, you do give up the ability to really use any of the proprietary (and oftentimes awesome) functionality of a particular provider, which can put your dev velocity at a big disadvantage.


It's not trivial, but it's also no longer an order of magnitude more difficult, as you describe it. There is a reason Kubernetes gets a lot of backing from corporate customers: precisely because it hides and abstracts most of the underlying infrastructure and provides platform-agnostic primitives that make sense at the application level.

Once you have deployed your stack on Kubernetes, you can pretty much run it on any cloud or infrastructure with minor tweaks at most.


It's quite common in cloud solution design to design for failure. One of the common assumptions that we hold to is that one region may go down. Other examples: Assume an instance of an app can go down. Assume a VM can go down. Assume a DC can go down.

This is not to excuse the downtime in any way.


Do we need a new definition for RAID level?

Redundant Array of Independent Data Clouds.

I guess for RAID 5 I would need a minimum of 3 regions or 3 separate cloud providers.


Do people ever worry that an entire cloud provider may go down, or is that too unlikely of a case?


However much we technical people might salivate at the prospect of designing a multi-cloud solution, for the vast majority of businesses it simply isn't worth the cost / complexity. I'd wager 90-something percent of applications could suffer multi-hour outages without impacting business function to any measurable degree.

Plus the fact that without serious investment, you're probably more liable to decrease availability by going multi-cloud thanks to the increased system complexity.


The real trick here, which many people don’t want to look at, is to avoid overly centralizing your workflow.

I can get a lot of work done while Outlook is down. Hell, probably more work done.

If our build server is down I can work for a couple hours (unless we’ve done something very bad). Same for git or our bug database or wiki or or or. When I get stuck on one thing I can swap to something else every couple of hours. And there is always documentation (writing or consuming).

But if some idiot, hypothetically speaking of course, puts most of these services into the same SAN, then we are truly and utterly screwed if there is a hardware failure.

Similarly if you make one giant app that handles your whole business, if that app goes down and there are no manual backups you might as well send everybody home.

I went to get a drink the other day and the place looked funny. They’d tripped a circuit breaker and the whole kitchen lost power. But the registers and the beverage machines were on a separate circuit. And since they sold drinks and food in that order, they stayed open and just apologized a lot. Whoever wired that place knew what they were doing.


Probably lost 1 of 3 phases. You're quite right in that the decision of what phase a circuit is on has a lot to do with business, and hopefully no major repurposing of the space without rewiring the space has occurred. For lighting, you'd want 1/3 of fixtures per room to go out, not 1/3 of your rooms in their entirety. For appliances and receptacles, you'd rather lose a whole function (the kitchen) than be able to cook but not do dishes, with every function trying to figure out oddball workarounds.


The chance that AWS goes down is much smaller than anything else going down. There are many SPOFs in a typical smaller company setup, most of those are not even obvious to the operators.


In the past ten years:

It’s happened more than once with Azure and GCP. I think it happened once with AWS, but not positive there.


AWS had a multi-hour total S3 outage in us-east-1 in February 2017 that knocked out a huge number of things mostly because it turns out that a huge share of their customers run in only 1 region and it's us-east-1. Things mostly continued to work in other regions.

I recall Azure had some sort of multi-region database failover disaster that took several regions offline, and GCP has had several global elevated latency/error rate events, but I don't think that any cloud provider has been "down" in the sense that the word is usually used.


GCP (and all of Google) was down worldwide in 2013 as one example:

https://www.theregister.co.uk/2013/08/17/google_outage/

Here’s one that’s on Azure. Not a 100% total outage like above, but bad enough most I know in the industry would call it being down:

https://www.zdnet.com/article/windows-azure-suffers-worldwid...

If I get a free moment, I’ll dig up other examples, but those were ones that were easy to find.


Billing issues can take down your entire account at a given cloud provider all at once.


It’s a legit concern, but it adds complexity that will probably cause more outages than the thing you are worried about.

IMO, you’re better off with a private data center or colo and separate integrations with cloud.


I don’t think it’s happened (yet), although some of the earlier outages when AWS was younger were pretty far-reaching. I think all of S3 has gone down a time or two.


All of S3 has, but that’s because S3 had a single choke point in a single region for a long time.


> All of S3 has, but that’s because S3 had a single choke point in a single region for a long time.

The only S3 event here was limited to us-east-1: https://aws.amazon.com/premiumsupport/technology/pes/

Some APIs were impacted, because they are global by nature (e.g create-bucket). But S3 was working fine in all other regions, for existing buckets.

However, many websites were affected, because they didn't use any of the existing S3 features that allow for regional redundancy, simply because S3 had been so reliable they didn't know/think they needed to have critical assets in a bucket in a 2nd region that they could fail over to.
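The regional failover those sites were missing can be sketched generically. This is an illustrative sketch, not AWS's actual API: `primary` and `replica` are hypothetical stand-ins for fetching the same asset from buckets in two different regions.

```python
# Illustrative client-side fallback: try the primary region's copy of an
# asset first, then fall back to a replica in a second region.
def fetch_with_fallback(fetchers):
    """Try each fetcher callable in order; return the first success."""
    last_error = None
    for fetch in fetchers:
        try:
            return fetch()
        except Exception as err:  # real code would catch the client's error type
            last_error = err
    raise RuntimeError("all regions failed") from last_error

def primary():
    # Stand-in for a GET against an assets bucket in us-east-1 during the outage
    raise ConnectionError("us-east-1 unavailable")

def replica():
    # Stand-in for the same object replicated to a bucket in a second region
    return b"asset bytes"

print(fetch_with_fallback([primary, replica]))  # falls back to the replica
```

The point is only that the fallback has to exist before the outage: the replica bucket and the client logic both need to be in place ahead of time.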

Admittedly, even the AWS status page was impacted, because it also relied on S3 in us-east-1.

S3 has done a lot of work to improve matters since, and mechanisms have been put in place to ensure that all AWS services don't have inter-region dependencies for "static" operation.

However, it is still incorrect to claim that it was all of S3. Many customers who use S3 only in other regions were totally unaffected.


All of S3 create-bucket is "all of S3" for a lot of use cases and customers.


Well, sure, if you hate your devops team and you want to make sure they can’t use any of the proprietary functionality of either provider. At which point, if you want to be managing a fleet of vanilla Linux boxes yourself, why use a cloud provider at all?


* You should not be locking yourself into proprietary functionality of a cloud provider unless you are deeply interested in having what happened to Oracle customers, who got raked over the coals, happen to you.

* DevOps teams can be multi-cloud relatively easily when using infrastructure-as-code tooling (Terraform, Packer, etc.) and traditional DevOps practices

* Why manage a fleet of vanilla boxes when you can use vanilla boxes with Kubernetes and not get gouged by cloud providers in the first place?

You don't need to jump off the hype train if you never got on in the first place.


Proprietary managed services can save a lot of dev/setup/SRE time though. Many businesses have more pressing things to work on than spending dev time to prevent vendor lock-in.


Everyone spends their runway differently. Once you’re off the ground, derisk.


Most companies don't have a "runway", they are just bootstrapped and have to actually justify their expenses and lock-in every day.


If I voluntarily choose a provider at a price that's acceptable to me, am I being gouged?


Not yet, but it seems obvious to me that the GP was referring to a situation where the price changes and then you are getting gouged. That's exactly what the negative connotations of lock-in refer to.


Each provider will seek to make you take their one true path, or you need to do your own engineering.

Using the providers path isn’t necessarily gouging, but it isn’t cost optimized either. The answer depends on you.

That said, cloud is like any tenant/landlord relationship. Your rights are linked to time and are whatever your contract provides. If you didn’t like Office 2007, you didn’t buy it. If you don’t like Office 365, 2021 edition, too bad.


It's not quite that black and white. You can use common/open APIs and cross-provider tooling whenever available and provider-flavored ones where necessary. It's more effort, but still less than hand-rolling everything.

Of course that only works as long as you're swapping out largely replaceable parts. If you built everything around some proprietary service then yeah, you've tied yourself to that anchor.


This seems overly negative. There are lots of ways to do hybrid clouds, especially if you’re doing it for only the more critical parts of your application.


> why use a cloud provider at all?

Cost+speed of scalability, and managed services. If you rarely need to scale, your workloads are all predictable, and you don't need managed services/support, you should just buy some VPSes or dedicated boxes.


Staying on current versions, and the ability to scale usage up and down?


Why would you want to lock into a cloud provider? You're losing a lot of operational flexibility for less devops and sysadmin work.

You are really limiting your tech stack by using standardized things like Jenkins, Docker, K8s, MQTT, Kafka.


It's not really that I "want to lock into a cloud provider". Sometimes I simply don't have the human bandwidth available to handle devops and sysadmin work while building the actual product.

"Outsourcing" those functions to cloud services can be big win for a small team. Like all engineering, it's a trade off.


For the same reason you want "to lock in" (meaning use) any solution: you do not want to build or operate it yourself. Why not take this further? Why use a water utility if you can just drill your own wells? Most businesses are better off on cloud because their core business is not to build and operate datacenters but to provide services to their customers (on top of datacenters running their apps).


If you're running in multiple clouds for HA/DR reasons, you are limited to the lowest common denominator of features/services between them. Or maintaining multiple codebases/architectures, and the massive pile of issues that entails. I am not a fan of multi-cloud for this reason.

With multiple regions, as long as your provider offers all of the services in each region, you can have a carbon copy. Much easier.

It depends on your needs, your architecture, your risk tolerance, etc. I think for most people "Use multiple regions" is the answer that strikes the correct balance. It probably isn't the correct answer for everyone.


> you can have a carbon copy. Much easier.

Certain terms and conditions may apply :) Carbon copy of a static website or one whose data is only a one-way flow from some off-cloud source of truth? Sure! Multi-master or primary-secondary with failover? Stray too far from the narrow path of specialized managed solutions and things get very complex, very quickly. That being said - it's mostly just the nature of the beast. If you're not able to tolerate a regional outage, multi-region is a pill you're going to have to swallow, no buts about it.


This is one of the reasons things like Federated Kubernetes are being worked on. Stick a CDN in front and your compute can be migrated from cloud to cloud. You still need to do a lot of thinking about data, though.


Three CDNs. And three DNS providers.


Maybe. If you get a billing issue or get marked as suspicious, you can lose all services with one provider.


More than one region is pretty easy; more than one provider is harder (especially if your workload isn't designed from the ground up for it). But, yes, just as multi-region protects you from things mere multi-AZ doesn't, multi-provider protects you from even more.


Yes.


I have an awesome demo I give running a complex stateful workload across cloud providers to show off the system that I work on. What I have learned from giving that presentation many times is that while it is nice to say you can run cross cloud, for most workloads you should just pick one cloud, and be able to move to another provider if you ever need to.


Is it practical to use several providers when egress is so expensive?


No, not unless you are someone like Netflix. Usually you can configure multi-region failover and such, and that will keep your things running. It is more expensive, but for most use cases I think the cost is still less than the dev time/complexity of setting up multi-provider workflows and the inevitable duplication of resources (which is part of the cost of multi-region anyway).


No. And there's been a lot of talk recently about multi-provider being the right strategy to mitigate downtime, which IMHO is a farce peddled by expensive consultants. The parent comment is correct - this is why availability zones and regions have been established by each provider.

For the large majority of businesses investing in infrastructure-as-code far outweighs any crazy HA, redundant, multi-provider, whizzbang whatever setup you may have.


> this is why availability zones and regions have been established by each provider.

But the degree of independence provided by AZs is not constant across providers, despite similar terminology.


You can move 1.6TB between providers in a month for the same price as a single beefy DB server (m4.16xlarge here). That's a whole lot of logical replication.
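For anyone wanting to sanity-check this kind of claim against their own traffic, the arithmetic is simple. The per-GB rate below is an assumed round number for internet egress, not a quoted price from any provider.

```python
# Assumed egress rate; actual per-GB pricing varies by provider, tier, and volume.
ASSUMED_PRICE_PER_GB = 0.09  # $/GB, illustrative only

def monthly_egress_cost(terabytes, price_per_gb=ASSUMED_PRICE_PER_GB):
    """Cost of moving `terabytes` out of a provider at a flat per-GB rate."""
    return terabytes * 1024 * price_per_gb

# e.g. the 1.6 TB figure from the comment above:
print(f"${monthly_egress_cost(1.6):.2f}")  # roughly $147 at the assumed rate
```

Whether that compares favorably to a given instance SKU depends entirely on which prices you plug in, which is rather the point of the replies below.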


Depends on your use-case.


You are comparing one overpriced SKU to another overpriced SKU.


> There are a lot of moving pieces in our system and sometimes there are things outside of Google's control.

Are you implying that the cause of this outage is not Google's fault? If so, can you go into more details about that?


> The disruptions with Google Cloud Networking and Load Balancing have been root caused to physical damage to multiple concurrent fiber bundles serving network paths in us-east1, and we expect a full resolution within the next 24 hours.

From the dashboard. Looks like this can be blamed on an Act of Backhoe.


Not him but oftentimes cloud outages can be due to issues with the network connections to the datacenter, or power outages.

Datacenters also sometimes have other single points of failure such as DNS, but those are within the company's control.

https://www.networkworld.com/article/3373646/network-problem...

https://www.datacenterknowledge.com/uptime/equinix-power-out...


But data centers are typically designed with network and power failures in mind, no? Isn't this why these kinds of ring-based network topologies exist, so that whenever a single network connection fails, it can still easily be routed around?


Almost always, yes, but the problem is that everyone has to start routing around the problem and it creates congestion. Those redundant pipes don't sit idle. They are sharing the traffic.

As mentioned in another thread, in this case, Google has rerouted google.com traffic out of the region to try to mitigate the congestion.


On a smaller scale, to link up a few datacenters that are a few miles apart? Sure. On a grand scale though, no. Nobody's running an extra undersea cable from Japan to Singapore so that they can have a ring topology. Or trenching a second PBps of cables across the Appalachian Mountains. When something like that gets busted you go and reroute your least important traffic and send out the repair crew.


Cool man let me know how I can run my Calendar in multiple regions.


Thanks for the reply Terrance. But isn't it more expensive to run in more than one region?


Absolutely.

For some customers it is the right thing; for other customers it may not be.

Every provider will have failures. So the question mostly boils down to: does paying for more than 1 region cost more or less than the lost productivity or revenue of an outage like this?

For some places, the most costly thing they spend money on is employees. If your whole company comes to a stop for even 1 hour, it may cost more than the engineering effort for multi-zone, multi-region, or multi-cloud for your critical environments.
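A back-of-envelope version of that trade-off, with every number an illustrative assumption:

```python
# Rough model of "outage cost vs. redundancy cost". All figures here are
# assumptions; plug in your own headcount, loaded rates, and redundancy costs.
def outage_payroll_cost(employees, loaded_hourly_rate, outage_hours):
    """Payroll burned while the whole company is stopped."""
    return employees * loaded_hourly_rate * outage_hours

def redundancy_pays_off(employees, loaded_hourly_rate, outage_hours,
                        monthly_redundancy_cost):
    """True if a single outage already costs more than a month of redundancy."""
    return outage_payroll_cost(employees, loaded_hourly_rate,
                               outage_hours) > monthly_redundancy_cost

# 200 people idle for 2 hours at a $75/hr loaded rate:
print(outage_payroll_cost(200, 75, 2))  # 30000
```

For a small team the same arithmetic can easily point the other way, which is why the answer differs per customer.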


How do you use multiple regions when Google only supports certain things in limited regions, like Dataflow Shuffle only being available in a single region in North America? https://cloud.google.com/dataflow/docs/guides/deploying-a-pi...



