Migrating Millions of Concurrent WebSockets to Envoy

ex3ndr · on March 17, 2021

I am curious about backend part - do ws is still ws on services? Why? For example, why to have thousands of connections instead of a single one (or a bunch) that simply forwards websocket packets with some "connection id" with them.

This way you could restart service without killing ws connection, move all overhead of handling millions of connections to the lb.

jkarneges · on March 18, 2021

> why to have thousands of connections instead of a single one (or a bunch) that simply forwards websocket packets with some "connection id"

For anyone wanting to do this sort of thing, check out Pushpin [1]. It's a proxy server that can manage raw WebSocket connections on behalf of any HTTP backend. There are middleware libraries (such as [2]) to make it easy to write handler code.

Disclosure: lead dev. Came up with the idea after developing a custom connection manager thing at a previous company.

[1] https://pushpin.org [2] https://github.com/fanout/js-serve-grip/blob/master/examples...

bpicolo · on March 18, 2021

If I remember right, Slack's main application was PHP, now Hack, based, so I wouldn't be surprised to hear that app doesn't hold onto websocket connections itself - long-lived connections aren't historically PHP's wheelhouse

1f60c · on March 18, 2021

I didn't know anyone outside of Facebook used Hack! TIL.

SEJeff · on March 18, 2021

https://www.youtube.com/watch?v=nPMI1uSrS8g here is a video of one of Slack's lead engineer discussing their migration from php --> hack

ec109685 · on March 18, 2021

Slack has a Java server managing real-time messaging: https://www.quora.com/What-is-the-tech-stack-behind-Slack

“This part of our back-end, responsible for most of the real-time interactions happening in various clients, is implemented in Java, and accessed via a WebSocket API described here: Real Time Messaging API. A great deal of the user experience that makes Slack feel like Slack is the result of careful work on the message server. Many of the distributed systems challenges we face on the back-end are in coordinating this service with the rest of the (LAMP-ier) back-end.”

Matthias247 · on March 17, 2021

You could do that. But it would introduce a lot of head of line blocking on that connection. In return you Save a TLS handshake which might not be too important for long lived connections.

Plus it requires the backend to support that custom multiplexing layer.

muxator · on March 18, 2021

How so? Why there should be a head of line blocking? This would be a simple multiplexing of the payloads.

Matthias247 · on March 18, 2021

If there on the other connection a lot of messages are exchanged (and a websocket frame could be gigabytes long according to the spec) the other stream would be blocked behind that.Of course that can be mitigated by using a non websocket framing, but that in the end reinvents yet another transport layer.

winrid · on March 18, 2021

If you're looking to do this, I recommend the Nchan module for Nginx.

AlchemistCamp · on March 18, 2021

I remember reading about Discord hitting 5 million concurrent users back in 2017.

It would be really interesting to compare the current scale of the two offerings, in terms of peak concurrent users, hardware needed and engineering team size.

The user base is very different (business for Slack and gaming for Discord), but has started overlapping more and more in recent years.

yawaramin · on March 18, 2021

They handled the complex config in an interesting way. I'm taking a look at Dhall for a real use case at work recently, and thinking it might be able to handle a lot of complex config scenarios. I guess that remains to be proven out.

dilyevsky · on March 18, 2021

If you’re looking at other solutions check out https://github.com/stripe/skycfg It works with Envoy and lots of other things that support protobuf configs

yawaramin · on March 18, 2021

For the time being I'm targeting YAML and JSON, so satisfied with the feature set that Dhall is offering.

dilyevsky · on March 19, 2021

Easy to render output as either and bonus is if you know Python, Starlark is very similar

yawaramin · on March 22, 2021

I checked it out, but it's not obvious to me how to target YAML or JSON using it. In contrast, the first use case the Dhall homepage is how to convert to YAML. There are other things Dhall makes obvious, like how to create a record type to target specific YAML/JSON constructs, with type-checking.

Also I prefer Dhall's evaluation model (all values are at collapsed at compile time into a single output value with no side effects) to Starlark's full-fledged 'runnable' model, e.g. with the 'print' function I can print out anything I like, including possibly secrets and passwords. This is impossible with Dhall.

That said, I do wish that Dhall had something closer to Starlark's syntax and better type inference :-)

Aeolun · on March 18, 2021

I’m really hppy to see that there’s a whole new generation of services being released that finally feels like they cover all the modern use cases.

I had that feeling with nginx vs Apache too of course, but this is another level of that same thing.

ofrzeta · on March 18, 2021

I share that sentiment but also want to point out that HAProxy has gained a lot of support for "modern use cases" over the years, for instance the "Data Plane API" that lets you configure HAProxy dynamically.

endisneigh · on March 17, 2021

I wonder if Slack has considered using webrtc to do peer to peer chats on the client side and then gathering up the chat metadata and having each client periodically send their version of the history and reconciling it server side.

This would also have the effect of allowing slack to peer more or less normally even if Slack was down (of course bots, search, etc wouldn’t work).

lovedswain · on March 17, 2021

It's possible to implement all of this without inheriting the additional infrastructure and networking complexity WebRTC brings along with it, not forgetting WebRTC still relies on centralized components to coordinate. Don't use WebRTC unless you really need the features it offers, routers in many scenarios hate it and even where they allow it, the combinatorial explosion in possible configurations to support and diagnose between peers is a problem nobody should willingly invite unless they can't achieve a solution any other way

With WebRTC you give up the nice ultra-low-common-denominator "outbound port 443/TCP needs to work" requirement and replace it with "UDP networking generally healthy, possible to establish port mappings, possible to maintain stable port mappings over time, possible to not have mappings go away due to lack of traffic" etc etc

meheleventyone · on March 17, 2021

Hah, this is so true. Am building a little hobby project to try out WebRTC for game development. On my ISP provided router a Mac and Windows computer can’t see each other over WiFi due to some mDNS issue likely the router support for multicast. Using Chrome flags to turn off mDNS and they can connect fine but obviously expose internal IPs. Wire one of the machines and mDNS works. TURN is essentially a necessity but then why not use a server (particularly for a chat app).

mypalmike · on March 18, 2021

If your topology is essentially client server, you don't need a TURN or even a STUN server for webrtc connectivity. The client can send dummy candidate IPs in the offer, but receives valid server IPs in the answer. The client starts initiating transport to the server and reaches it. Peer reflexive candidate processing by the server allows it to communicate back to the client even if transport is UDP.

It's definitely more complex than websockets, but removing STUN servers from the equation does simplify connection initiation a bit.

meheleventyone · on March 18, 2021

I'm aiming for peer to peer. :)

mypalmike · on March 18, 2021

Interesting! There aren't a lot of peer to peer multiplayer game architectures nowadays. Good luck!

meheleventyone · on March 18, 2021

Thanks, it's actually more common than you'd think particularly in certain genres. For big examples with a large player bases you have GTA Online and the Destiny games. Fighting games also tend to run peer to peer both because there is (usually) only two people in a fight and because they are quite latency sensitive.

My own interest comes from many years of building multiplayer games professionally and wanting to make smaller scale multiplayer games for fun without needing to invest too much in infrastructure.

littlestymaar · on March 18, 2021

If you're using the DataChannel only (which is common when using WebRTC for games) you don't need a TURN (that adds complexity and you don't need it useless you're dealing with media streams) instead all you need is to re-use your signaling server as a fallback to get your messages from two peer who cannot establish a connection to one another.

meheleventyone · on March 18, 2021

That's merely equivalent though. Ultimately TURN is just a common specification for a relay. You also run head first into latency issues where you can run one signaling server globally but need regional relays to keep latency down.

Could you expand a bit on why MediaChannels need TURN specifically? Is there some complication that prevents just dumb relaying of the packets ala what you're suggesting?

SahAssar · on March 17, 2021

Sounds like you mean STUN, not TURN.

meheleventyone · on March 17, 2021

No, I’m using a STUN server. This issue is unrelated and due to the local IPs being masked by mDNS addresses so that local network topology isn’t leaked to the world at large and my routers handling of mDNS. Which is why everything works over the local network if I disable mDNS use in Chrome. TURN is the ultimate fallback to being unable to NAT punch.

Ironically getting machines connected across the internet with WebRTC has so far been relatively smooth sailing.

mypalmike · on March 18, 2021

TURN should never be necessary if one of your endpoints has a public IP address.

littlestymaar · on March 17, 2021

There's some truth in what you said, but also a few exaggerations.

First of all, while WebRTC has its share of complexity when using it for videoconferencing, here we are talking about using the DataChannel, which is really straightforward to use and doesn't need additional infrastructure.

> not forgetting WebRTC still relies on centralized components to coordinate

It needs a centralized component to setup the connection (signaling), if it fails later, your communication channel is still up. And the good thing if you have a websocket-based chat service, is that you can directly use it for the signaling purpose with zero modifications on the back-end side.

> routers in many scenarios hate it and even where they allow it, the combinatorial explosion in possible configurations to support and diagnose between peers is a problem nobody should willingly invite unless they can't achieve a solution any other way

When using the Datachannel, your failure mode is can't establish a connection, not some hard to understand Heisenbug. All you need is to provide a centralized fallback for clients who cannot establish a connection. This fallback will depend on the centralized service being up, but in case of failure you'll keep most of your users without disturbance (at least in the first world, the network is not as WebRTC friendly in other places of the world). And because the DataChannel's API is close to the WebSocket's one, implementing the fallback is straightforward.

Though, in Slack's situation there is a good reason not to use WebRTC: they can have several thousands of people in the same channel (IIRC IBM uses Slack and have most of their employees in a shared channel for official announcements). You won't be able to do that with WebRTC[1] if a user needs to establish a connection with every other users in the channel (there's just not enough ports available). And even worse, back in 2016, Chrome's implementation of the DataChannel was so poor, you could not establish more than a handful of PeerConnection before feeling the browser's becoming sluggish (this wasn't the case in Firefox so maybe Google fixed that since then).

Also, Slack's users are likely to be in some enterprise network, which makes WebRTC more likely to fail than when you customers are home, which reduces the opportunity.

Main takeaway: WebRTC-based chat is probably not a great fit for Slack, but don't be afraid of using it: it's not hard, it combines well with your already existing centralized infrastructure, and can massively reduces the load on it.

[1] unless you want to build some fancy sparse mesh network, but this is likely overengineering.

SahAssar · on March 17, 2021

It sounds like the only thing you did was signaling, not STUN and TURN.

If you do both STUN and TURN it works on most networks. I've worked at really restricted work sites, and while STUN fails at those if you have a TURN server then it almost always works.

These sort of comments are why people think webRTC is unstable while the same people use slack calls which literally use webRTC.

I might be wrong, but please don't talk about network reliability in webRTC without specifying if you have a working STUN and/or TURN setup.

lovedswain · on March 17, 2021

What has WebRTC got to do with me personally? Please try to speak to the topic, it makes for better reading, and threads much less likely to lead in the wrong direction.

> If you do both STUN and TURN it works on most networks

TURN requires clean outbound UDP/TCP connectivity which is far from ubiquitous, there are numerous corporate firewalls where "CONNECT <x>:443" is the best that could be hoped for, and even in some of those, where if the resulting connection did not include an obvious SSL handshake the connection would be immediately reset.

ryanianian · on March 17, 2021

I suspect there may be regulatory restrictions about allowing text-based communications that aren't available during an audit.

ssss11 · on March 17, 2021

What regulation do you think would apply? And how/why would this regulation differ for e2e encrypted chat products like Signal, Telegram, WhatsApp etc that can’t access text based chat messages?

zonotope · on March 17, 2021

IANAL, but the enterprise companies that make up Slack's customer base are often under regulations to preserve their employees' official communications in case they are needed for future investigations. Those same regulations prevent them from using the products you listed as official communication channels.

toomuchtodo · on March 17, 2021

FINRA recordkeeping and retention requirements, as well as SEC statute around records and reporting requirements (finance industry specific).

detaro · on March 17, 2021

Companies that have such audit requirements do not use Signal et al either for internal comms.

jeffbee · on March 17, 2021

When you control both client and server it seems like hot restart is just a complicated stunt you don’t need. Isn’t it fine to just stop accepting connections, tell all your clients to reconnect, and do a normal restart? The frontend load balancer that stands between you and Gmail doesn’t know how to restart hot but you probably never noticed.

mbyio · on March 17, 2021

I think Slack is different than Gmail because people are actively having conversations, so if you disconnect, it is much more likely to be noticeable and annoying.

Reading between the lines, I think what they would need is a way to tell clients to move to a new websocket connection at the proxy layer. I don't think there is an easy built-in way to do this in the websocket protocol, so they would have to implement something custom in their application layer. This would also require triggering custom code in the client to make a new websocket connection, start using it, and then close the old connection.

I feel like it would have been simpler to just have the client do a graceful reconnect every 5 minutes. But they probably decided to use envoy so they could have the other advantages too.

zemo · on March 17, 2021

maybe it could work, but in practice it's often not as easy as you'd like it to be. disconnecting everyone at or around the same time simultaneously can easily create a thundering herd or a TCP global synchronization problem, so "just ask everyone to reconnect" has its own set of complications.

toast0 · on March 17, 2021

Depending on where the load balancer is in the protocol stack, it might be hard to signal the client to please reconnect. I've tended to run/use load balancers in tcp mode or direct server return mode; in both cases, the load balancer can't signal the client or server (except by killing connections, which isn't ideal).

If you do restart with reconnect, many clients will need to reconnect multiple times, as it they get kicked off early, they'll tend to connect to another old server, and have to get kicked off again later. This isn't great because connection startup is usually expensive; also a (hopefully) small minority of clients may have difficculty connecting because of network issues and it may take them a substantial time to get reconnected.

If you have a working hotload path, you get clients to new server server code faster using less resources.

Load balancing software is also often behind some other layer of load balancing/availability (DNS or ECMP or BGP or CARP), and a hot load means you don't need to churn that layer, which can avoid any issues with changes in that layer.

jsiepkes · on March 17, 2021

> Isn’t it fine to just stop accepting connections, tell all your clients to reconnect, and do a normal restart?

Dependents on how many config changes you need (per day).

Besides Envoy supports it and I would call it a bonus if you can reload your configuration without client interruption. As for complicating things the implementation for hot reload isn't terribly complicated in Envoy.

jeffbee · on March 17, 2021

I'm mentally separating the hot restart part from the reloadable configs part, even though they are together in the article. To me, not having reloadable configs is too crazy to even imagine.

hermanradtke · on March 17, 2021

> stop accepting connections, tell all your clients to reconnect

This "drain" pattern is great for maintenance, upgrades, etc too.

The only caveat is that the clients need to be given time to migrate. How long that is depends on how well the clients behave. A hot restart may be much faster.

paxys · on March 18, 2021

It is normal to have production deploys continuously throughout the day. If your client disconnected and reconnected every time a Slack engineer decided to push some code, it would get annoying pretty quick.

jeffbee · on March 18, 2021

I don't see why even an endless series of connection interruptions would ever need to come to my attention.

jayd16 · on March 17, 2021

The Gmail load balancer has to do a cold restart to add or remove an instance? That's the requirement they placed on themselves because they do not trust the runtime HAProxy api.

jfrunyon · on March 18, 2021

Are you saying that Slack don't trust it? Or Google? If the former, it's actually because they don't trust their code that used the HAProxy API. https://slack.engineering/a-terrible-horrible-no-good-very-b...

> Consul-template renders a new version of the host list, and a separate program developed at Slack, haproxy-server-state-management, reads that host list and uses the HAProxy Runtime API to update the HAProxy state. ... However, over the course of the day, a problem developed. The program which synced the host list generated by consul template with the HAProxy server state had a bug.

theflyinghorse · on March 17, 2021

If they do not trust HAProxy runtime API then why are they using HAProxy at all?

vad_ · on March 17, 2021

Haproxy itself is a solid piece of software. The runtime API is something they added on top of it because of competitors (envoy).

forgotmypw17 · on March 17, 2021

Is there a plan to migrate to an open protocol to allow accessible clients?

kureikain · on March 18, 2021

This is great but why they have to do that? Slack uptime isnt great. They might be up globally but they definerely degrade performance all the time. I got issue like fail to send message very often.

Plus, it is just chat. We re really ok with it being down

Aeolun · on March 18, 2021

You may be ok with your chat being broken at random times, but I think it’s fairly annoying.

ColeByrne · on March 18, 2021

This is big news. Simultaneous migration of web sockets will require a lot of power and strong control over the process. I see that you did it, how did you do it? Share your experience, thank you in advance. Specialists are always appreciated (https://www.ramotion.com/)