Ask HN: How do you scale WebSocket?

13 points by ravirajx7 3y ago | 16 comments

Hi All,

I have a screen which displays a QR code in the browser and at the same time opens a WebSocket connection with my backend (Spring). Once the payment is done, a webhook request arrives at one of my backend endpoints with the payment state (success/failure) and other order details.

The application works fine when it is running locally or with only a single instance. But since we have our own Auto Scaling group and load balancer configured over DNS, the connection is not getting established all the time.

So, how exactly should this be architected to scale horizontally? I don't have any DB configured as of now. I have thought of using SNS with SQS, but that seems like too much overhead. How do big companies like WhatsApp scale?

16 comments

toast0 3y ago

From what you've commented elsewhere, it sounds like the immediate question is how do you route an external event (webhook) to the right websocket.

You need to include information that comes back in the event that identifies the websocket, either directly (server id + an id the server would understand) or indirectly (something that you can look up in a table/hash to get direct information).

More likely than not, you'll actually want to do the indirect option with something that identifies the browser (session id?) in case the browser reconnects to the websocket before the event comes in. Connectivity is fragile, and clients roam across wifi access points or between wifi and lte, or sometimes between lte boundaries that necessitate different IPs, or sometimes their modem is reset while they wait and they get a new IP, etc. Or something in their network path hates the world and closes idle connections on a very aggressive timeout, or closes connections after a short timeout regardless of idle. Lots of scenarios where a reconnection is likely.
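To make the indirect option concrete, here is a minimal sketch of a per-server registry keyed by a stable session id, so that a reconnect just replaces the live connection for that session. The class name `SessionRegistry`, its methods, and the `Send` callable (a stand-in for whatever your websocket library uses to write to a client) are all hypothetical names for illustration, not part of Spring or any other framework.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical stand-in for "write this text to one websocket client".
Send = Callable[[str], None]


@dataclass
class SessionRegistry:
    """Maps a stable session id to whichever connection is currently live."""

    _by_session: Dict[str, Send] = field(default_factory=dict)

    def attach(self, session_id: str, send: Send) -> None:
        # A reconnect replaces the previous entry for the same session.
        self._by_session[session_id] = send

    def detach(self, session_id: str, send: Send) -> None:
        # Remove the entry only if it still points at this connection, so a
        # late close of the old socket cannot evict the newer one.
        if self._by_session.get(session_id) is send:
            del self._by_session[session_id]

    def deliver(self, session_id: str, message: str) -> bool:
        # Returns False when no connection for this session is held here.
        send = self._by_session.get(session_id)
        if send is None:
            return False
        send(message)
        return True
```

The webhook handler would then call `deliver(session_id, ...)` with the id that came back in the event, rather than trying to remember a specific socket.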

Finally, since you asked about horizontal scaling, my best advice is to scale vertically first. It's usually simpler to manage one server that can do 1M connections than 1k servers that can each do 1k connections. Depending on details, fewer but larger servers can be less expensive than many smaller ones, although that changes when your large servers start getting exotic. More than two CPU sockets is a big inflection point: at that point you most likely want to scale horizontally rather than get a quad-socket monster (but they exist, and so do eight-socket machines).

unraveller 3y ago

https://medium.com/@14domino/using-nats-to-build-a-very-func...

No database: just have the user's websocket reach a simple websocket server that always passes requests on to a fuller API server. The API server can speak back to the NATS server, which triggers a push to the user, since the websocket server is coupled to NATS. This gives horizontal scale to both the API servers (if they ignore users/work not meant for them) and the websocket servers.

monroewalker 3y ago

I could be wrong, but I wouldn't think the autoscaling or load balancing would affect the websocket connection. There may be another aspect of the infrastructure that's preventing the connection though. Can you share more about the setup? Strange that the connection would succeed sometimes but not others. This could be related to the configuration of some intermediate network layer. E.g. if Nginx is used, you may have to look into the settings that are needed to ensure websockets work well. Take a look at these pages:

https://stackoverflow.com/questions/12102110/nginx-to-revers...

https://stackoverflow.com/questions/10550558/nginx-tcp-webso...

ravirajx7 (OP) 3y ago

Hi, we're not using Nginx. We are using a Docker image, and everything is configured through Cloudflare (the details of which I'm not properly aware of).

The problem with websockets, though, is that they are stateful: whenever a connection is established, the load balancer sends it to one instance out of several. When the webhook request comes in, it is a normal POST request and carries no information about which instance holds the websocket connection, so it may be routed to an instance where the connection was not established, and our backend is then unable to process the request from that instance.

monroewalker 3y ago

One thing to consider would be Kafka. When a webhook comes in, you publish a message to your topic that includes some reference to the user such as the payment id or user id. Each server consumes from the topic and if it finds a message for which it has an active websocket (or long polling) connection, then it pushes that message back to the user. This page shows how to have all consumers consume all messages: https://stackoverflow.com/questions/23136500/how-kafka-broad.... Spring Boot has really convenient integration with Kafka which makes setup pretty straightforward. The AWS Kafka service is also quite easy to set up. Having all servers consume doesn't scale ideally since more payments and webhooks will result in more Kafka messages sent to each server, but partitioning would probably be too tricky with a variable number of servers due to autoscaling.
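A rough sketch of that broadcast pattern, using an in-memory stand-in rather than a real Kafka client: every instance sees every event and only the one holding the matching connection delivers it. `Instance`, `BroadcastTopic`, and the `payment_id`/`status` keys are hypothetical names chosen for illustration. With real Kafka, the usual way to get this "everyone consumes everything" behaviour is to give each instance its own consumer group.

```python
from typing import Callable, Dict, List

Event = Dict[str, str]


class Instance:
    """One backend server holding some websocket connections locally."""

    def __init__(self, name: str) -> None:
        self.name = name
        # payment_id -> messages pushed to that client (stand-in for a socket)
        self.sockets: Dict[str, List[str]] = {}

    def open_socket(self, payment_id: str) -> None:
        self.sockets[payment_id] = []

    def on_event(self, event: Event) -> bool:
        outbox = self.sockets.get(event["payment_id"])
        if outbox is None:
            return False  # connection lives on another instance, ignore
        outbox.append(event["status"])
        return True


class BroadcastTopic:
    """In-memory stand-in for a topic that every instance fully consumes."""

    def __init__(self) -> None:
        self._handlers: List[Callable[[Event], bool]] = []

    def subscribe(self, handler: Callable[[Event], bool]) -> None:
        self._handlers.append(handler)

    def publish(self, event: Event) -> int:
        # Every subscriber sees every event; count how many delivered it.
        return sum(1 for handler in self._handlers if handler(event))
```

Whichever instance receives the webhook only has to call `publish`; it never needs to know where the websocket lives.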

Another approach could be to save the association between the server and the session in a database. When a webhook comes in, if the current server doesn't have the target session, look up which server does and make a request to an internal endpoint on that server to send the message over.
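The lookup-and-forward idea could look roughly like this. The `owners` dict stands in for the database table, and `deliver_local` and `forward` are hypothetical callables standing in for "push over my own websocket" and "call the internal endpoint on the owning server"; none of these names come from a real library.

```python
from typing import Callable, Dict


def route_webhook(
    session_id: str,
    payload: str,
    local_server: str,
    owners: Dict[str, str],
    deliver_local: Callable[[str, str], None],
    forward: Callable[[str, str, str], None],
) -> str:
    """Deliver a webhook payload to whichever server owns the session.

    `owners` maps session id -> server id (a stand-in for the DB lookup).
    Returns "local", "forwarded", or "unknown" to describe what happened.
    """
    owner = owners.get(session_id)
    if owner is None:
        # No server has registered this session (client gone or not yet connected).
        return "unknown"
    if owner == local_server:
        deliver_local(session_id, payload)
        return "local"
    # Another server holds the websocket: hand the payload over to it.
    forward(owner, session_id, payload)
    return "forwarded"
```

The "unknown" case is worth handling deliberately, for example by storing the result so the client can pick it up when it reconnects.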

You could also look into Redis for this. Have the server which is handling the websocket subscribe to changes on a key associated with the user's payment. When a webhook is received, just update that key in Redis.

timebomb0 3y ago

Socket.io has a great article on how you can set up your architecture to scale WebSocket servers: https://socket.io/docs/v4/using-multiple-nodes/

monroewalker 3y ago

The whole article seems to be about sticky sessions which are needed for the long polling fallback transport option Socket.io uses when websockets can't be used.

E.g. from the article: "the WebSocket transport does not have this limitation, since it relies on a single TCP connection for the whole session. Which means that if you disable the HTTP long-polling transport (which is a perfectly valid choice in 2021), you won't need sticky sessions"

bobkazamakis 3y ago

>since it relies on a single TCP connection for the whole session.

It still has to exist and stay active (i.e., interact with the correct node).

blablablub 3y ago

" the connection is not getting established all the time" Which connection does not get established? The webhook connection to your backend or the websocket connection to your backend? Or do you get a webhook response, but failing to send a response via websocket?

ravirajx7 (OP) 3y ago

It's the websocket connection which is not getting established. The webhook is a simple POST request and it's arriving fine. The problem is that sometimes the webhook arrives at an instance other than the one where the websocket connection with the client was established. Ideally I would be happy if this request could reach both instances through some configuration on the load balancer.

blablablub 3y ago

As per your other answer, websocket connections are established just fine

"Now, whenever a new Webhook response comes ... it doesn't have information regarding which instance was used earlier for making the websocket connection, it may send request to one of the instance where the connection was not established and thus our backend is not able to process this request from this particular instance."

Just include the internal IP of the websocket connection in the data you send to your billing operator, and then forward the POST request appropriately.

blablablub 3y ago

The easiest way is to check on the client side whether the websocket connection is established; if not, just try to reconnect after a couple of seconds. Maybe there is a problem on the backend where DNS is updated faster than a new instance is brought up, so retrying is your best option.
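As a sketch of the retry idea (shown in Python for brevity; a browser client would do the same thing in JavaScript), something like the following retries a connection attempt with a growing delay. `connect_with_retry`, its parameters, and the choice of doubling the delay are assumptions for illustration; `connect` is any callable that raises `ConnectionError` on failure.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 2.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `connect` until it succeeds or `attempts` tries are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of retries, surface the last error
            sleep(delay)
            # Back off so a struggling backend is not hammered.
            delay = min(delay * 2, max_delay)
    raise AssertionError("unreachable")
```

Passing `sleep` in as a parameter is only there to make the function easy to test without actually waiting.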

matt321 3y ago

I'd use a database and have the websocket just periodically check the database for payment confirmation. This way, when one instance gets the webhook activity, it puts it in the database and then.... ??? .... profit

ravirajx7 (OP) 3y ago

Hi, thank you for responding. Actually we have an external API which provides this detail as soon as the payment is done. But we wanted something where, once the payment is done, we can send the details directly back to the connected client using the incoming webhook, instead of making the client (browser) call another API or periodically fetch the state from a DB.

banashark 3y ago

Without knowing more than the details you've just provided, I would suggest giving the various options a second look and weighing the pros/cons.

Websockets to me don't seem like the ideal approach here, since the communication is just from the server to the client (an update of a data payload once an event occurs in the backend system).

Websockets have quite a bit of technical complexity requiring significant architectural effort to ensure reliability of a service. Ably is a company that offers websockets as a service and has some good blog articles to start you out if you're sure about this path.

What I would recommend with the details provided so far is to either use SSE or long-polling. The "downsides" are often exaggerated, and there are lots of businesses that one would assume use websockets that really are just using SSEs because operationally and architecturally it is vastly simpler to reason about.
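For a sense of how little there is to SSE on the wire: the server keeps a plain HTTP response open with `Content-Type: text/event-stream` and writes text frames like the ones this helper produces. The function name `format_sse` and the `payment` event name are just illustrative choices.

```python
from typing import Optional


def format_sse(data: str, event: Optional[str] = None) -> str:
    """Format one Server-Sent Events frame as text for the response body."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # Each line of the payload gets its own "data:" field.
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    # A blank line terminates the frame.
    return "\n".join(lines) + "\n\n"
```

For example, `format_sse('{"status":"success"}', event="payment")` yields a frame the browser's `EventSource` can receive as a `payment` event.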

I can almost guarantee that the complexity of adding another API endpoint will be drastically less than standing up a reliable websocket infrastructure.

ravirajx7 (OP) 3y ago

Hi, I agree with each of the points you mentioned. Still, curiosity is making me want to achieve this goal, as we have ample time and this is indeed a business need.

It would be very nice of you if you could share a few solutions other than using Ably's API :(
