However, many split brains and grey hairs later, I decided RabbitMQ was almost never worth it, regardless of how many of AMQP's advanced features you could make use of.
For the longest time I just made do with Kafka, but it had serious deficiencies when implementing queues because of Kafka's cumulative-ack-only nature.
Recently I have started using Pulsar, which provides selective acks and all the best parts of AMQP without the complexity and unneeded parts: it has things like scheduled delivery and TTLs, in addition to the all-important shared subscription, which makes queues "just work" on top of streams.
If you want something like RabbitMQ but with a simpler API and are comfortable with JVM services give Pulsar a go. It's not for everyone but if you are already using a lot of the big data stack it's probably a good fit.
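The shared-subscription behaviour is easy to picture with a toy model (plain Python, not the Pulsar client API): one topic, several consumers attached to the same subscription, messages dispatched round-robin so each consumer handles a share of the work.

```python
from collections import deque

class SharedSubscription:
    """Toy model of a Pulsar shared subscription: messages from one
    topic are handed out round-robin across the subscription's
    consumers, giving queue semantics on top of a stream."""
    def __init__(self, consumers):
        self.consumers = consumers
        self.pending = deque()
        self._rr = 0  # round-robin cursor

    def publish(self, msg):
        self.pending.append(msg)

    def dispatch(self):
        """Hand each pending message to the next consumer in turn."""
        assignments = {c: [] for c in self.consumers}
        while self.pending:
            msg = self.pending.popleft()
            consumer = self.consumers[self._rr % len(self.consumers)]
            self._rr += 1
            assignments[consumer].append(msg)
        return assignments

sub = SharedSubscription(["worker-1", "worker-2"])
for i in range(4):
    sub.publish(f"task-{i}")
work = sub.dispatch()
# Each worker gets an interleaved share of the stream: queue semantics.
print(work)  # {'worker-1': ['task-0', 'task-2'], 'worker-2': ['task-1', 'task-3']}
```

With Pulsar's other subscription types (exclusive, failover) the same topic instead behaves like a stream with a single active reader, which is what makes the one system cover both paradigms.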
Which of course leads me to believe the problem isn't with the people but with the ridiculously high threshold of knowledge, experience, and app-developer self-control needed to run RMQ successfully.
As the parent said, many meltdowns later, I'm now firmly in the "No Rabbit!" camp: Redis pub/sub and queues for immediate, lossy delivery; Kafka, GCP Pub/Sub, or AWS SQS for less latency-sensitive flows that require stronger consistency guarantees.
RabbitMQ is one of those things that I've always found better to let the experts run (managed SaaS), unless your team is really wanting to take on the burden of becoming an Erlang distributed system debugger :)
Pulsar seems really interesting... There are now more managed Pulsar offerings coming online (StreamNative, DataStax who bought Kesque, Pandio, etc)
You’d use Kafka more as an unbounded buffer and build different paradigms on top of it. It's not unusual to ingest hundreds of megabits of data into Kafka, potentially saturating the network, while also reading that much back out. AMQP is better for large numbers of queues where each queue holds fewer messages. Think MQTT, WebSockets - many, many consumers.
It would be reasonable to use both next to each other.
But I’d never go for RabbitMQ. I’d go for Azure Service Bus, or Artemis with Qpid.
I generally think of messaging systems as falling into four distinct categories: PubSub, Streaming, Queues, and Enterprise Messaging Systems.
PubSub systems are focused on (usually) non-durable, low-latency messaging, generally without acknowledgements and with at-most-once delivery, e.g. Redis PUBSUB, NATS, etc.
Queues are generally focused on fanning work out to multiple consumers, with at-least-once processing of durable messages and acknowledgements, e.g. Celery/Sidekiq, Que, AWS SQS.
Streaming systems are designed for throughput and are usually based on some form of distributed log. They generally offload offset management to consumers, e.g. Kafka, Kinesis.
Enterprise Messaging Systems favor flexibility above all else and usually have some mechanism for encoding the flow of data separately from the applications themselves - exchange routing topologies in AMQP, for example. They can generally implement pubsub, queue, and direct messaging paradigms. The tradeoffs are poorer availability, more complexity, and worse performance versus specialised systems, e.g. RabbitMQ, HornetQ, etc.
So you end up using Kafka when its limitations aren't a problem and you need the throughput. It works best when every message in a stream is homogeneous, such that failure to process one message is unlikely to be independent of failure to process the following one. This alleviates the main drawback of streaming systems, which is head-of-line blocking.
Some cases where it works very well are event streams, data replication/CDC, etc.
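The head-of-line-blocking tradeoff is easy to see in a toy sketch (not any real client API) contrasting a cumulative-offset consumer with a selective-ack queue consumer:

```python
def consume_stream(messages, handler):
    """Streaming/log consumer: acks are cumulative offsets, so a
    failing message blocks everything behind it in the partition
    (head-of-line blocking). Returns the messages processed."""
    processed = []
    for msg in messages:
        if not handler(msg):
            break  # cannot advance the offset past a failure
        processed.append(msg)
    return processed

def consume_queue(messages, handler):
    """Queue consumer with selective acks: a failing message is left
    unacked (for redelivery) but later messages still get through."""
    return [msg for msg in messages if handler(msg)]

handler = lambda m: m != "poison"
stream = ["a", "poison", "b", "c"]
print(consume_stream(stream, handler))  # ['a'] - blocked at the bad message
print(consume_queue(stream, handler))   # ['a', 'b', 'c'] - bad message skipped
```

If the messages are homogeneous, the "poison" case means the next messages would fail anyway, so the blocking costs you little - which is the point made above.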
If the order of message processing matters, then Kafka is better suited than AMQP. For example, in a distributed application for money transfers, if AMQP is used, message order will be lost and problems will occur in the following scenario:
User A, with an account balance of $1000, places orders for two transfers, T1 ($600) and T2 ($500):
- Rabbit delivers T1 to server1; before processing the message, server1 enters a full GC.
- Rabbit delivers T2 to server2, and server2 processes the message immediately; now User A's account has $500.
- Server1 comes back after the GC ends, but fails to process T1, since the account balance is less than the required amount.
However, it is T2 that should have failed, because User A ordered T1 first and T2 after. In Kafka, when the user's account identifier is used as the partitioning key, all of User A's messages will be processed by the same consumer (e.g. server1), so even if server1 enters a full GC that is OK, since T2 will still be processed after T1.
Kafka makes the brokers as dumb as possible to optimize for performance, and the logic sits in the client. You can ask to read back from the log at any point, or from the current tail. Acknowledging messages is just writing a bookmark to another topic saying where you last read up to, or you can keep track of it yourself somewhere else.
You can always build more complex logic on top which Confluent has done with things like ksqldb.
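That bookmark model can be sketched in a few lines (a simulation, not the Kafka API; in real Kafka the `__consumer_offsets` topic plays the role of the store):

```python
class OffsetStore:
    """Minimal external offset store: the consumer, not the broker,
    remembers how far it has read (the 'bookmark' model)."""
    def __init__(self):
        self._offsets = {}

    def commit(self, topic, partition, offset):
        self._offsets[(topic, partition)] = offset

    def last_committed(self, topic, partition):
        return self._offsets.get((topic, partition), -1)

log = ["m0", "m1", "m2", "m3", "m4"]  # an append-only partition
store = OffsetStore()
store.commit("events", 0, 2)  # consumer crashes after processing m0..m2

# On restart, resume from the bookmark rather than the head or tail.
resume_from = store.last_committed("events", 0) + 1
print(log[resume_from:])  # ['m3', 'm4']
```

Because the broker never deletes a message on ack, re-reading from any older bookmark is always possible, which is what makes replay and multiple independent consumer groups cheap.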
I'm curious what makes it go on your "never again" list? We've definitely had our fair share of issues with it, namely -
- Really easy to misconfigure queues/exchanges, especially when trying to set up something like retry + DLQ.
- If a queue builds up to a large number of messages (100 million+) for whatever reason, purging it will probably bring down the cluster.
Overall, our experience has been mostly positive. It isn't on my "never again" list, but I'm definitely wary of some parts of it, and it is one of the more difficult pieces of our infrastructure to scale.
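For what it's worth, the retry + DLQ topology usually hinges on a handful of queue arguments; a sketch of the two declarations (the exchange/queue names here are made up):

```python
# Queue arguments for a common RabbitMQ retry + DLQ topology.
# Rejected messages on the work queue are dead-lettered to 'work.dlx';
# messages parked on the retry queue expire back into the main
# exchange after a delay. (Exchange/queue names are hypothetical.)
work_queue_args = {
    "x-dead-letter-exchange": "work.dlx",  # where rejects go
}
retry_queue_args = {
    "x-message-ttl": 30_000,                    # park for 30 s
    "x-dead-letter-exchange": "work.exchange",  # then re-route back
    "x-dead-letter-routing-key": "work",
}
# With pika these would be passed as, e.g.:
#   channel.queue_declare(queue="work", durable=True,
#                         arguments=work_queue_args)
print(sorted(work_queue_args), sorted(retry_queue_args))
```

The footgun is that most of these arguments are immutable once the queue exists, so getting them wrong means deleting and re-declaring the queue - part of why this is so easy to misconfigure.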
We built a Kafka consumer that's effectively capable of selective acks by producing bad messages to separate topics. It's a little silly but it works.
It has been great overall.
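The pattern is roughly: never let a bad message hold the offset back, just divert it. A toy sketch of that consumer (in-memory, not a real Kafka client):

```python
def consume_with_dead_letter(messages, handler):
    """Approximate selective acks on a cumulative-ack log: messages
    that fail processing are re-published to a side topic instead of
    blocking the partition, so the main offset can always advance."""
    dead_letter_topic = []
    for msg in messages:
        try:
            handler(msg)
        except Exception:
            dead_letter_topic.append(msg)  # retry later, out of band
        # either way, the consumer can commit this offset and move on
    return dead_letter_topic

def handler(msg):
    if msg == "bad":
        raise ValueError(msg)

failed = consume_with_dead_letter(["ok-1", "bad", "ok-2"], handler)
print(failed)  # ['bad'] - the partition kept moving past the failure
```

The "silly" part the parent alludes to is that the side topic needs its own consumer, retry policy, and eventual give-up rule, all of which a queue broker would give you for free.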
So good I have recently decided to slow down client work and build a managed SaaS offering for Pulsar: https://turtlequeue.com It is a work in progress, however it is a bit different from the nascent Pulsar offerings out there.
The main goals are ease of use and low cost. How do I go about it?
1. Behind the scenes there is only one Pulsar cluster. This lowers hosting costs dramatically. Even the smallest production Pulsar cluster requires:
- ZooKeeper node(s)
- Bookie node(s)
- Broker node(s)
- (optional) Function worker node(s)
- (optional) Proxy node(s)
- Pulsar Manager
- Prometheus
- Grafana
.. typically this runs on top of Kubernetes these days, so throw in volume storage and a LoadBalancer. Hosting small setups is costly.
By having a shared cluster I can lower the costs enough to provide a free “try me” service at little to no cost to me. And nobody will suffer from the “noisy neighbour” problem, as Pulsar is designed to be multi-tenant and can enforce limits per tenant.
2. Tq (turtlequeue) users do not have to care about how the cluster operates (typical SaaS). It is also dramatically easier for me to monitor and operate only one cluster.
3. How do I expose this safely and make it easy for users to use Pulsar, then? Experienced Pulsar users will notice that this is not easy to do at the moment with Pulsar. I am developing a custom proxy! This in turn allows me to collect metrics, enforce finer-grained permissions, and present a nicer dashboard.
Where am I now? The custom proxy works, the website/docs/login/dashboard/metrics/pricing need a lot of TLC. So “soon”. I will be looking for beta testers, if you are interested please email turtle@turtlequeue.com Feel free to email me too if you just want to be kept in the loop :)
RabbitMQ is an excellent messaging middleware. But simply remember that it is not designed for optimal performance when holding on to data. "It is a river, not a lake". Performance is very sensitive to the amount of data that is in flight between send and receive end points.
Kafka is a "lake", but will not give you the rich routing and diverse semantics of Rabbit. If you are building 'event sourced' systems, Kafka and similar systems are a better choice.
Pulsar has a highly articulated architecture. It is built on BookKeeper and decouples the persistent-store servers from the client-serving servers. If you want to avoid the rebalancing pain of Kafka, Pulsar is the solution. However, Pulsar has many more moving pieces.
Both RabbitMQ and Pulsar are authentic 'middleware', and extensible. Kafka is, true to its genesis, a highly performant distributed log.
Durable, recoverable, and performant distributed messaging/journaling is inherently complex. Make sure you know precisely what it is that you require, and one of the above solutions will likely serve you well.
My observations so far:
* Running a single RabbitMQ is pretty boring in the good sense.
* We haven't managed to switch to a cluster for HA yet; it seems that software that deals with RabbitMQ clusters must be cluster-aware (consume from queues on all instances and the like), and it wasn't worth our effort to fix all the applications.
* In the long run, the lack of tooling is hurting us. Want to do blue-green deployments? Canary deployments? When your services run on HTTP(S), there are simply tools for that. When your services consume from AMQP queues? You have to go searching for solutions, and possibly build your own plumbing.
In the end, it turns out that we use publish/subscribe far less often than direct request/response patterns, so for new stuff I'd likely go with HTTPS instead of AMQP today.
Seems your conclusion “DON’T” implies the former, but this seems unnecessarily extreme.
What didn't help was that the pika Python client had so many issues. One of my GitHub issue reports included a 20-line demo showing how it broke under anything but trivial load. I gave up soon after, as I had further problems in other languages. Over a year later someone looked at it and said 'yep, fails under load' and fixed it.
One of those projects that showed so much promise only to make me sad.
[1] https://docs.aws.amazon.com/amazon-mq/latest/developer-guide...
I've found RabbitMQ difficult operationally, and full of footguns (that can make you lose data), so I'm not sure why would you want to use it if you don't already have to.
While my past experiences with RabbitMQ in production have been stellar, I can see why a team would be hesitant to add this complexity to their infrastructure.