Let me share my anecdote. At my last workplace I got onboarded onto RabbitMQ, and it was such a painful piece of software to work with, and almost impossible to set up locally, that I silently sneaked in a simple Redis list as a queue alternative for my dev environment. The whole RabbitMQ setup and its pika library were replaced by three lines of Python and a Redis server.
One day RabbitMQ died and it took the sysadmins a few weeks to get it back running. In that time I deployed my simple Redis list and never looked back. To this day the Redis solution works without any friction whatsoever, with a fraction of the resources.
Rabbit's AMQP exchange model is severely flawed and convoluted. It's the worst example of corporate software, where everything works and doesn't work at the same time.
I wouldn't recommend RabbitMQ to my worst enemy, yet there's still something attractive about it. Maybe there's a sane alternative? Maybe ZeroMQ?
That doesn't mean Redis is a drop-in replacement for all of the valid uses of RabbitMQ though...
It feels common for people to think their every problem requires a planetary-scale solution, or one that handles every conceivable case that could occur before the heat death of the universe. Be that because you're a startup and think you're going to need to support a hundred million concurrent users on launch day, or because you're an enterprise and think that, because you're oh so important, you need to use the same tools other important companies use.
My own anecdote: When we refactored our early-stage app, we had a huge mess with the Redis-based queue system. It was one of the biggest sources of errors and a massive pain to troubleshoot, or even to monitor what was going on. So we investigated a bunch of different solutions, including all the usual contenders, and we ended up with: let's just ditch the messages/queues altogether for now and just do boring old cron-like jobs invoking internal API endpoints at regular intervals. This made 9 out of 10 cases a lot easier to maintain, and for the remainder, while temporarily more difficult, we introduced a queue again at a much later stage.
I'm not saying you shouldn't use RabbitMQ, or that you can't benefit from it (or any technology, for that matter) at small scales. But I think too often in tech decision making, one's own or the company's perceived importance, what is cool, or what would look good on a CV takes precedence over what really fits the problem in context.
I agree RabbitMQ is often adopted in places it shouldn't be, or ill-configured. But to completely get rid of it because you didn't take the time to RTFM when setting it up seems a little extreme.
And redis/rabbitmq have completely different use-cases 80% of the time. Sounds like you were trying to get drunk on kombucha.
Went back again to tried-and-tested RabbitMQ and it works so well. Adding and removing nodes is also so easy; we just used Ansible to set up the Erlang cookie and connect the node (thanks to OTP and BEAM). The best part: for the important task queues where we cannot accept failure, we built a mechanism for fault tolerance. When you work with high-availability, fault-tolerant queues, RabbitMQ is so good; it can recover from hardware or even VM failures. Can't say the same for Redis, which was a nightmare even with Redis Cluster.
> ZeroMQ (also known as ØMQ, 0MQ, or zmq) looks like an embeddable networking library but acts like a concurrency framework.
And the old meme was "ZeroMQ is a replacement for Berkeley sockets".
It's a pretty cool networking library. But it makes no sense to think of it in the same slot as RabbitMQ, or even Redis.
That the GP mentioned it does make me wonder if they don't really understand what they're doing.
EDIT I love that the official guide actually has this diagram in it! Pieter Hintjens's death was a sad loss. http://zguide.zeromq.org/page:all#How-It-Began
Some criticize NATS for the absence of message durability in the core technology, but we figured out we can drop this requirement in 95% of cases. Your microservices should be highly available anyway, so there's always a live consumer, and it's better to hand the message to it directly rather than introducing the overhead of storing the message somewhere and dealing with more-than-once delivery.
There's a bit more to it, like letting your microservices exit gracefully and finish processing consumed messages. And of course in the 5% of cases where you can't afford to lose a single message, you have to use NATS Streaming or the like, but so far we've been greatly impressed by NATS under high load.
$ size /usr/lib/i386-linux-gnu/libzmq.so
text data bss dec hex filename
662134 12952 24 675110 a4d26 /usr/lib/i386-linux-gnu/libzmq.so
Yikes, is there an operating system kernel with virtual memory, USB support, a few filesystems and a TCP/IP stack hiding in there? It's like 35% of glibc:
$ size /lib/i386-linux-gnu/libc.so.6
text data bss dec hex filename
1917549 11624 11112 1940285 1d9b3d /lib/i386-linux-gnu/libc.so.6
Which is itself heavily bloated, trying to provide as much POSIX as it can.

I found it quite simple to install RabbitMQ server and its admin panel in my WSL local dev environment.
And the cloud/prod instance took a few clicks (just spun up a DO Marketplace server image) followed by < five minutes of RabbitMQ user and firewall configuration.
It was also dead simple to start using RabbitMQ within my application. I found a well maintained package, installed it, edited a couple lines of my application's config, and everything just worked.
I specifically avoided Redis based on my understanding that it can't guarantee message persistence, so if it crashes, your unprocessed messages are lost.
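For what it's worth, Redis does have persistence options; whether they're strong enough depends on your tolerance for loss. A sketch of the relevant redis.conf directives (AOF mode; note that even `everysec` can drop up to a second of writes on a crash):

```conf
# redis.conf — append-only-file persistence
appendonly yes
appendfsync everysec   # fsync at most once per second; up to ~1s of writes lost on crash
# appendfsync always   # fsync every write: much safer, much slower
```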
    import logging
    import redis

    log = logging.getLogger(__name__)
    r = redis.Redis()

    while True:
        # BLPOP blocks until a message is available (plain LPOP returns None
        # immediately when the list is empty, which would busy-loop)
        _, msg = r.blpop(key)
        try:
            do_something(msg)
        except Exception as e:
            log.error(f'failure for msg "{msg}", got {e}; back to queue {key}')
            r.rpush(key, msg)

Pop a member from the left of the list; if it failed, plop it back onto the other end. It's simple, explicit and just works™.

Never had any stability issues though. Once it was up and running, it was solid.
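One caveat with the pop-then-requeue pattern: if the worker dies after popping but before requeueing, that message is gone. A common fix is to atomically move the message onto a per-worker "processing" list first (Redis's LMOVE, formerly RPOPLPUSH), so a reaper can recover in-flight messages. A minimal in-memory sketch of the idea, with deques standing in for the two Redis lists (names here are mine, not from the comment above):

```python
from collections import deque

queue = deque(["job1", "job2"])   # stands in for the Redis work list
processing = deque()              # per-worker "in flight" list (LMOVE target)

def pop_reliably():
    # Move the message to the processing list before working on it, mirroring
    # Redis's LMOVE source dest RIGHT LEFT. If the worker dies after this,
    # a reaper can push items from `processing` back onto `queue`.
    msg = queue.pop()             # take from the RIGHT end, like RPOP
    processing.appendleft(msg)
    return msg

def ack(msg):
    # Work finished: drop the message from the processing list (LREM in Redis)
    processing.remove(msg)

msg = pop_reliably()
ack(msg)
```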
First, you'll learn about ack/noack and get the worker to ack on success.
Then, you'll learn about dead-letter queues etc. for delayed retries.
Now you'll have a topic exchange and a bit of hairy routing in place using wildcards.
And you'll mistakenly set the dead-letter routing key so that expired messages end up in multiple queues (retry queues and the actual worker queue...).
Then you rewrite your service in Python and use Celery or something.
It's nearly impossible to get RabbitMQ working correctly within a few months.
And I forgot about HA. Paying for hosted RabbitMQ might be better, but CloudAMQP in particular can be tricky as well: it can run out of AWS IOPS and your production gets hosed.
Also, setting up monitoring on queue health, shoveling error queues, etc. takes time to learn and apply. Be careful about routing keys when you shovel an error queue into a topic.
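The delayed-retry/dead-letter bookkeeping described above can be sketched broker-agnostically. This in-memory version (queue names and the retry limit are assumptions for illustration) shows the accounting you end up doing, whether via RabbitMQ's DLX headers or by hand:

```python
from collections import deque

MAX_RETRIES = 3
work_queue = deque()
dead_letters = []

def handle(msg, process):
    # Retry with a counter carried on the message; after MAX_RETRIES,
    # park it in a dead-letter list instead of retrying forever.
    try:
        process(msg["body"])
    except Exception:
        msg["retries"] = msg.get("retries", 0) + 1
        if msg["retries"] >= MAX_RETRIES:
            dead_letters.append(msg)   # dead-lettered: needs human attention
        else:
            work_queue.append(msg)     # requeue for a (delayed) retry

def always_fails(body):
    raise RuntimeError("boom")

msg = {"body": "payload"}
handle(msg, always_fails)              # first attempt fails -> requeued
while work_queue:                      # drain retries until dead-lettered
    handle(work_queue.popleft(), always_fails)
```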
Back to RabbitMQ though: we run an HA 2-node deployment (just one active writer) and have been for over 3 years, requiring minimal changes or maintenance of any kind. It has scaled to a hundred-plus queues, ranging from some with super-high message rates per second to some with only tens of messages per day. Some queues stay low and process fast; others are heavy jobs that get enqueued all at once and generate hundreds of thousands of jobs.
Sure, if you have a service that interacts with disks you should have an automated monitor that covers your IOPS consumption, but I don't see how that's specific to RabbitMQ; you should be doing this for all your instances.
All in all, these are two identical instances, one active, one failover, and in a world of Kafkas and Pulsars and understanding the ins and outs of SQS pricing and capacity allocation, RabbitMQ is a tool that I consider simple to administer and that allows me to sleep at night.
Interesting how the same tool can evoke such different reactions, but whatever works - works.
There are ways to repair it (and it has happened to me a grand total of once in 4 years), but it does happen. I personally try to make my message processing idempotent on the worker side to help alleviate these situations.
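Idempotent processing usually boils down to deduplicating on a message ID before applying side effects, so a redelivery after a partition or crash is harmless. A minimal sketch (in production the seen-set would live in Redis or a database with a TTL, not in process memory):

```python
processed_ids = set()  # in production: Redis/DB with a TTL, not process memory

def process_once(msg_id, payload, side_effect):
    # Idempotent consumer: a redelivered message (same id) is a no-op,
    # so at-least-once delivery doesn't double-apply side effects.
    if msg_id in processed_ids:
        return False
    side_effect(payload)
    processed_ids.add(msg_id)
    return True

applied = []
process_once("m1", 42, applied.append)
process_once("m1", 42, applied.append)  # redelivery of the same message: skipped
```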
Granted I don't know all the intricacies of RabbitMQ and this was just one step beyond os.popen, but it was painless, like half an hour painless to set up and it has worked really well.
*edit: reading some of the other posts now I'm waiting for the other shoe to drop. but so far it's worked wonderfully.
When I first started using RabbitMQ I experienced just about everything you described.
I felt incredibly stupid when a customer would have issues with a queue being stuck or messages that were being dropped, and having no clue on why this was happening.
> It's nearly impossible to get RabbitMQ working correctly within a few months.
This is so true. You can get it running in 10 minutes, but it takes weeks of banging your head against the wall and angry customers before you have it running right.
https://docs.microsoft.com/en-us/azure/storage/queues/storag...
Thinking of doing something that works like an async generator so I can just use it like...

    const work = queue.subscribe('somequeue');
    for await (const {item, done} of work) {
      // do something with the JSON.parsed item from the message
      await done(); // wrapper for the delete/finish
    }

I wish a standard set of higher abstractions existed on top of it though. Celery, from what I hear, fills that gap very well in the Python world, but nothing like this exists in Node.js land, which leaves the room open for a bunch of Redis-backed solutions that are pretty fragile in comparison.
https://redislabs.com/ebook/part-2-core-concepts/chapter-6-a...
- ActiveMQ more featureful, robust default settings, better integrated with Java/JMS but slower
- RabbitMQ faster, simpler, more "just works"
The defaults of ActiveMQ lean more towards robustness (hence often naive benchmarks will tell you it's slow). However in practice it is pretty damn easy to run, you literally can just download the default cross-platform distribution and type `./bin/activemq` and it will start running.
We use ActiveMQ + Apache Camel which makes a pretty nice combo to achieve lots of generalised messaging and routing functionality.
Yeah we hear you regarding AWS IOPS: for some type of loads and smaller plans we need to offer an alarm + an easy way to scale IOPS. It is something we're working on.
This was back in 2015 - might be better now.
The only downside is once you get message-queue-pilled, you start seeing opportunities to refactor/redesign with message queues everywhere and it can be hard to resist the urge. It really is remarkable how, when used appropriately, message queues can dramatically simplify a system.
I think this is why email will never die. It's basically turned into a huge message queue. Even voice mails come into my inbox.
====== EDIT - I meant to say "huge universal message queue" and left out the word "universal" accidentally
There's a lot of work in mailer daemons to ensure that email delivery is as reliable as possible in a store-and-forward system.
https://pypi.org/project/dirq/
Perl had the original implementation, and there are implementations in other languages.
Also, do you happen to know how well it works in a fault-tolerant way for communicating between services that are in different data centers?
My main use-case is to receive status/change notifications from a service running elsewhere from the API server servicing the UI, in order to avoid polling for new data.
I've also looked into other solutions (ActiveMQ, Google PubSub, ...) and RabbitMQ is by far the most straight-forward and quick to set up. There are some edge cases that it doesn't cover as well, for example automatic retries, but there are some "RabbitMQ patterns" to make it work. For a simple message broker/queue system, it's great and the docs are also great.
RabbitMQ gave us such a performance increase that we killed our database. We ended up having to rate limit RabbitMQ!
In general the defaults are pretty good I think. There is a one page production deployment guide: https://www.rabbitmq.com/production-checklist.html that I followed to replace our handbuilt cluster w/ a new automated deployment, plus a few other niceties like docker logs & rmq metrics to cloudwatch and then auto clustering via autoscaling groups lookup.
The thoroughness of the docs can perhaps seem daunting, but I see it as a badge of quality, and especially if you are growing its usage organically it should "just work".
If you're bad at hosting and need the throughput, there's cloudamqp.
So many options for pub/sub systems so use what works for you.
I feel like most continually-running backends will make use of RabbitMQ/NATS/ZeroMQ/etc, or more and more I see lightweight systems going completely serverless and just using lambdas - which are HTTP microservices.
In my last job, we used Rabbit to move about 15k messages per sec across about 2000 queues, with 200 producers (which produced to all queues) and 2000 consumers (which each read from their own queue). Any time any of the consumers would slow down or fail, Rabbit would run out of memory and crash, causing sitewide failure.
Additionally, Rabbit would invent network partitions out of thin air, which would cause it to lose messages, as when partitions are healed, all messages on an arbitrarily chosen side of the partition are discarded. (See https://aphyr.com/posts/315-jepsen-rabbitmq for more details about Rabbit's issues and some recommendations for running Rabbit, which sound worse than just using something else to me.)
We experimented with "high availability" mode, which caused the cluster to crash more frequently and lose more messages, "durability", which caused the cluster to crash more frequently and lose more messages, and trying to colocate all of our Rabbit nodes on the same rack (which did not fix the constant partitions, and caused us to totally fail when this rack lost power, as you'd expect.)
These are not theoretical problems. At one point, I spent an entire night fighting with this stupid thing alongside 4 other competent infrastructure engineers. The only long term solution that we found was to completely deprecate our use of Rabbit and use Kafka instead.
To anyone considering Rabbit, please reconsider! If you're OK with losing messages, then simply making an asynchronous fire-and-forget RPC directly to the relevant consumers may be a better solution for you, since at least there isn't more infrastructure to maintain.
This worked fine until things got behind and then we couldn't keep up. We were able to work around that by using a hashed exchange that spread messages across 4 queues. It hashed based on timestamp inserted by a timestamp plugin. Since all operations for a queue happen in the same event loop, any sort of backup led to pub and sub operations fighting for CPU time. By spreading this across 4 queues we wound up with 4x the CPU capacity for this particular exchange. With 2000 queues you probably didn't run into that issue very often.
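The hashed-exchange trick described above (RabbitMQ ships a consistent-hash exchange plugin that does this broker-side) is essentially the following, sketched in-process with a plain modulo for illustration: each single-threaded queue gets its own slice of the load, so N queues buy you roughly N cores of capacity.

```python
N_QUEUES = 4

def shard_for(timestamp_ms):
    # Stable mapping of a message timestamp -> one of N queues, spreading a
    # single hot exchange's traffic across N single-threaded queue processes.
    return timestamp_ms % N_QUEUES

# Quick check that a steady stream of timestamps spreads evenly:
counts = [0] * N_QUEUES
for t in range(10_000):
    counts[shard_for(t)] += 1
```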
> Any time any of the consumers would slow down or fail, rabbit would run out of memory and crash, causing sitewide failure.
Not to be glib, but in any brokered system you have to have enough (memory and disk) buffer space to soak up capacity when consumers slow down, within reason. Older (2.x) RabbitMQs did a very poor job rapidly paging queue contents to disk when under memory pressure. Newer versions do better, but you can still run the broker out of memory with high enough ingress/low enough egress, which brings me to...
It sounds like you did not set your high watermarks correctly (another commenter already pointed this out); RabbitMQ can be configured to reject incoming traffic when over a memory watermark, rather than crash.
However, a couple of things can complicate this: rejection of incoming publishes on already-established connections may not make it back to your clients, if they are poorly behaved (and a lot of AMQP client libraries are poorly behaved) or are not using publisher confirms. Additionally, if your clients do notice that this is happening and continually reattempt to reconnect to RabbitMQ to handle the (actually backpressure due to memory) rejection notification, this connection churn can put massive amounts of strain on the broker, causing it to slow down or hang. In RabbitMQ's defense, connect/disconnect storms will damage many/most other databases as well.
> We experimented with ... "durability", which caused the cluster to crash more frequently and lose more messages
A few things to be aware of regarding durability:
Before RabbitMQ 3-point-something (I want to say 3.2), some poorly chosen Erlang IO-threadpool tunings caused durability to have higher latency than expected with large workloads. Anecdotally, the upgrade from 3.6 to 3.7 also improved performance of disk-persisted workloads.
If you have durability enabled, you should really be using publisher confirms (https://www.rabbitmq.com/confirms.html) as well. This isn't just for assurance that your messages made it; without confirms on, I've seen situations where publishers seem to get "ahead" of Rabbit's ability to persist and enqueue messages internally, causing server hiccups, hangs, and message loss. That's all anecdotal, of course, but I've seen this occur on a lot of different clusters. Pub confirms are a species of backpressure, basically--not from consumers to producers, but from RabbitMQ itself to producers.
When moving a high volume of non-tiny messages (where tiny is <500b), you really need a fast disk. That means the equivalent of NVMe/a write-cache-backed RAID (if on real hardware; ask me about battery relearns killing RabbitMQ sometime ... that was a bad night like the one you described), or paying attention to max-throughput/IOPS if deploying in the cloud (for example, a small EBS gp2 volume may not bring enough throughput, and sometimes you may need to RAID-0 up a pair of sufficiently-sized gp2's to get what you need). And no burst IOPS, ever.
> We experimented with "high availability" mode
You're 100% right about this. RabbitMQ's story in this area was pretty bad until recently. Quorum queues and lots of internal improvements have made the last ~4 years worth of the Rabbit versions behave better in HA configurations. But things can still get really dicey. Always "pause minority" (trade away your uptime for message loss), as the Jepsen article you linked mentioned.
For failure recovery (though it's not that "HA") if you can get single-node durability working well and are using networked disks (e.g. NFS, EBS) or a snapshotted-and-backed-up filesystem, one of the nice things about RabbitMQ's persistence format is that at the instant of crash, all but the very most recent messages are recoverable in the storage layer. That doesn't solve availability, but it does mean you don't have catastrophic data loss when you lose a node (restore a snapshot or reattach the data volume to a replacement server).
There's a different architecture for:
* one queue with billions of messages
* millions of queues with small numbers of messages per queue
* many queues with many messages per queue
There are also different topologies:
* Anyone can send a message to anyone (O(n^2) queues)
* One publisher with millions of subscribers
* One subscriber with millions of publishers
* Complex processing networks, where messages get routed in complex ways between processing nodes.
There are differences in timing:
* More-or-less instant push notifications
* Jobs which run within e.g. 5 minutes with polling
* Jobs which run in hours/days, with a cron-style architecture
And in reliability:
* Messages get delivered 100% of the time, and archived once delivered
* Messages get delivered 99.999% of the time, but might be dropped on a system outage
* ... all the way down to ephemeral pub-subs
... and so on.
I'd give my VP's right eye to get a nice chart of what supports what. For the most part, I've found build to be cheaper than buy, due to the lack of benchmarks and documentation for my use cases. So you build. You benchmark. You optimize. And things melt down.
My use case right now requires a large number of queues (eventually millions). I'd like to have an archival record of messages. Peak volume is moderate (several messages per second per queue), but usage patterns are sporadic (most queues are idle most of the time). Routing is slightly complex but not super-complex (typically about 30 sources per sink, at most 200; most sources only go to one sink, but might go to 2-3). Messages are relatively small (typically around 1k), but isolated messages might be much bigger (still <1MB, but not small).
My experience has been that when I throw something like that into pick-your-queue/pub-sub, things melt down at some point, and building representative benchmarks is a ton of work.
Likewise this configurability makes case specific benchmarks very awkward.
People will compare it to Kafka, claiming that its pubsub is faster than Rabbit's, but that's sort of missing the point: Rabbit thrives because it's easy to set up, will work well for 99% of cases, and handles nearly every kind of distributed problem you're likely to come across.
I recently did a project with Rabbit on my home server, and while the project had some issues, the issues were never Rabbit.
I've been using Rabbit in production for RPC and pub/sub for the past 5 years (single instance running on a non-dedicated VM, medium traffic) and it's been pretty easy to set up and pretty reliable in practice.
I've always been concerned about losing messages, and I did have to learn to turn on persistence and durability for messages to survive server interruptions, but it was easy enough. Message acknowledgements are also a nice feature, and Rabbit is able to achieve at-least-once messaging semantics.
That said, for most small to medium-large tasks, Rabbit will handle things without much trouble, making it a good fit for most common usecases.
Anyone considering RabbitMQ needs to read up on "network partitions": how to build your cluster to avoid them (odd number of nodes and pause_minority), your recovery strategy for when a network partition occurs (it will occur), your personal/organizational tolerance for message loss, and a plan for how you will upgrade your cluster at some later date (ensure you architect your application to handle whatever upgrade strategy you pursue).
There are definitely ways to operate so as to minimize these failures, but you SHOULD KNOW ABOUT THEM before you add this service to your environments.
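For reference, the partition-handling mode mentioned above is a one-line setting in the new-style rabbitmq.conf:

```conf
# rabbitmq.conf — trade availability for integrity on partition:
# the minority side pauses instead of diverging and losing messages on heal
cluster_partition_handling = pause_minority
```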
I'm looking to pitch adding a message queue to our infrastructure (at a .Net shop on Azure), and I'm sure there will be some questions about the comparisons between the two. Unfortunately, that's been tough to really track down.
From a ten thousand foot view, two or three node clusters running in non-prod environments on virtual machines running Windows. In Prod, three node clusters on Windows virtual machines.
All work to install and configure RabbitMQ is done manually. Sadly enough.
I'm on the application/architecture side of this equation but I know enough about our infrastructure to perhaps answer follow-ups or more specific questions.
Our application is single tenant (so each customer is deployed in their own isolated area) so we use virtual hosts to isolate each customer within the cluster.
>Also if you've done any comparison to Azure Message Queues and what were the pros and cons against Rabbit?
Definitely looked into the Azure native queueing options, but it's been a while. Azure Message Queues is an AMQP-compliant messaging system that seems fairly robust. To be transparent, I have no production experience with this product. If your company/department is into managing virtual machines then they might want/prefer to go with RabbitMQ. However, if they're into PaaS systems then I'd probably roll with Azure Message Queues and never look back.
:)
For those noting HA and scalability: it's not meant for use cases where (virtually infinite) horizontal scalability is the biggest concern. If you need horizontal scalability at massive scale, use Kafka. But for the majority of cases you can get away with limited scalability, and the prod setup, development experience, and reliability of RabbitMQ are unmatched in my experience.
Rabbit seems to be the right path, but I'm worried about scaling out, as many sources seem to point to Kafka as being more scalable (at least horizontally). I've been looking into Rabbit's Federation, but it's still not clear if that will solve the problem down the road.
Can anyone shine some light?
RabbitMQ and Kafka present very different struggles when thinking about scaling and performance. Kafka is almost a database of the messages that have been routed through the system. In many configurations clients can come back and demand to replay the message stream from almost any point in time. This means you need to handle _a lot_ of disk and memory access. With RabbitMQ, messages are traditionally very ephemeral. Once a message has been ack'd, it's gone. Poof. Not in memory. Not on disk. Nobody is going to come back asking for that message. This allows a lot more efficiency in per-message handling, but at the cost of not being able to remember the messages that went through the system a few milliseconds ago.
I have a system that only pushes 5k messages per second but it needs 32 cores.
If you are moving to a more "event-sourced" architecture, usually two main concerns (beyond basic operational stuff of uptime, scale, etc) are routing and long-term retention.
RabbitMQ has the routing but not the retention. Kafka can have the retention and the routing, but it can be complex/expensive. Apache Pulsar really shines here, as the API is pub/sub but it is underpinned by a log structure that gives you long-term retention (that doesn't need to be manually re-balanced); its flexibility does come with some operational complexity compared to RabbitMQ.
If your needs are pretty much just moving large amounts of data, Kafka is definitely the most mature and has a big ecosystem, but long-term retention is difficult and there are some sharp edges around consumer groups.
If you really really don't need long-term retention and need complex topologies, RabbitMQ is your best bet and is fairly reasonable to operate even up to fairly high message rates (~10k msgs/sec shouldn't be too hard to achieve)
There are a TON more options these days though: older, more Java-centric solutions like ActiveMQ and RocketMQ, or more "minimal" implementations like NATS, not to mention the hosted services on cloud providers.
Personally, I am a big fan of Apache Pulsar for its flexibility and some nice design choices, but I don't think there is any silver bullet in this space.
I think pulsar is wonderful, but I haven't had the chance to use it for anything serious / in production yet, so I'm curious what pain points you had.
We often lost entire queues because a small network blip caused RabbitMQ to think there was a network partition, and when the other nodes became visible again, RabbitMQ had no reliable way to restore its state to what it was. It has a bunch of hacks to mitigate this, but they don't solve the core problem; the only way to run mirrored queues ("classic mirrored queues", as they're now called) reliably is to disable automatic recovery, and then you have to manually repair RabbitMQ every time this happens. If you care about integrity, you can use the new quorum queues instead, which use a Raft-based consensus system, but they lack a lot of the features of the "classic" queues. No message priorities, for example.
I've never used federation or Shovel, which are different features with other pros/cons.
If you're willing to lose the occasional message under very high load, NATS [3] is absolutely fantastic, and extremely fast and easy to cluster. Alternatively, NATS Streaming [4] and Liftbridge [5] are two message brokers built on top of NATS that implement reliable delivery. I've not used them, but heard good things.
[1] https://www.rabbitmq.com/partitions.html
[2] https://www.rabbitmq.com/quorum-queues.html
[3] https://nats.io/
I can offer a similar anecdote: we started seeing RabbitMQ report alleged cluster partitions in production after enabling TLS between RabbitMQ nodes, where manual recovery was needed each time.
After a bit of investigation we noticed that the cluster partitions seemed to correlate with sending an unusually large message (think something dumb like 30 megs) through RabbitMQ when TLS between nodes was enabled. What I believe was happening: RabbitMQ was so busy encrypting/decrypting the large message that it delayed sending or receiving heartbeats, and the cluster then falsely assumed there had been a network partition.
We mitigated the issue by rewriting the system to not send 30-meg messages. There was only one producer that sent messages anywhere near that large, and after a bit of thought we realised it was not necessary to send any message at all in that case (the large message was a hack around another old system's performance problem that had been fixed properly a year back, but the hack that generated the huge message was still in place).
Nowadays? It's actually quite simple to set up and works pretty well (source: I know two different companies that set up clustering recently and both had good experiences with no downtime).
1) Much easier to implement and maintain for small-to-medium architectures. However, the war stories I've heard suggest it starts to become a hassle for large clustering architectures.
2) Because it's a traditional message broker, the input and output ends, which I was responsible for, were much simpler to write because I didn't have to worry about replays when it came back online. Rabbit knows which client it has already routed to and where messages went. Kafka is not that sophisticated in that regard. Kafka has been described as "dumb broker/smart clients" while Rabbit is "smart broker, dumb clients."
3) The scaling. Rabbit is very scalable. Once you get to the Uber/PayPal level (like, a couple of million writes per second), then Kafka becomes the obvious choice. Rabbit handles thousands of writes per second just fine. However, at that second company, like many others, they thought they'd have to suck up all the data, so of course Kafka was the more scalable tool long term. Spoiler: we were never, ever close to PayPal-level transactions. If the size of the sun represents PayPal/Uber transactions, we were basically Manhattan.
There will be times when you lose offsets or when you actually want to replay every message, so take an hour and figure out what that means to your app. It's usually only a few lines of code in your consumer that compares source timestamps, but it's by far the most beneficial thing you can do when working with Kafka in my experience.
It's also relatively easy to hit "tens of thousands" messages/second, especially in replay or bootstrapping scenarios, and that's when Kafka becomes useful to the non-FAANG companies.
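The "compare source timestamps" safeguard mentioned above might look something like this sketch: keep an application-side checkpoint of the last applied event time and skip anything older during a replay or bootstrap (field names here are illustrative, not from any particular Kafka client):

```python
# Last applied source timestamp, loaded from the app's own storage on startup.
checkpoint_ts = 1_000

def should_apply(event):
    # During a replay, events at or before the checkpoint were already
    # applied; skip them so reprocessing the log is harmless.
    return event["source_ts"] > checkpoint_ts

events = [{"source_ts": t} for t in (500, 1_000, 1_500)]
applied = [e for e in events if should_apply(e)]
```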
I've seen quite a lot messages going through RabbitMQ. I wouldn't worry too much about scaling, because the possibilities depend very much on the architecture. With some tuning RabbitMQ can take you a long way. I would give clustering a go and see where the limits are before exploring more complicated architectures like federation.
The clustering might look tempting but it hasn't been resilient for me in the face of janky networks. Split brains and data loss can result.
In the past I've scaled my rabbits for throughput by implementing my own routing/sharding layer.
If you're tempted to use the message persistence and you care about retaining messages, kafka is a bigger but much more capable hammer.
By default it only retains non-acked messages, multiple subscription modes, can use non-persistent messaging, dead letter queue, scheduled delivery, can use Pulsar Functions to implement custom routing etc.
Scales like Kafka (probably better) and has cluster replication built in.
Kafka creates the abstraction of a persistently stored, offset-indexed log of events. You read all events in a topic. Kafka can be used to distribute messages in the way AMQP is used, but is more likely to be the centerpiece of an architecture for your entire system where system state is pushed forward/transformed by deterministically processing the event logs.
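One way to picture "state pushed forward by deterministically processing the event log" is a fold over the events. A toy sketch (the account events and `apply` function are invented for illustration; in Kafka the events would come from a topic partition):

```python
from functools import reduce

# Hypothetical ledger events, standing in for a Kafka topic.
events = [
    {"type": "deposit", "amount": 100},
    {"type": "withdraw", "amount": 30},
    {"type": "deposit", "amount": 5},
]

def apply(state: int, event: dict) -> int:
    # Deterministically advance the state by one event.
    if event["type"] == "deposit":
        return state + event["amount"]
    return state - event["amount"]

# Replaying the full log from offset 0 always rebuilds the same state.
balance = reduce(apply, events, 0)
```

Because `apply` is deterministic, any consumer replaying the same log reaches the same `balance` — that reproducibility is what makes the log the "centerpiece" rather than a mere transport.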
Some of that joy is surely just moving from older, creakier solutions. But it hasn't let us down, and everyone is eager to use it for new features or refactoring legacy code.
It’s basically caught between being too bloated and complex for use with smaller systems (as some commenters have poked at people for not being the ‘right’ kind of person to be running it)
While at the same time, it’s not robust and reliable enough to use in prime time.
What’s left is this enticing and sexy sounding message broker called RabbitMQ that actually just sort of sucks.
In my experience someone gets stoked on trying this out, but once everything is implemented it disappoints; the system or service it is a part of ends up a one-off, and future services use something more mature the next time around.
For scale I have used NSQ to handle millions of messages a second, and for smaller scale, AWS services like SQS can handle things much more reliably.
Any ideas on how to even debug this type of thing? Help! We think it might be a tcp connection failure but we have no idea.
Depending on your scale, we find SQS is cheaper than a managed rabbit service. Although I'd be interested in using kafka!
1. If you have large messages and use keepalives (and you'll need keepalives), you need to write your own message fragmentation.
2. There are no python libs that just work. I'm currently using a vendored version of amqpstorm with a bunch of hacks to handle wedged connections. I have some AMQP connections that are intercontinental, and I've been able to wedge literally every other AMQP library.
3. If you have a single open connection, it will get stuck from time to time. With a bunch of both in-band and out-of-band keepalives, I've got it to the point where things don't permanently block, but you should expect things getting stuck for ~2x your heartbeat interval periodically. This doesn't seem to result in message loss. I've dealt with this by just running LOTS of concurrent connections and aggregating them client side. This has worked fine.
4. In general, exactly-once delivery isn't a thing. You should design either for at-most-once, or at-least-once delivery modes exclusively. Idempotency is your friend.
5. The tooling /around/ the rabbitmq server is a dumpsterfire.
Basically, I feel like the core server is super durable (note: I'm not running a cluster, so this doesn't generalize to multi-instance cases), but the management stuff is god-awful. The main management CLI tool actually calls the HTTP interface, which is kind of ridiculous. I've occasionally run into a situation where I wound up with leaking temporary exchanges, and just flushing bogus exchanges is super annoying.
I don't think there are any other options that can do what RabbitMQ does for my use case, but it's had quite the learning curve.
I'm confused by what you mean by that. Do you mean "large" as in "take a long time to process in the consumer"? If so, and if your consumer is not issuing heartbeats concurrently with message processing, then that is true.
> There are no python libs that just work.
Completely agree. Having hacked on and patched the code inside Celery, it's really quite a bummer. I think this is because the Python libs try to abstract over things that ... just straight up can't be abstracted away given the semantics of AMQP: specifically connection-drop-detection, "resumption" of a consume (not really possible; this isn't Kafka), and the specific error code classes (connection-closed vs channel-closed vs information).
> If you have a single open connection, it will get stuck from time-to-time.
Are you talking about publishing connections? Consuming connections? One used for both? What does "stuck" mean? I'd be interested in hearing more about this.
> exactly-once delivery isn't a thing
Kinda pedantic, but exactly once delivery is possible in some very restricted situations (see Kafka's implementation of this guarantee: https://www.confluent.io/blog/exactly-once-semantics-are-pos...). Exactly once processing is what's tough-née-impossible. So yeah, idempotence is great.
By large, I mean 10+ MByte.
> Completely agree. Having hacked on and patched the code inside Celery, it's really quite a bummer.
I don't understand what the point of celery is. Literally everything I do requires /some/ persistent state in the workers, and there's no way to do that with celery.
> Are you talking about publishing connections? Consuming connections? One used for both? What does "stuck" mean? I'd be interested in hearing more about this.
TCP connections. As in, a connection to the server from a consumer. High latency connections seem to exacerbate the issue.
I think the issue is the state machines server-side and client-side get out of sync, and things just stop until the keep-alives/heartbeat cause the connection to reset, but that's a bunch of time to wait with no messages.
I also ran into the issue that basically every python library had at least one or two locations where `read()` was called without a timeout, but that was at least easier to fix.
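The missing-timeout fix looks roughly like this with the stdlib — a minimal sketch, not any particular library's patch; the wrapper name is invented:

```python
import socket

def read_with_timeout(sock: socket.socket, n: int, timeout: float = 5.0) -> bytes:
    # A wedged broker connection now raises socket.timeout instead of
    # blocking the consumer forever on a bare recv().
    sock.settimeout(timeout)
    try:
        return sock.recv(n)
    finally:
        sock.settimeout(None)  # restore blocking mode

# Demo: the peer never writes, so the read gives up after 0.1s.
a, b = socket.socketpair()
try:
    read_with_timeout(a, 1, timeout=0.1)
    timed_out = False
except socket.timeout:
    timed_out = True
finally:
    a.close()
    b.close()
```

The caller then gets to decide whether a timeout means "retry" or "tear down and reconnect", instead of hanging with no signal at all.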
> Kinda pedantic, but exactly once delivery is possible in some very restricted situations (see Kafka's implementation of this guarantee: https://www.confluent.io/blog/exactly-once-semantics-are-pos...). Exactly once processing is what's tough-née-impossible. So yeah, idempotence is great.
Well, it isn't really a thing, so you at least shouldn't depend on it being a thing for your architecture if possible.
Kafka is a stream, and can be replayed (if you have it set up to store stuff). Rabbit is simply a queue, and when the messages are gone, they're gone.
This means that queues are a lot smaller, but can only serve one set of consumers at a time. If you want multiple things listening to messages, you have to use fan-out patterns that place messages on multiple queues. Queues can also suffer from less-than-atomic delivery, especially if the system is distributed. This means you have to jump through some hoops and add an atomic layer somewhere if you want to ensure you're not double-processing anything.
Kafka can have infinite retention (if you've got the storage/$), and you don't need multiple streams to service multiple consumers. Each consumer stores where it is in the stream and can traverse as needed. You'll need to be careful to ensure that a single consumer handles a single partition if you want to promise that you'll only process a message once.
Managing streams can be a headache, but less so now if you have money to have Amazon or Confluent manage it for you. They offer pretty much unlimited scalability, and are the production grade solution for a ton of problems.
Queues are really simple to understand and build, and they still scale pretty dang well. Just make sure your message processing is idempotent and that you can handle a message being processed multiple times.
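The simplest form of that idempotency is dedup by message id. A sketch, assuming producers attach a stable unique `id` to every message (that field is this example's convention):

```python
processed_ids = set()

def handle(message: dict) -> bool:
    # At-least-once delivery means redeliveries happen; dedup by id makes
    # the side effects happen at most once per unique message.
    if message["id"] in processed_ids:
        return False  # duplicate redelivery, safely ignored
    processed_ids.add(message["id"])
    # ... real side effects would go here, exactly once per id ...
    return True
```

In production the seen-set would live in something persistent with a TTL (a database or Redis, say) so it survives worker restarts and doesn't grow without bound.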
RabbitMQ has excellent support for complex message flow topologies. Kafka out of the box does not provide these features.
Jack works with me on the RabbitMQ core engineering team. We've been hard at work to address a lot of the issues brought up in comments here. It's worth it to try out our latest releases. The engineering team is very active with the community and takes all constructive, helpful (i.e. reproducible) feedback seriously. Feedback is encouraged via the rabbitmq-users mailing list. Thanks.
Ended up looking into `rq` and `arq` which were both excellent!
https://arq-docs.helpmanual.io/
Would recommend if you're looking for a (faster) worker queue without all the overhead (in my case, didn't need all the other features that came w/ RabbitMQ so this got the job done).
ZeroMQ was born out of a frustration with complex routing patterns and the need for a broker-less architecture for maximal performance message delivery.
RabbitMQ is a messaging system. ZeroMQ is sockets on steroids.
Our company has been using ZeroMQ for over 8 years. We'll be putting out another ZeroMQ-based open source project soon too.
If you use the brokerless model, there was a bit of drama over ZeroMQ — the original technical developer (Martin Sustrik) left and created a successor, nanomsg, with what he learned. At some point, Martin lost interest, and Garrett D’Amore took over maintenance and did a rewrite called nng. Both the old nanomsg and nng are maintained, with nng being somewhat actively developed, but also fairly “complete”, so there’s not a lot of excitement like you see with some projects. ;) nanomsg and nng are essentially wire-compatible, so you can mix and match depending on bindings availability for your language.
ZeroMQ certainly isn't perfect; for example, there's no way to tell if a message was successfully written to a PUB socket or if it was dropped (just one minor issue):
https://stackoverflow.com/questions/13891682/detect-dropped-...
Anyway, this is digressing from the main topic.
We've been working with ZeroMQ a lot over the past couple of years, and have gotten to know some of the maintainers -- we've been very favorably impressed by their ability and dedication.
Pieter Hintjens was the "voice" of ZeroMQ, and with his passing things have gotten a bit quieter, but no less active. (Just take a look at the commit log: https://github.com/zeromq/libzmq/commits/master).
My experiences were pretty mixed. Overall I found it to be more difficult than I would have wanted to get simple things to work. Part of this seems to be a problem with the Java library, which is not great. For example, IIRC you have to be really careful not to create the same queue twice, even with identical configurations, since the second time something blows up. At the end of the day just a simple fan-out configuration ends up involving a lot of somewhat-intricate code. It definitely does not Just Work (TM).
And then there were the bizarre hangs that I would experience during testing. I set up a Docker Compose configuration so that I could test the various parts of the system independently. It included one container running RabbitMQ to simulate the cluster we have running on our cloud.
Usually tests ran fine. But then, from time to time, the client would just hang trying to send a message through RabbitMQ. Unfortunately, again, the code you need to just run a basic configuration using RabbitMQ is complex enough that at first I was pretty sure that I had done something wrong. But after a few hours of increasing frustration I finally broke down and discovered that a simple test case that just sent a single message using code torn right out of the docs would hang. Forever. (Or, long enough that I gave up waiting.)
After a lot of digging I found the culprit. RabbitMQ will just take its ball and go home if the broker doesn't have enough disk space. Given that I use Docker heavily for a lot of projects, the amount available to new containers would vary a lot depending on what other data sets I had loaded or how recently I had run docker system prune.
I filed an issue about this, asking to have a better error message displayed when an attempt to send a message was made. The response was: there's already an error message, printed during startup. You didn't see it? No. I must have missed it among the hundreds of other lines of output that RabbitMQ spews when it starts.
Overall my favorite part of this story is that RabbitMQ chooses to start but refuse to send messages when low on disk space, when just crashing would be much more useful and make it much easier to pinpoint what was going on.
Anyway, I'm in the market for a simpler alternative that's Kotlin friendly.
https://semaphoreci.com/blog/2017/03/07/making-mailing-micro...
The only pitfall is the available libs. Especially with the .NET implementation we had quite a lot of trouble. It's not following current .NET patterns and has strange quirks. Does anyone know a good alternative to the "official" one?
It would be great to get specific, actionable feedback about your experience, either via a message to the rabbitmq-users mailing list or via a GitHub issue. The .NET client is an old library, but considerable improvement effort went into version 6.0. The plan for 7.0 is to address the old patterns that remain in the library. Feedback would help guide that effort.
I just released version 6.1.0-rc.1 and would appreciate testing if you have time. Thanks!
If the library were being designed from scratch today, pretty much every method on the model would be Async. After all, if it leads to any network I/O of any kind, that can block.
Working with the current public API, trying to implement a publish wrapper that never blocks and returns a task that either completes when the publisher confirm is received or faults after some provided timeout is a lot trickier than it might sound.
Recovery from network interruptions is complicated, auto-recovery features are limited, and in some use cases they are actually dangerous. For example, if you are manually acknowledging messages to ensure end-to-end at-least-once delivery, then you cannot safely use the auto-recovery, since the delivery numbers reset when the connection does, and you can accidentally acknowledge the wrong message with delivery tag 5 (acknowledging the new one when you were trying to ack the old one).
In my implementation, which included my own recovery, I ended up needing to pass around the IModel itself with the delivery tags, so I could check whether the channel I was about to acknowledge on was really the same one I received the message on. (There is no unique identifier for a channel instance, since even the channel number is likely to get reused.)
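The channel-identity check can be sketched language-agnostically. Everything here (the wrapper, the per-instance uuid, the names) is this sketch's invention, not part of any client library — it just illustrates pairing a delivery tag with the *instance* it came from, since channel numbers get reused across reconnects:

```python
import itertools
import uuid
from dataclasses import dataclass

@dataclass(frozen=True)
class TaggedDelivery:
    channel_id: uuid.UUID   # identity of the channel instance, not its number
    delivery_tag: int

class ChannelWrapper:
    def __init__(self):
        # Fresh identity for every (re)opened channel instance.
        self.id = uuid.uuid4()
        self._tags = itertools.count(1)

    def next_delivery(self) -> TaggedDelivery:
        return TaggedDelivery(self.id, next(self._tags))

    def safe_ack(self, d: TaggedDelivery) -> bool:
        # Refuse to ack a tag that belongs to a different channel instance:
        # after a reconnect, tag 5 on the new channel is a different message.
        return d.channel_id == self.id

old_channel = ChannelWrapper()
delivery = old_channel.next_delivery()
new_channel = ChannelWrapper()  # simulates a reconnect
```

Dropping the stale ack (rather than sending it on the new channel) is the safe choice: with at-least-once semantics the unacked message will simply be redelivered.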
You don't have to pool connections as channels are multiplexed by them.
Things to watch out for:
- opening too many channels: these map to Erlang processes and can overwhelm your server if you go over ulimits
- sharing consumer channels between threads: you might see weird behavior (e.g. acking the wrong messages)
We've built our own library/framework for creating resilient consumers; it enforces a 1:1 mapping between channels and consumer threads, as well as automatic reconnections and channel cleanup.
The general takeaway from this should be: if you've got a particular stream of messages (either a producer or a consumer) that pushes many thousands or even tens of thousands of messages per second, use a separate TCP connection. For anything else that is slower (dozens of messages per second), multiple channels on the same connection work great.
One last consideration: when a given channel misbehaves or you perform an operation the broker doesn't like, the only recovery I've seen is to shut down the entire connection, which can affect other channels on the same connection.
Let's say you've got one exchange and one main queue for processing: jobs.exchange and jobs.queue respectively.
If you need to schedule something for later, you'd assert a new queue with a TTL for the target amount of time (scheduled-jobs-<time>.queue). Also set an expiry of some amount of time, so it'd get cleaned up if nothing had been scheduled for that particular time in a while. Finally, have its dead-letter-exchange set to jobs.exchange.
This could lead to a bunch of temporary queues, but the expiration should clean them up when they haven't been used for a bit.
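The declare arguments for such a holding queue look roughly like this. The `x-message-ttl`, `x-dead-letter-exchange`, and `x-expires` keys are standard RabbitMQ queue arguments; the naming scheme and the 10x expiry factor are just this comment's convention:

```python
def delayed_queue(delay_ms: int, target_exchange: str = "jobs.exchange"):
    # Name and declare-arguments for a per-delay holding queue: messages
    # sit here for delay_ms, then dead-letter into the real jobs exchange.
    name = f"scheduled-jobs-{delay_ms}.queue"
    args = {
        "x-message-ttl": delay_ms,                  # hold messages this long
        "x-dead-letter-exchange": target_exchange,  # then route them onward
        "x-expires": delay_ms * 10,                 # drop the queue when idle
    }
    return name, args
```

With pika you'd pass `args` as the `arguments=` parameter to `queue_declare`; the same dict works with any AMQP client that exposes queue arguments.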
https://docs.celeryproject.org/en/v2.3.3/userguide/periodic-...
I really don't get why they don't publish at least the windows binaries with their github releases.
You move this to a queue, and have a worker chop that data file up into individual records, those records go onto a queue, and you can process them however you want, no worries about something crashing and not being able to be retried. If the database goes down, everything just pauses until it can go again. You can limit the queue throughput to whatever you want to avoid having to scale your API/Database.
Can you handle stuff via all CRUD sync APIs? Sure, just like you could handle running a restaurant where you have one person who takes the order and cooks it and delivers it to a table. However, it's more efficient to have a waiter (API) take requests and give them to a cook (queue based async worker) to handle stuff that's not as time sensitive. This saves you a lot of money in certain situations.
So, for example, you would have a CRUD that takes requests, and when there is background work to be done, places a message on the queue, and immediately returns to the user. This frees up the server for more requests. Meanwhile in the background, a worker process chugs through the queue and does its work. During long spikes it will take longer to get through the queue, but your end users will not have disruption of service.
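The shape of that pattern, stripped to the stdlib (an in-process sketch; a real deployment would use a broker-backed queue and separate worker processes — all names here are illustrative):

```python
import queue
import threading

jobs = queue.Queue()
completed = []

def api_handler(payload: dict) -> str:
    # The web tier only enqueues and returns; no slow work in-request.
    jobs.put(payload)
    return "accepted"

def worker():
    # Background worker chugs through the queue at its own pace.
    while True:
        job = jobs.get()
        if job is None:        # shutdown sentinel
            break
        completed.append(job)  # stand-in for the real slow work

t = threading.Thread(target=worker)
t.start()
status = api_handler({"task": "send_email"})
jobs.put(None)
t.join()
```

During a spike, `jobs` simply grows and the worker falls behind for a while, but the `api_handler` response time stays flat — which is the whole point.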
RabbitMQ itself is great, but there are some downsides to this architecture:
* Lots of tooling (for blue/green deployments, load balancing, autoscaling, service meshes etc.) assumes HTTP(s)+JSON or GRPC these days
* Getting people who aren't deep into software engineering to write a service that connects to RabbitMQ has a much higher perceived hurdle than making them write a HTTP service
* Operations is different than with HTTP-based services, and many operators aren't used to it
TL;DR: it's more of a niche product for inter-service communication, which comes with all of the problems that niche products typically face.
Also, see the following articles:
Laika - https://www.rabbitmq.com/blog/2019/12/16/laika-gets-creative...
Bloomberg - https://tanzu.vmware.com/content/rabbitmq/keynote-growing-a-...
Goldman Sachs - https://tanzu.vmware.com/content/rabbitmq/keynote-scaling-ra...
Softonic - https://www.cloudamqp.com/blog/2019-01-18-softonic-userstory...
There are some big companies talking about their experience.
Reader Mode is the answer. The creator of that deserves a Nobel Prize.