The fixed priority systems the article talks about trade off optimal "capacity" utilization for understandable failure dynamics in the overcapacity case. When you're over capacity, the messages that don't go through are the messages with the lowest priority. It's a nice property and simple to implement.
What the article proposes is better known as deadline scheduling. That's also fine and widely used, but it has more complicated failure dynamics in the overcapacity case. If your problem domain doesn't have an inherent "priority" linked to the deadlines, that may be acceptable, but in other cases it may not be.
Neither is inherently better, and there are other approaches with yet different tradeoffs.
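To make the contrast concrete, here's a minimal sketch (my own illustration, not from the article): both schemes can use the same heap, and the entire difference is the sort key — a static priority versus an absolute deadline.

```python
import heapq

def push(queue, key, payload):
    heapq.heappush(queue, (key, payload))

def pop(queue):
    return heapq.heappop(queue)[1]

# Fixed priority: key is a static priority (lower = more important).
# Under overload, the lowest-priority messages are the ones starved.
pq = []
push(pq, 1, "billing")
push(pq, 9, "analytics")

# Deadline (EDF) scheduling: key is an absolute deadline.
# Under overload, lateness spreads across whichever jobs are due soonest,
# regardless of how "important" they are.
dq = []
push(dq, 1000.0, "billing")    # due at t=1000
push(dq, 950.0, "analytics")   # due earlier, so it runs first
```

Under light load the two behave similarly; the divergence only shows up once the queue stops draining.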
I suspect you ultimately need a little language for this. Decoding priorities for frames in a video stream are already complex, and those are trivial compared to the sorts of scenarios Conway's Law introduces.
> I don’t think it’s possible to meet a latency target through prioritization when there is a fundamental lack of capacity.
This seemingly implies that it would be achievable with target deadlines/SLOs. At capacity I don't see a better solution than to give each priority (or target latency) a defined minimum resource allocation.
The “pain” experienced in an overload situation is spread among all late jobs. Contrast this with fixed-priority scheduling, where the lowest-priority jobs will be starved completely until the overload is resolved.
[0] https://en.m.wikipedia.org/wiki/Scheduling_(computing)
[1] https://onlinelibrary.wiley.com/doi/book/10.1002/97811189844...
Let's say you have two types of jobs: one that is highly parallelizable (e.g. a request to a third-party API), and one that doesn't parallelize well (video transcoding using 100% of available CPU). Then you'd want a lot of threads assigned to each CPU for the first type, but very few for the second. If they're on the same queue, you can have a few threads and be blocked by job type one most of the time, or have a lot of threads and get tons of context switching when the second type of job dominates. Either way your throughput will be very bad.
My current idea is to partition job types first by their degree of parallelizability (as in Amdahl's law), then by priority if necessary.
Has anyone tried this?
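A rough sketch of the partitioning idea, assuming a simple two-way split (the pool sizes and the `submit` helper are illustrative, not a known framework): IO-bound jobs spend most of their time blocked, so many threads per core are cheap, while CPU-bound jobs want roughly one thread per core to avoid context-switch thrash.

```python
import os
from concurrent.futures import ThreadPoolExecutor

cpu_count = os.cpu_count() or 4

# Highly parallelizable jobs (e.g. third-party API calls): mostly blocked on IO,
# so oversubscribe threads heavily relative to cores.
io_pool = ThreadPoolExecutor(max_workers=cpu_count * 16)

# Poorly parallelizable, CPU-bound jobs (e.g. transcoding): about one thread
# per core, to avoid thrashing.
cpu_pool = ThreadPoolExecutor(max_workers=cpu_count)

def submit(job, parallelizable):
    # Route by degree of parallelizability first, as suggested above.
    pool = io_pool if parallelizable else cpu_pool
    return pool.submit(job)

f = submit(lambda: sum(range(1000)), parallelizable=False)
result = f.result()
```

Within each partition you could then order by priority, as the comment suggests.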
My queuing theory notes and notation are here:
Most queue processing services that I have seen have an alarm on (a) oldest message age, and (b) number of messages in the queue.
In every team I joined I have quickly added a custom metric (c) that subtracts the time of successful processing from the time that a message was /initially/ added to the queue. This metric tends to uncover lots of nasty edge cases regarding retries, priority starving, and P99 behavior that are hidden by (a) and (b).
Having 100000 messages in the queue is only an issue if they are not being processed at (at least) 100000/s. Having a 6-hour-old message in the queue is concerning, but maybe it is an extreme outlier, so alarming is unnecessary. But you can bet your bottom dollar that if your average processing latency spikes by 10x that you want to know about it.
The other thing that is nice about an end to end latency metric is that (a) and (b) both tend to look great all the way up to the point of failure/back pressure and then they blow up excitingly. (c) on the other hand will pick up on things like a slight increase in application latency, allowing you to diagnose beforehand if your previously over-provisioned queue is becoming at-capacity or under-provisioned.
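A minimal sketch of metric (c) as described above (the field and metric names are made up for illustration). The key detail is that retries must carry the original enqueue timestamp forward rather than resetting it, or the starvation and retry pathologies stay invisible.

```python
import time

def on_enqueue(message):
    # Set only once: re-enqueues on retry keep the /initial/ timestamp.
    message.setdefault("first_enqueued_at", time.time())

def on_success(message, emit_metric):
    # End-to-end latency: successful processing time minus initial enqueue time.
    emit_metric("queue.end_to_end_latency",
                time.time() - message["first_enqueued_at"])

msg = {"body": "work"}
on_enqueue(msg)
on_enqueue(msg)  # simulated retry: timestamp unchanged
samples = []
on_success(msg, lambda name, value: samples.append((name, value)))
```

Alarming on a spike in this value catches the "slight increase in application latency" case long before (a) or (b) blow up.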
I was just talking with a Temporal solutions engineer this week and this metric is their recommended one for autoscaling on. Instead of autoscaling on queue depth, you scale on queue latency! Specifically for them they split up the time from enqueue to start, and then the time from start to done, and you scale on the former, not the total ("ScheduleToStart" in their terms).
I had much better results with a metric that shows estimated queue time for jobs that are getting enqueued right now (queue_size * running_avg_job_processing_time / parallelism).
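Spelled out as code (argument names mirror the formula in the comment; they're illustrative):

```python
def estimated_queue_time(queue_size, running_avg_job_processing_time, parallelism):
    # Expected wait for a job enqueued right now, assuming steady-state
    # processing and a roughly stable average job duration.
    return queue_size * running_avg_job_processing_time / parallelism

# e.g. 1200 queued jobs, 0.5s average each, 10 workers -> 60s of queue time
wait = estimated_queue_time(1200, 0.5, 10)
```

The nice property is that it reacts both to queue growth and to jobs getting slower, unlike raw queue depth.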
Aha! Just as the second season of Loki dropped. Makes sense now
Less sarcastically: this ties in with the article, I guess. The run_at time is the enqueue, and then you are arguing for two latencies — time from enqueue to start, and from start to complete.
Enterprise processes can wind between many intermediaries. Hours, days, weeks, maybe even months.
I used to love chained callbacks when I was 16, and later I thought threads were the greatest, and I've written a bunch of device drivers that operate at different IPLs.
But 20 years ago a cofounder made me realize that a long polling loop is easier and faster to write, and much easier to understand than threads. That insight has made countless projects simpler and easier and I recommend considering it. You may be surprised, as I was.
I am obviously referring only to operations that do actual IO or wait for work on other threads.
If you're working with the JVM, you can write simpler blocking code that is also performant through virtual threads.
Then it contains logic to handle events.
IMHO, the problem is that it's really hard for people to think in terms of "I want my job that takes 10 CPU seconds to be done in 300 wall-clock seconds". What batch-processing frameworks do, in turn, is estimate these things and figure out where to place work. You can also do things like deny requests when there isn't capacity (because you know all the scheduled work for the next quanta).
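The admission-control half of that could look something like this hypothetical sketch (the `Admitter` class and its budget model are my own illustration): accept a job only if the CPU-seconds already committed for the next quanta leave room for it.

```python
class Admitter:
    """Shed load up front instead of missing deadlines later."""

    def __init__(self, capacity_cpu_seconds):
        self.capacity = capacity_cpu_seconds
        self.committed = 0.0

    def try_admit(self, job_cpu_seconds):
        # Reject if this job would overcommit the quanta's CPU budget.
        if self.committed + job_cpu_seconds > self.capacity:
            return False
        self.committed += job_cpu_seconds
        return True

a = Admitter(capacity_cpu_seconds=100.0)
first = a.try_admit(60.0)    # fits in the budget
second = a.try_admit(50.0)   # would overcommit: rejected
```

Real frameworks would also decay `committed` as each quanta passes; this only shows the accept/deny decision.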
For example, suppose you have a burst of 1-hour-latency jobs, each of which takes 10 minutes to process. It will not take many of these to consume all available workers.
If that burst is followed by a single high-priority, 10s-latency job, welp: that job's latency objective will not be met, since the soonest a worker will free up to take the work is 10 minutes.
So I think the ideal worker pool design does include some amount of reserved capacity for low-latency work.
A general-purpose worker can of course grab low-latency work if it's idle! But the reverse is not true: an idle low-latency worker should not be picking up any long-running job.
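The asymmetric dispatch rule above fits in a few lines; this is a minimal sketch (queue and function names are illustrative) rather than a real worker loop.

```python
from collections import deque

low_latency = deque()
long_running = deque()

def next_job(worker_is_reserved):
    # Everyone prefers low-latency work when it's available.
    if low_latency:
        return low_latency.popleft()
    if worker_is_reserved:
        # Reserved capacity stays idle rather than getting stuck
        # on a 10-minute job when a 10s-SLO job might arrive.
        return None
    return long_running.popleft() if long_running else None

long_running.append("transcode")
reserved_pick = next_job(worker_is_reserved=True)    # stays free
general_pick = next_job(worker_is_reserved=False)    # takes the long job
```

The cost is some deliberately wasted capacity; the benefit is a bounded worst case for the low-latency class.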
> # If workers are busy, job2 will be run first in 11 minutes
> # If workers are too busy, both jobs will exceed their max latency.
So... priorities for tasks in a background queue.
I agree explicit latency tolerance is often a great way to do this - it lets you know what you can relax and reschedule, and if it's far enough in the future you can predict load / scale preemptively. Plus it works without having to deal with defining who gets what priority integer (or when to subdivide floats). But it degrades to the same behavior as priorities.
Thanks to all commenters for sharing their experiences and constructive opinions. It shows that this post is incomplete and far from perfect, so I just wrote a post-scriptum to improve it a bit for future readers.
https://alexis.bernard.io/blog/2023-10-15-background-job-que...
A job needs two attributes to define when it should be started: run_at and max_latency. That means the job worker only needs to order jobs by run_at + max_latency and take the first. It seems both flexible and simple.
Just considering two jobs (run_at=10,max_latency=15), (run_at=11,max_latency=13), it's clear that following that approach, the first task would be unnecessarily blocked by the second, or you'd run jobs earlier than run_at specified.
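Working through those numbers: job1's deadline is 10+15=25, job2's is 11+13=24, so the proposed ordering puts job2 first even though at t=10 only job1 is eligible to run. A few lines make the dilemma visible:

```python
jobs = [
    {"name": "job1", "run_at": 10, "max_latency": 15},  # deadline 25
    {"name": "job2", "run_at": 11, "max_latency": 13},  # deadline 24
]

# The proposed ordering: job2 sorts ahead of job1.
ordered = sorted(jobs, key=lambda j: j["run_at"] + j["max_latency"])

# But at t=10, only job1 has reached its run_at. A worker that strictly
# follows the sorted order must either idle until t=11 to run job2 first
# (blocking job1), or start job2 before its run_at.
now = 10
eligible = [j["name"] for j in jobs if j["run_at"] <= now]
```

So a single sort key can't capture both "not before run_at" and "not after run_at + max_latency" at once.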
Which is not to say I agree with the claim that latency queues and priorities can't achieve latency goals. Your hard requirements establish a minimum viable capacity, and you fill in the bubbles with softer work. Priorities let you distinguish between hard and soft, and offer fairness among the soft.
You need either dedicated workers for low latency tasks or some sort of preemption to meet SLOs with such heterogeneous tasks.
There's always an autoscaling delay, but Rails itself (and the community) doesn't seem to fit the serverless paradigm well, which is why these questions about how to design your queues come up.
I think a lot of Lambda or Cloud Run developers would instead say "well, my max instances is set to 500; I'm pretty sure I'll break something else before I hit that", you know? Especially when using the cloud's nice integrations between their queues and their event-driven serverless products, it's super easy to get exactly as much compute as you need to keep your latency really low.