The fixed priority systems the article talks about trade off optimal "capacity" utilization for understandable failure dynamics in the overcapacity case. When you're over capacity, the messages that don't go through are the messages with the lowest priority. It's a nice property and simple to implement.
What the article proposes is better known as deadline scheduling. That's also fine and widely used, but it has more complicated failure dynamics in the overcapacity case. If your problem domain doesn't have an inherent "priority" linked to the deadlines, that may be acceptable, but in other cases it may not be.
Neither is inherently better, and there are other approaches with yet different tradeoffs.
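To make the contrast concrete, here's a minimal sketch (my own illustration, not from the article): both schemes can use the same heap, and the entire difference is the sort key — a static priority versus an absolute deadline.

```python
import heapq

def push(queue, key, payload):
    heapq.heappush(queue, (key, payload))

def pop(queue):
    return heapq.heappop(queue)[1]

# Fixed priority: key is a static priority (lower = more important).
# Under overload, the lowest-priority messages are the ones starved.
pq = []
push(pq, 1, "billing")
push(pq, 9, "analytics")

# Deadline (EDF) scheduling: key is an absolute deadline.
# Under overload, lateness spreads across whichever jobs are due soonest,
# regardless of how "important" they are.
dq = []
push(dq, 1000.0, "billing")    # due at t=1000
push(dq, 950.0, "analytics")   # due earlier, so it runs first
```

Under light load the two behave similarly; the divergence only shows up once the queue stops draining.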
I suspect you ultimately need a little language for this. Decoding priorities for frames in a video stream are already complex, and those are trivial compared to the sorts of scenarios Conway's Law introduces.
> I don’t think it’s possible to meet a latency target through prioritization when there is a fundamental lack of capacity.
This seemingly implies that it would be achievable with target deadlines/SLOs. At capacity I don't see a better solution than to give each priority (or target latency) a defined minimum resource allocation.
The “pain” experienced in an overload situation is spread among all late jobs. Contrast this with fixed-priority scheduling, where the lowest-priority jobs will be starved completely until the overload is resolved.
[0] https://en.m.wikipedia.org/wiki/Scheduling_(computing)
[1] https://onlinelibrary.wiley.com/doi/book/10.1002/97811189844...
Let's say you have two types of jobs: one that is highly parallelizable (e.g. a request to a third-party API), and one that doesn't parallelize well (video transcoding using 100% of available CPU). Then you'd want a lot of threads assigned to each CPU for the first type, but very few for the second. If they're on the same queue, you can have a few threads and be blocked by job type one most of the time, or have a lot of threads and get tons of context switching when the second type of job dominates. Either way your throughput will be very bad.
My current idea is to partition job types first by their degree of parallelizability (as in Amdahl's law), then by priority if necessary.
Has anyone tried this?
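A rough sketch of the partitioning idea, assuming a simple two-way split (the pool sizes and the `submit` helper are illustrative, not a known framework): IO-bound jobs spend most of their time blocked, so many threads per core are cheap, while CPU-bound jobs want roughly one thread per core to avoid context-switch thrash.

```python
import os
from concurrent.futures import ThreadPoolExecutor

cpu_count = os.cpu_count() or 4

# Highly parallelizable jobs (e.g. third-party API calls): mostly blocked on IO,
# so oversubscribe threads heavily relative to cores.
io_pool = ThreadPoolExecutor(max_workers=cpu_count * 16)

# Poorly parallelizable, CPU-bound jobs (e.g. transcoding): about one thread
# per core, to avoid thrashing.
cpu_pool = ThreadPoolExecutor(max_workers=cpu_count)

def submit(job, parallelizable):
    # Route by degree of parallelizability first, as suggested above.
    pool = io_pool if parallelizable else cpu_pool
    return pool.submit(job)

f = submit(lambda: sum(range(1000)), parallelizable=False)
result = f.result()
```

Within each partition you could then order by priority, as the comment suggests.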
My queuing theory notes and notation are here:
Most queue processing services that I have seen have an alarm on (a) oldest message age, and (b) number of messages in the queue.
In every team I joined I have quickly added a custom metric (c) that subtracts the time of successful processing from the time that a message was /initially/ added to the queue. This metric tends to uncover lots of nasty edge cases regarding retries, priority starving, and P99 behavior that are hidden by (a) and (b).
Having 100000 messages in the queue is only an issue if they are not being processed at (at least) 100000/s. Having a 6-hour-old message in the queue is concerning, but maybe it is an extreme outlier, so alarming is unnecessary. But you can bet your bottom dollar that if your average processing latency spikes by 10x that you want to know about it.
The other thing that is nice about an end to end latency metric is that (a) and (b) both tend to look great all the way up to the point of failure/back pressure and then they blow up excitingly. (c) on the other hand will pick up on things like a slight increase in application latency, allowing you to diagnose beforehand if your previously over-provisioned queue is becoming at-capacity or under-provisioned.
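A minimal sketch of metric (c) as described above (the field and metric names are made up for illustration). The key detail is that retries must carry the original enqueue timestamp forward rather than resetting it, or the starvation and retry pathologies stay invisible.

```python
import time

def on_enqueue(message):
    # Set only once: re-enqueues on retry keep the /initial/ timestamp.
    message.setdefault("first_enqueued_at", time.time())

def on_success(message, emit_metric):
    # End-to-end latency: successful processing time minus initial enqueue time.
    emit_metric("queue.end_to_end_latency",
                time.time() - message["first_enqueued_at"])

msg = {"body": "work"}
on_enqueue(msg)
on_enqueue(msg)  # simulated retry: timestamp unchanged
samples = []
on_success(msg, lambda name, value: samples.append((name, value)))
```

Alarming on a spike in this value catches the "slight increase in application latency" case long before (a) or (b) blow up.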
I was just talking with a Temporal solutions engineer this week and this metric is their recommended one for autoscaling on. Instead of autoscaling on queue depth, you scale on queue latency! Specifically for them they split up the time from enqueue to start, and then the time from start to done, and you scale on the former, not the total ("ScheduleToStart" in their terms).
I had much better results with a metric that shows estimated queue time for jobs that are getting enqueued right now (queue_size * running_avg_job_processing_time / parallelism).
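Spelled out as code (argument names mirror the formula in the comment; they're illustrative):

```python
def estimated_queue_time(queue_size, running_avg_job_processing_time, parallelism):
    # Expected wait for a job enqueued right now, assuming steady-state
    # processing and a roughly stable average job duration.
    return queue_size * running_avg_job_processing_time / parallelism

# e.g. 1200 queued jobs, 0.5s average each, 10 workers -> 60s of queue time
wait = estimated_queue_time(1200, 0.5, 10)
```

The nice property is that it reacts both to queue growth and to jobs getting slower, unlike raw queue depth.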
Aha! Just as the second season of Loki dropped. Makes sense now
Less sarcastically: this ties in with the article, I guess. The run_at time is the enqueue, and then you are arguing for two latencies — time from enqueue to start, and from start to complete.
Enterprise processes can wind between many intermediaries. Hours, days, weeks, maybe even months.
I used to love chained callbacks when I was 16, and later I thought threads were the greatest, and I've written a bunch of device drivers that operate at different IPLs.
But 20 years ago a cofounder made me realize that a long polling loop is easier and faster to write, and much easier to understand than threads. That insight has made countless projects simpler and easier and I recommend considering it. You may be surprised, as I was.
I am obviously referring only to operations that do actual IO or wait for work on other threads.
If you're working with the JVM, you can write simpler blocking code that is also performant through virtual threads.
Then it contains logic to handle events.
IMHO, the problem is that it's really hard for people to think in terms of "I want my job that takes 10 CPU seconds to be done in 300 wall-clock seconds". What batch-processing frameworks do, in turn, is estimate these things and figure out where to place work. You can also do things like deny requests when there isn't capacity (because you know all the scheduled work for the next quanta).
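The admission-control half of that could look something like this hypothetical sketch (the `Admitter` class and its budget model are my own illustration): accept a job only if the CPU-seconds already committed for the next quanta leave room for it.

```python
class Admitter:
    """Shed load up front instead of missing deadlines later."""

    def __init__(self, capacity_cpu_seconds):
        self.capacity = capacity_cpu_seconds
        self.committed = 0.0

    def try_admit(self, job_cpu_seconds):
        # Reject if this job would overcommit the quanta's CPU budget.
        if self.committed + job_cpu_seconds > self.capacity:
            return False
        self.committed += job_cpu_seconds
        return True

a = Admitter(capacity_cpu_seconds=100.0)
first = a.try_admit(60.0)    # fits in the budget
second = a.try_admit(50.0)   # would overcommit: rejected
```

Real frameworks would also decay `committed` as each quanta passes; this only shows the accept/deny decision.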
For example, suppose you have a burst of 1-hour-latency jobs, each of which takes 10 minutes to process. It will not take many of these to consume all available workers.
If that burst is followed by a single high-priority, 10s-latency job, welp: that job's latency objective will not be met, since the soonest a worker will free up to take the work is 10 minutes.
So I think the ideal worker pool design does include some amount of reserved capacity for low-latency work.
A general-purpose worker can of course grab low-latency work if it's idle! But the reverse is not true: an idle low-latency worker should not be picking up any long-running job.
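The asymmetric dispatch rule above fits in a few lines; this is a minimal sketch (queue and function names are illustrative) rather than a real worker loop.

```python
from collections import deque

low_latency = deque()
long_running = deque()

def next_job(worker_is_reserved):
    # Everyone prefers low-latency work when it's available.
    if low_latency:
        return low_latency.popleft()
    if worker_is_reserved:
        # Reserved capacity stays idle rather than getting stuck
        # on a 10-minute job when a 10s-SLO job might arrive.
        return None
    return long_running.popleft() if long_running else None

long_running.append("transcode")
reserved_pick = next_job(worker_is_reserved=True)    # stays free
general_pick = next_job(worker_is_reserved=False)    # takes the long job
```

The cost is some deliberately wasted capacity; the benefit is a bounded worst case for the low-latency class.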
> # If workers are busy, job2 will be run first in 11 minutes
> # If workers are too busy, both jobs will exceed their max latency.
So... priorities for tasks in a background queue.
I agree explicit latency tolerance is often a great way to do this - it lets you know what you can relax and reschedule, and if it's far enough in the future you can predict load / scale preemptively. Plus it works without having to deal with defining who gets what priority integer (or when to subdivide floats). But it degrades to the same behavior as priorities.
Thanks to all commenters for sharing their experiences and constructive opinions. It shows that this post is incomplete and far from perfect, so I just wrote a post-scriptum to improve it a bit for future readers.
https://alexis.bernard.io/blog/2023-10-15-background-job-que...
A job needs two attributes to define when it should be started: run_at and max_latency. That means the job worker only needs to order jobs by run_at + max_latency and take the first. It seems both flexible and simple.
Just considering two jobs (run_at=10,max_latency=15), (run_at=11,max_latency=13), it's clear that following that approach, the first task would be unnecessarily blocked by the second, or you'd run jobs earlier than run_at specified.
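Working through those numbers: job1's deadline is 10+15=25, job2's is 11+13=24, so the proposed ordering puts job2 first even though at t=10 only job1 is eligible to run. A few lines make the dilemma visible:

```python
jobs = [
    {"name": "job1", "run_at": 10, "max_latency": 15},  # deadline 25
    {"name": "job2", "run_at": 11, "max_latency": 13},  # deadline 24
]

# The proposed ordering: job2 sorts ahead of job1.
ordered = sorted(jobs, key=lambda j: j["run_at"] + j["max_latency"])

# But at t=10, only job1 has reached its run_at. A worker that strictly
# follows the sorted order must either idle until t=11 to run job2 first
# (blocking job1), or start job2 before its run_at.
now = 10
eligible = [j["name"] for j in jobs if j["run_at"] <= now]
```

So a single sort key can't capture both "not before run_at" and "not after run_at + max_latency" at once.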
Which is not to say I agree with the claim that latency queues and priorities can't achieve latency goals. Your hard requirements establish a minimum viable capacity, and you fill in the bubbles with softer work. Priorities let you distinguish between hard and soft, and offer fairness among the soft.
You need either dedicated workers for low latency tasks or some sort of preemption to meet SLOs with such heterogeneous tasks.
There's always an autoscaling delay, but Rails itself (and the community) doesn't seem to fit the serverless paradigm well, which is why these questions about how to design your queues come up.
I think a lot of Lambda or Cloud Run developers would instead say "well, my max instances is set to 500; I'm pretty sure I'll break something else before I hit that", you know? Especially when using the cloud's nice integrations between their queues and their event-driven serverless products, it's super easy to get exactly as much compute as you need to keep your latency really low.