What's good about offset pagination; designing parallel cursor-based web APIs (opens in new tab)

(brandur.org)

100 pointsclra5y ago51 comments

51 comments

To point out the obvious: generally API providers don’t particularly want you to pararelize your request (they even implement rate limiting to make it harder on purpose). If they wanted to make it easy to get all the results, they would allow you to access the data without pagination - just download all the data in one go.

eyelidlessness5y ago

A certain level of parallelism is generally within the realm of good API citizenship. Even naive rate limiting schemes tend to permit a certain number of concurrent requests (as they well should, since even browsers may perform concurrent requests without any developer intervention).

Rate limiting and pagination aren’t (necessarily) about making full data consumption more difficult. They’re more often about optimizing common use cases and general quality of service.

Edit to add: in certain circles (eg those of us who take REST and HATEOAS as baseline HTTP API principles), parallelism is often not just expected but often encouraged. A service can provide efficient, limited subsets of a full representation and allow clients to retrieve as little or as much of the full representation as they see fit.

corty5y ago

One thing that frequently bugs me is APIs limiting number of items per page for reasons of efficiency. I can perfectly understand low limits for other reasons, like not helping people scrape your data.

But limiting for efficiency is usually done in a way that I would call a cargo cult: First, the number of items per "page" is usually a number one would pick per displayed page, in the range of 10 to 20. This is inefficient for the general case, the amount of data transmitted is usually just the same size as the request plus response headers. So if the API isn't strictly for display purposes, pick a number of items per page that gives a useful balance between not transmitting too much useless data, but keeping query and response overhead low. Paginate in chunks of 100kB or more.

In terms of computation and backend load, pagination can be as expensive for a 1-page-query as for a full query. Usually this occurs when the query doesn't directly hit an index or similar data structure where a full sweep over all the data cannot be avoided. So think and benchmark before you paginate, and maybe add an index here and there.

jerf5y ago

Strongly agree. I have an API I work on where if you ask for a couple of gigabytes of data, it'll send it to you, because if you're asking for it, that's what you want. It gets streamed out, and the docs warn you that you will get exactly what you asked for, so you may want to chunk up on your side (for this particular API there is a trivial way for clients to do that), or if you can handle a full stream, go nuts.

Pagination would just complicate things. I think with most APIs, intended as APIs (i.e., not just an endpoint primarily meant to feed a front-end page), you're better off thinking of your default as "I'm going to just stream everything they ask for", and look for reasons why that won't work, rather than start from the presumption that everything must be paginated from the beginning.

Don't get me wrong; there are plenty of solid reasons to paginate. You may discover one applies to your API. But if you can leave it out, it's often simpler for both the producer and the consumer. Wait until you find the need for it. Plus, if that happens, you'll have a better understanding of the actual problem you need to solve and better solutions may reveal themselves.

Groxx5y ago

Pagination for a one-page-query is rarely the same cost in my experience, in real-world scenarios.

In very simple cases, like a single table sql query, absolutely - databases effectively have to compute the full result, sort it, and return a window. There's almost no reason to paginate here, at an API level, unless the consumer wants only a subset (say, bandwidth limitations). Sending it all at once can be a huge benefit for those that will use it all, it's both simpler and faster for all parties.

But in most real-world cases, there are at least two additional details that can add significant response time: joins (when not involved in sorting) and additional data-gathering needed to fully build the response (e.g. getting data from other systems, internal or external). Joined data is not typically loaded prior to computing limit/offset since it may be a massive waste, and external data is effectively the same issue but with far higher latency.

And that's before getting into other practical issues, e.g. systems that can't process the response stream as it comes in - a subset will load-and-return faster than the whole content in all cases, so e.g. a website loading some json can show initial UI faster while loading more in the background. Streaming is often possible and that'll negate a lot of the downsides, but it's far less common than processing a request only after it completes.

yxhuvud5y ago

Strongly disagree. I've seen too many cases of api users that are overfetchting for no reason. I don't mind providing a bulk api, but that is a very different use case that regular endpoints shouldn't have to support.

cunac5y ago

my intent with pagination is always to prevent problems with open ended size of data set. 1000 at once is usually not a problem but 10x , 100x etc.. is a big problem for transferring over the wire.

sb82445y ago

> If they wanted to make it easy to get all the results

Speaking from experience...we want to make it easy but also want to keep it performant. Getting the data all in one go is generally not performant and is easy to abuse as an API consumer. For example, always asking for all of the data rather than maintaining a cursor and secondary index (which is so much more performant for everyone involved).

alexchamberlain5y ago

We provide (internal) access to data where we provide interactive access via GraphQL-based APIs and bulk access via CSV or RDF dumps - I feel like dump files are grossly undervalued these days.

sb82445y ago

I agree. I am going to reflect on this and see if there's a way to support dump files long term in our app. We sorta support it today but it's ad hoc implementation since an export can range from a few hundred of a thing to tens of millions of a thing.

Is there any good literature or patterns on supporting dumps in the tens of millions or larger?

I wrote a sheets plug-in that uses our cursor API to provide a full dump into a spreadsheet. Our professional services team is in love with it, so I bet they'd love generic data export capability.

1 more reply

tshaddox5y ago

That’s the point. Running multiple paginated queries in parallel is essentially circumventing the API provider’s intent to limit the number of items requested at one time.

eyelidlessness5y ago

If your API doesn’t support a single client making ~6 concurrent requests, let me tell you about browsers since the last millennium.

1 more reply

felixhuttmann5y ago

A few thoughts:

1) AWS dynamodb has a parallel scanning functionality for this exact use case. https://docs.aws.amazon.com/amazondynamodb/latest/developerg...

2) A typical database already internally maintains an approximately balanced b-tree for every index. Therefore, it should in principal be cheap for the database to return a list of keys that approximately divide the keyrange into N similarly large ranges, even if the key distribution is very uneven. Is somebody aware of a way where this information could be obtained in a query in e.g. postgres?

3) The term 'cursor pagination' is sometimes used for different things, either referring to an in-database concept of cursor, or sometimes as an opaque pagination token. Therefore, for the concept described in the article, I have come to prefer the term keyset pagination, as described in https://www.citusdata.com/blog/2016/03/30/five-ways-to-pagin.... The term keyset pagination makes it clear that we are paginating using conditions on a set of columns that form a unique key for the table.

ComodoHacker5y ago

>a way where this information could be obtained in a query

There's no standard way because index implementation details are hidden for a reason.

>in e.g. postgres

You can query pg_stats view (histogram_bounds column in particular) after statistics are collected.

adontz5y ago

I believe data export and/or backup should be a separate API, which is low priority and ensures consistency.

Here we just see regular APIs are being abused for data export. I'm rather surprised the author did not face rate limiting.

eyelidlessness5y ago

Coming from a REST perspective, I wouldn’t implement a separate API, I would use HTTP semantics (eg headers or, if truly necessary query params) on the resource listing to indicate the export/sync intention. Likely with an Accept header. If pagination is still preferred/required, the service could return an ETag or some other continuation token which when provided in subsequent responses could be used to indicate the consistent snapshot being requested. Since this is entirely optional, clients could use this mechanism to opt into stable/parallelizable requests (as I described in less specificity in another sub thread).

At this point, it these requests are expensive you have an opportunity to use a very simple (and optimistic) cache for good faith API users, relegate rate limiting to prevent abuse of cache creation (which should be even easier to detect than just overzealous parallelism), and even use the same or similar semantics to implement deltas for subsequent export/sync.

adontz5y ago

I hardly imagine consistent integral paginated data view without creating a snapshot. I would be manual MVCC implementation or something. Separate API seems a much simpler solution to me.

eyelidlessness5y ago

I think keeping temporal history and restricting paginated results to the data at the point in time where the first page was retrieved would be a pretty decent way to solve offset based interfaces (regardless of the complexity of making the query implementation efficient). Data with a lot of churn could churn on, but clients would see a consistent view until they return to the point of entry.

Obviously this has some potential caveats if that churn is also likely to quickly invalidate data, or revoke sensitive information. Time limits for historical data retrieval can be imposed to help mitigate this. And individual records can be revised (eg with bitemporal modeling) without altering the set of referenced records.

ako5y ago

I think for most use cases, as a user i'd rather see the newest items in a list, then consistency of pagination. If i forget to manually refresh, i might miss out on important new items.

Why do you think it is important for users to have temporal consistency?

eyelidlessness5y ago

Well I’ll use a recent example I encountered that was actually very frustrating. I was looking for a font to use for a logo for a personal project. The site I was using (won’t name and shame, and I can’t recall the site now anyway) had no sorting options, items were ordered by whatever “popularity” formula they use. As I paginated, many of the fonts I’d previously viewed would appear on subsequent pages, often in a different order. It was frustrating not just because I could tell that I was probably missing fonts that were being bumped up to previous pages, but also because it made me doubt my mental model of my own browsing history: “Did I navigate back too far? Did I forget a tangential click and end up on a different search path?”

It’s not a great UX. And in some ways I suspect that my own views were at least partially causing it, which made me more hesitant to even click on anything unless I was sure it was worth the disruption.

ako5y ago

This doesn't sound like a "temporal consistency" problem, rather an inconsistent and untransparent ordering issue.

1 more reply

ppeetteerr5y ago

Pagination of an immutable collection is one thing and can be parallelized. Pagination of a mutable collection (e.g. a database table), on the other hand, is risky since two requests might return intersecting data if new data was added between the requests being executed.

True result sets require relative page tokens and a synchronization mechanism if the software demands it.

simonw5y ago

Intersecting data is fine provided there's a unique ID for each result that can be used to de-duplicate them.

Ideally I'd want a system that guarantees at-least-once delivery of every item. I can handle duplicates just fine, what I want to avoid is an item being missed out entirely due to the way I break up the data.

ppeetteerr5y ago

It's more than just de-duplicating, tho. Imagine you query a dataset and get something like a page count and a chunk size. That page count cannot be trusted if the dataset is mutable. If an item is inserted at the beginning of the set, you're going to miss the last item.

Pagination is hard

the_arun5y ago

For dynamic usecase, DynamoDB has implemented pagination with something called lastEvaluatedKey - https://docs.aws.amazon.com/amazondynamodb/latest/developerg...

This is different from LIMIT in RDBMS

Wouldn’t this pattern solve the complexity you are talking about?

1 more reply

jasonhansel5y ago

It's important here that "created" is an immutable attribute. Otherwise you could get issues where the same item appears on multiple lists (or doesn't appear at all) because its attributes changed during the scanning process.

arcbyte5y ago

I think you could accomplish something similar with token pagination by requesting a number of items that will result in multiple "pages" for your user interface. Then as the user iterates through you can request additional items. This isn't parallelizing, but provides the same low-latency user experience.

gigatexal5y ago

From the code sample in the article I didn’t know you could append to a slice from within a go func

mssundaram5y ago

As long as you use the mutex locks

gigatexal5y ago

Of course. I see that now it’s so obvious not sure why I didn’t see that earlier.

draw_down5y ago

> it uses offsets for pagination... understood to be bad practice by today’s standards. Although convenient to use, offsets are difficult to keep performant in the backend

This is funny. Using offsets is known to be bad practice because.... it’s hard to do.

Look I’m just a UI guy so what do I know. But this argument gets old because I’m sorry, but people want a paginated list and to know how many pages are in the list. Clicking “next page” 10 times instead of clicking to page 10 is bullshit, and users know it.

yxhuvud5y ago

No, what is bullshit is having the option to go to page 10 in the first place. If the user does that then the UI is already broken. What is needed is good filter abilities.

1 more reply

juancn5y ago

The hardest part is to be able to parallelize the sorts in the backend and keep the working sets reasonably sized.

If you ask for "first 50 items after key X", you just need to keep a priority queue of size 50 on each BE node and merge them before returning them (I'm assuming a distributed backend). It doesn't matter on which page you are.

But if you specify "first 50 items after element N" it gets really tricky, each BE shard needs to sort the first N elements, and it can use some trickery to avoid doing a naive merge (see: https://engineering.medallia.com/blog/posts/sorting-and-pagi... ).

You can at most save some transfer over the network.

Rafert5y ago

How do people even figure out they need to be on page 10? Sounds like a filter of some kind (e.g. before a date) would be better, except they trained themselves to use x amount of pages to approximate.

Either way, 10 pages isn't so bad but tens of thousands can become troublesome as explained on https://shopify.engineering/pagination-relative-cursors

PurplePanda5y ago

Maybe they've been there before and happen to remember "page 10".

Maybe someone else has been there before and told them "Go to page ten".

Maybe you know that there are 20 pages, and are looking to find the median, or are just informally playing around, looking at the distribution.

Same as you'd do with an interesting page of a book. I don't think I'd stop using page numbers in dead tree books if they somehow came with filters.

alexchamberlain5y ago

I can't find one right now, but I feel like there must be an algorithm that can identify the key that can be used for each page, yet cheaper than a full sort (ie cheaper than offset pagination in generality). Of course, a skip list would work for a secondary index.

juancn5y ago

This is the best one I've been able to come up with: https://engineering.medallia.com/blog/posts/sorting-and-pagi...

j / k navigate · click thread line to collapse

51 comments

gampleman5y ago

eyelidlessness5y ago

Rate limiting and pagination aren’t (necessarily) about making full data consumption more difficult. They’re more often about optimizing common use cases and general quality of service.

corty5y ago

jerf5y ago

Groxx5y ago

Pagination for a one-page-query is rarely the same cost in my experience, in real-world scenarios.

yxhuvud5y ago

cunac5y ago

my intent with pagination is always to prevent problems with open ended size of data set. 1000 at once is usually not a problem but 10x , 100x etc.. is a big problem for transferring over the wire.

sb82445y ago

> If they wanted to make it easy to get all the results

alexchamberlain5y ago

We provide (internal) access to data where we provide interactive access via GraphQL-based APIs and bulk access via CSV or RDF dumps - I feel like dump files are grossly undervalued these days.

sb82445y ago

Is there any good literature or patterns on supporting dumps in the tens of millions or larger?

I wrote a sheets plug-in that uses our cursor API to provide a full dump into a spreadsheet. Our professional services team is in love with it, so I bet they'd love generic data export capability.

1 more reply

tshaddox5y ago

That’s the point. Running multiple paginated queries in parallel is essentially circumventing the API provider’s intent to limit the number of items requested at one time.

eyelidlessness5y ago

If your API doesn’t support a single client making ~6 concurrent requests, let me tell you about browsers since the last millennium.

1 more reply

felixhuttmann5y ago

A few thoughts:

1) AWS dynamodb has a parallel scanning functionality for this exact use case. https://docs.aws.amazon.com/amazondynamodb/latest/developerg...

ComodoHacker5y ago

>a way where this information could be obtained in a query

There's no standard way because index implementation details are hidden for a reason.

>in e.g. postgres

You can query pg_stats view (histogram_bounds column in particular) after statistics are collected.

adontz5y ago

I believe data export and/or backup should be a separate API, which is low priority and ensures consistency.

Here we just see regular APIs are being abused for data export. I'm rather surprised the author did not face rate limiting.

eyelidlessness5y ago

adontz5y ago

I hardly imagine consistent integral paginated data view without creating a snapshot. I would be manual MVCC implementation or something. Separate API seems a much simpler solution to me.

eyelidlessness5y ago

ako5y ago

I think for most use cases, as a user i'd rather see the newest items in a list, then consistency of pagination. If i forget to manually refresh, i might miss out on important new items.

Why do you think it is important for users to have temporal consistency?

eyelidlessness5y ago

ako5y ago

This doesn't sound like a "temporal consistency" problem, rather an inconsistent and untransparent ordering issue.

1 more reply

ppeetteerr5y ago

True result sets require relative page tokens and a synchronization mechanism if the software demands it.

simonw5y ago

Intersecting data is fine provided there's a unique ID for each result that can be used to de-duplicate them.

ppeetteerr5y ago

Pagination is hard

the_arun5y ago

For dynamic usecase, DynamoDB has implemented pagination with something called lastEvaluatedKey - https://docs.aws.amazon.com/amazondynamodb/latest/developerg...

This is different from LIMIT in RDBMS

Wouldn’t this pattern solve the complexity you are talking about?

1 more reply

jasonhansel5y ago

arcbyte5y ago

gigatexal5y ago

From the code sample in the article I didn’t know you could append to a slice from within a go func

mssundaram5y ago

As long as you use the mutex locks

gigatexal5y ago

Of course. I see that now it’s so obvious not sure why I didn’t see that earlier.

draw_down5y ago

> it uses offsets for pagination... understood to be bad practice by today’s standards. Although convenient to use, offsets are difficult to keep performant in the backend

This is funny. Using offsets is known to be bad practice because.... it’s hard to do.

yxhuvud5y ago

No, what is bullshit is having the option to go to page 10 in the first place. If the user does that then the UI is already broken. What is needed is good filter abilities.

1 more reply

juancn5y ago

The hardest part is to be able to parallelize the sorts in the backend and keep the working sets reasonably sized.

You can at most save some transfer over the network.

Rafert5y ago

Either way, 10 pages isn't so bad but tens of thousands can become troublesome as explained on https://shopify.engineering/pagination-relative-cursors

PurplePanda5y ago

Maybe they've been there before and happen to remember "page 10".

Maybe someone else has been there before and told them "Go to page ten".

Maybe you know that there are 20 pages, and are looking to find the median, or are just informally playing around, looking at the distribution.

Same as you'd do with an interesting page of a book. I don't think I'd stop using page numbers in dead tree books if they somehow came with filters.

alexchamberlain5y ago

juancn5y ago

This is the best one I've been able to come up with: https://engineering.medallia.com/blog/posts/sorting-and-pagi...

j / k navigate · click thread line to collapse