This is also sometimes known as keyset pagination. My favourite technical explanation of that is here: https://use-the-index-luke.com/no-offset
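For anyone who hasn't seen it, the core idea is small enough to show in a few lines. A minimal sketch (table and column names invented for illustration, SQLite for brevity): instead of `OFFSET`, you filter past the last row you already returned, so the database can seek straight to it via the index.

```python
import sqlite3

# Hypothetical "items" table; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO items (id, name) VALUES (?, ?)",
                 [(i, f"item-{i}") for i in range(1, 101)])

PAGE_SIZE = 10

def first_page(conn):
    return conn.execute(
        "SELECT id, name FROM items ORDER BY id LIMIT ?", (PAGE_SIZE,)
    ).fetchall()

def next_page(conn, last_seen_id):
    # Keyset ("seek") pagination: filter past the last row we saw,
    # rather than using OFFSET and discarding skipped rows.
    return conn.execute(
        "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, PAGE_SIZE),
    ).fetchall()

page1 = first_page(conn)
page2 = next_page(conn, page1[-1][0])
print(page2[0])  # → (11, 'item-11')
```

The cost of fetching page N stays flat instead of growing with N, which is exactly the point the linked article makes.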
Possibly just an implementation error in the BlockJoinParser, but it did not occur with numeric pagination.
In order to work with that, you need to “flatten” your JSON (e.g. with the @JsonUnwrapped annotation), and some structures (like arrays) may become problematic and/or require significant lexical mapping of queries onto the dataset.
No, cursor is this https://en.wikipedia.org/wiki/Cursor_(databases) https://www.postgresql.org/docs/current/plpgsql-cursors.html
I once did pagination using database cursors, which is something different from keyset pagination: the server would keep a cursor open and keep fetching more data from the same query. This gave the system an interface similar to offset pagination (you get the first page, then the second page, etc.) but without running a new query for each page and discarding the first n-1 pages every time.
The downside is that it makes the server stateful and doesn't scale (you would need to keep hundreds of cursors open if you had hundreds of simultaneous users).
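To make the statefulness concrete, here's a rough sketch of the pattern (illustrative names; SQLite's client-side cursor stands in for a real server-side one): the query runs once, and every "page" is just the next batch pulled from the same open cursor.

```python
import sqlite3

# Hypothetical "events" table; names are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(i, f"e{i}") for i in range(1, 26)])

# One query, one open cursor.
cur = conn.execute("SELECT id, payload FROM events ORDER BY id")

# Each "page" is the next batch from that same cursor; nothing is re-run
# or skipped, but `cur` must stay alive between requests, which is the
# statefulness (and scaling problem) described above.
page1 = cur.fetchmany(10)
page2 = cur.fetchmany(10)
page3 = cur.fetchmany(10)  # only 5 rows left
```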
You can read more about it here: https://aaronfrancis.com/2022/efficient-pagination-using-def... or here https://planetscale.com/blog/fastpage-faster-offset-paginati....
There are libraries for Laravel (https://github.com/hammerstonedev/fast-paginate) and Rails (https://github.com/planetscale/fast_page) as well!
Cursor based pagination is wonderful, but sometimes you're stuck with offset/limit for whatever reason. Might as well make it fast.
MySQL is fairly predictable, though, so when you understand that it wants to nested-loop join all your rows before evaluating predicates on the parent table, it's a predictable win to stop it doing that.
The technique is still applicable even when you have no joins, because MySQL will materialize rows with every selected column before evaluating the unindexed portion of the predicate, and the order by.
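The trick those libraries use is often called a "deferred join." A sketch of the shape of it (invented table, SQLite for brevity): apply LIMIT/OFFSET to a narrow subquery over just the indexed key, then join back for the wide columns, so only the rows actually returned get materialized.

```python
import sqlite3

# Hypothetical "posts" table with a wide body column.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE posts (
    id INTEGER PRIMARY KEY, title TEXT, body TEXT)""")
conn.executemany("INSERT INTO posts VALUES (?, ?, ?)",
                 [(i, f"t{i}", "x" * 100) for i in range(1, 1001)])

# Deferred join: paginate over an index-only subquery of ids, then join
# back for the full rows. The skipped 500 rows never get materialized
# with all their columns.
rows = conn.execute("""
    SELECT p.id, p.title
    FROM posts AS p
    JOIN (SELECT id FROM posts ORDER BY id LIMIT 10 OFFSET 500) AS page
      ON page.id = p.id
    ORDER BY p.id
""").fetchall()
```

Same offset semantics as before, but the expensive per-row work only happens for the 10 rows on the page.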
A result you haven’t yet seen can be mutated so that it sorts before your current cursor; likewise, a result you’ve already seen can be mutated so that it sorts after the current cursor, causing you to see it twice.
Cursor based pagination does minimize the issue somewhat, because only the mutated rows are possibly affected, not unrelated records that happen to lie on page boundaries. But depending on the use case I’m not sure that is worth the added complexity of cursor based pagination (it does get a bit tricky once you have any non-trivial sort clause).
Expensive to implement, so this would only work for situations where you can afford to spend significant storage and computation resources to keep a specific user happy.
Personally, if you care about users not missing any item in a query, you just can't use pagination at all, and you have to give them every item in the query in a single huge dump (running the query again would be the "explicit action" mentioned above that gets the user new data). Conversely, if you use pagination, users are free to assume that they might miss some items unless they already expect the underlying data to be immutable.
You probably also need to switch from deleting records to adding an archive bit (or timestamp).
This gets complicated fast.
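The archive-bit idea in miniature (schema and names invented for illustration): instead of DELETE, stamp the row, so records referenced by outstanding cursors don't vanish out from under them.

```python
import sqlite3

# Hypothetical "docs" table with a soft-delete timestamp column.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE docs (
    id INTEGER PRIMARY KEY, title TEXT, archived_at TEXT)""")
conn.executemany("INSERT INTO docs (id, title) VALUES (?, ?)",
                 [(i, f"doc{i}") for i in range(1, 6)])

# "Delete" doc 3 by stamping it rather than removing the row.
conn.execute("UPDATE docs SET archived_at = datetime('now') WHERE id = ?",
             (3,))

# Every live-data query must now remember to filter archived rows,
# which is part of why this gets complicated fast.
live = conn.execute(
    "SELECT id FROM docs WHERE archived_at IS NULL ORDER BY id").fetchall()
```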
For instance, if you have 40 entries and specify that there should be 10 items per page, you end up with 4 pages
40 entries??? Just show them all to me. My browser has a scrollbar, CTRL-F is much faster than your search box.
"but not everyone has a reliable fast connection" -- yes, which is a good reason to deliver more data per http request than breaking it up and requiring lots of slow requests.
"but the database load" -- half the queries, each returning 2x data, is almost always going to be easier on a RDBMS. If it's not then you probably need to rethink your schema.
You get to see more ads while flipping through pages.
Timing metrics for loading smaller pages make marketing happy.
Timing metrics for time spent on the website make marketing happy.
Because data retrieval costs money and resources. Unbounded API responses are susceptible to denial-of-service. Well-architected APIs will always cap the number of results they return per call.
It really depends on what kind of site we're talking about; there's loads of times I want to see loads of results. Sometimes I don't, but there's no harm in showing me more than I want, either.
In the general case this isn’t a good excuse, there’s no reason it should take that long. But you don’t always get to pick your technology.
It’ll be like Jira, pulling in literal megabytes of data and still slow.
HN is pretty minimalistic, but the point is that text data isn't really what makes requests very large. For reference, 2MB is enough to include an entire book and then some. All of Dune is about 1.7M uncompressed in epub format (and ~1.1M compressed as an epub file). HN threads, Jira discussions, etc. can get pretty large, but I have not seen any yet that have started to approach the size of Dune.
Images might add more data, but those can be loaded on-demand when scrolled into view pretty easily.
Indeed.com's search results are almost all text. Maybe a company avatar. No reason it can't provide 100+ job listings per page rather than 10-20.
I agree with OP that if you are on a powerful machine with fast internet it would be better just to load a massive HTML document of list items.
His site is a great resource for anyone wanting to take a deeper dive on SQL performance:
https://use-the-index-luke.com/sql/partial-results/fetch-nex...
https://old.reddit.com/?count=25&after=t3_wtpvdp
I noticed Reddit's pagination has that "after" parameter, which points to the last post on the current page.
It glitches out if the last item is deleted by moderators, but otherwise it works smoothly.
https://en.wikipedia.org/w/index.php?title=Category:Living_p...
No need for row value syntax and it works with MS SQL Server
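For anyone comparing the two forms, here's a sketch with an invented two-column key (SQLite for brevity; it happens to support both forms): the row-value comparison and the expanded OR/AND predicate that databases without row values (e.g. SQL Server) need are interchangeable.

```python
import sqlite3

# Hypothetical table with a compound primary key (a, b).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, PRIMARY KEY (a, b))")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(a, b) for a in range(3) for b in range(3)])

cursor = (1, 1)  # last (a, b) seen on the previous page

# Row-value syntax (PostgreSQL, MySQL, SQLite 3.15+):
rv = conn.execute(
    "SELECT a, b FROM t WHERE (a, b) > (?, ?) ORDER BY a, b", cursor
).fetchall()

# Equivalent expanded form for databases without row values:
expanded = conn.execute(
    "SELECT a, b FROM t WHERE a > ? OR (a = ? AND b > ?) ORDER BY a, b",
    (cursor[0], cursor[0], cursor[1]),
).fetchall()

assert rv == expanded  # same page either way
```

The expanded form gets verbose as the sort key grows, but it expresses exactly the same predicate.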
> As we saw, plain keyset pagination offers no facility to jump a certain percentage into the results except through client guesswork. However the PostgreSQL statistics collector maintains per-column histograms of value distribution. We can use these estimates in conjunction with limits and small offsets to get fast random-access pagination through a hybrid approach.
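To sketch what "use these estimates" might look like in practice (simulated numbers, illustrative names — the real bounds would come from `SELECT histogram_bounds FROM pg_stats WHERE tablename = ... AND attname = ...`): pg_stats keeps equal-frequency histogram bounds per column, so each adjacent pair of bounds brackets roughly the same fraction of rows, and you can map a target percentage straight to a starting key.

```python
# Simulated equal-frequency histogram bounds for an "id" column; with 11
# bounds, each adjacent pair brackets ~10% of the rows.
bounds = [1, 105, 198, 306, 395, 511, 604, 698, 807, 901, 1000]

def start_key_for_fraction(bounds, fraction):
    """Estimate a starting key near `fraction` of the way through the data."""
    bucket = int(fraction * (len(bounds) - 1))
    return bounds[bucket]

# To land ~40% in, keyset-seek to the estimated key, then correct with a
# small LIMIT/OFFSET, e.g. (PostgreSQL):
#   SELECT * FROM items WHERE id >= %s ORDER BY id LIMIT 10 OFFSET %s
key = start_key_for_fraction(bounds, 0.4)
print(key)  # → 395
```

The offset only has to cover the estimation error within one bucket, so it stays small regardless of how deep the jump is.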
Unless you can guarantee your data is static, or that rows are only ever appended in sort order and never mutated, the notion of which data belongs on which page can change every millisecond.
The name "cursor pagination" is super confusing given that "cursor" is an overloaded term in databases. I always call this "token pagination", given that the APIs I've seen usually call the value a token.
[1] https://gist.github.com/jvolkman/b8c0e3d05929a1506c99fbc9474...
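One common way APIs make the token opaque, sketched below with invented field names: serialize the sort key of the last row returned and base64 it, so clients treat it as an uninterpreted blob while the server can decode it back into a keyset predicate.

```python
import base64
import json

def encode_token(last_row):
    # The token is just the keyset cursor (sort key of the last row
    # returned), serialized and base64'd so it looks opaque to clients.
    payload = {"id": last_row["id"], "created_at": last_row["created_at"]}
    return base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()

def decode_token(token):
    return json.loads(base64.urlsafe_b64decode(token.encode()))

token = encode_token({"id": 42, "created_at": "2023-01-05T00:00:00Z"})
cursor = decode_token(token)
```

The opacity also buys the server freedom to change the token's contents later without breaking clients.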
So, anyone here implement pagination via cursors? What do you find to be the drawbacks and how do you mitigate them?
One corner case would be if the cursor record is deleted. I don't see it mentioned how they handle it.
Or is there a trick here I’m missing?
Do you just crash and ask the user to start over?
Do you have to nudge open cursors on every delete?
That way it doesn't matter if the record is deleted - I can still return the next page by showing records that come after that provided cursor value.
There's an example on this page: https://latest.datasette.io/fixtures/sortable?_sort=sortable
Since the table is sorted by the "sortable" column, the next page link includes this:
?_next=15%2Cg%2Cz
15 is the last value for "sortable" on the page, then g,z are the compound primary key for that last row.

1. I can set how many items per page.
2. I can get to an arbitrary page either through a path/ query param or at least close with a pagination row that contains a way to jump around. If an item gets removed, whatever I was looking for should still be in the same vicinity.
3. As a result of #1 and #2, I can go back and find items I saw previously because their placement is at some fairly reliable position on the page I remember it being on or around.
You know, like how a physical book or catalogue with pages works.
Please stop trying to improve on this and making usability more difficult. I hate infinity scroll and I hate not being able to pick my page in an intuitive fashion.
I've had quite a few heated discussions on this point. The problem is, once your dataset gets large enough, this use case is incredibly difficult to scale; not impossible (Google does it with millions of search results), but prohibitively expensive compared to how often the need of being able to jump to any specific page arises.
Now I always try to stick to cursor based pagination as the default in order to prevent people from building workflows on top of offsets.
Google cheats, but in a way that very few users will notice. You can only access the first 20 pages of search results. Depending on user behavior, this is one way to offer navigation via page number while limiting worst-case cost.
While it can be frustrating, I doubt much revenue (relatively speaking) was lost due to this issue, which means for most apps and services, it will likely not get done.
(I'm not disputing the merit of your arguments, just explaining why this will rarely get done well in real life...)
Pagination is incredibly useful to humans. If I tell you I found something on page 15 you can relate to it, something I cannot do with infinite scroll.
Not easy to do of course, but it’s one direction you can go.
At scale you might care about duplicate or out-of-date records. But cursor-based pagination doesn't solve the problem if a results page is left open for "too long" and stuff was added behind your cursor (or after it, if navigating backwards).
It's as if making things less intuitive (compared to the article's reference to book pages) only seems easier as long as you don't think about the pitfalls.
My suggestion is to just use pages, and optimise for the right order (i.e. sequential IDs, creation date, alphabetical, etc.) that makes sense for your data.
If you REALLY must know whether results have changed, storing some delta would be best (like a timestamp that lets the server side say "hey, your results are out of date, from 7 days ago; maybe you left that page/API response unused for too long").
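That staleness check is tiny to implement. A sketch (all names illustrative; the 7-day threshold just mirrors the example above): tag each result set with the time it was produced, and flag it when the client resumes much later.

```python
from datetime import datetime, timedelta, timezone

# Threshold mirrors the "7 days ago" example above; purely illustrative.
STALE_AFTER = timedelta(days=7)

def is_stale(results_as_of, now=None):
    """True if a result set produced at `results_as_of` is past the threshold."""
    now = now or datetime.now(timezone.utc)
    return now - results_as_of > STALE_AFTER

as_of = datetime(2023, 1, 1, tzinfo=timezone.utc)
print(is_stale(as_of, now=datetime(2023, 1, 9, tzinfo=timezone.utc)))  # → True
```

The server would include `results_as_of` alongside (or inside) the cursor token and run this check on each resumed request.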
Except that no such information appears in the article. Here's the entirety of the explanation:
>> Once you’ve been returned a cursor, you can provide it in your requests to act as a farther-along starting point within your data entries. The server is then able to efficiently skip all entries that come before your specified cursor value
Everything else has big tradeoffs. You can make sure your rows are sortable by a specific column, but that means your searches must all be sorted by that column.
You can save the results of a search and page through it, but that's a lot of I/O.
You can not do any of the above, but then you run the risk of non-deterministic pagination (I went back and because someone added/deleted a row, that page is slightly different now). You also can't really link to it.
These things might all be fine. In my experience though, you run into people who really want to do search in way A, and will use the tradeoffs for way B/C/D to argue against it, as though A has no tradeoffs.
The legacy system we're replacing used cursor pagination in a lot of areas, and that was perfectly acceptable to users, but now that we're going web-based it's not. Unfortunate, really, since it seems vastly superior.
I recommend using elastic search or a nosql database to optimise performance. Relational databases can be slow for this use case.
And the token has enough metadata to reconstruct the query from where it left off in the previous call, and to perform additional security checks so that it cannot be used by other customers to get results from a different account.
I have actually migrated databases from Aurora to DynamoDb behind the scenes where customers continued to call these APIs with these tokens and the calls would flip to the new DB and resume from the last place. No downtime or "maintenance window" which would be a non-starter at AWS anyway.
I recommend this for any external service. For internal services, if you work closely enough with your users, maybe you can opt for something more transparent.
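A sketch of how such a token could be made tamper-evident and account-scoped (key, field names, and wire format all invented for illustration; this is not AWS's actual scheme): the payload carries the resume position plus the account id, and an HMAC binds the two, so a token minted for one account fails verification for another. Because the token carries a logical position rather than a database handle, the backend behind it can change, as in the Aurora-to-DynamoDB migration described above.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"server-side-secret"  # illustrative; keep real keys out of code

def make_token(account_id, last_key):
    # Payload = resume position + the account it belongs to.
    payload = json.dumps({"acct": account_id, "last": last_key}).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode() + "." +
            base64.urlsafe_b64encode(sig).decode())

def read_token(token, account_id):
    payload_b64, sig_b64 = token.split(".")
    payload = base64.urlsafe_b64decode(payload_b64)
    sig = base64.urlsafe_b64decode(sig_b64)
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    data = json.loads(payload)
    # Reject both forged tokens and tokens replayed from another account.
    if not hmac.compare_digest(sig, expected) or data["acct"] != account_id:
        raise ValueError("invalid token for this account")
    return data["last"]

token = make_token("acct-1", 1234)
```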