Permanent identifiers should not carry data. This is like the cardinal sin of data management. You always run into situations where the thing you thought, "surely this never changes, so it's safe to squeeze into the ID to save a lookup". Then people suddenly find out they have a new gender identity, and they need a last final digit in their ID numbers too.
Even if nothing changes, you can run into trouble. Norwegian PNs have your birth date (in DDMMYY format) as the first six digits. Surely that doesn't change, right? Well, wrong, since although the date doesn't change, your knowledge of it might. Immigrants who didn't know their exact date of birth got assigned 1. Jan by default... And then people with actual birthdays on 1 Jan got told, "sorry, you can't have that as birth date, we've run out of numbers in that series!"
Librarians in the analog age can be forgiven for cramming data into their identifiers, to save a lookup. When the lookup is in a physical card catalog, that's somewhat understandable (although you bet they could run into trouble over it too). But when you have a powerful database at your fingertips, use it! Don't make decisions you will regret just to shave off a couple of milliseconds!
To me, what your example really shows is the problem with incorrect default values, not a problem with encoding data into a key per se. If they'd chosen a non-date for unknown values, maybe 00 or 99 for day or month components, then the issue you described would disappear.
But in case, the intention for encoding a timestamp into a UUID isn't for any implied meaning. It's both to guarantee uniqueness with a side effect that IDs are more or less monotonically increasing. Whether this is actually desirable depends on your application, but generally if the application is as a indexed key for insertion into a database, it's usually more useful for performance than a fully random ID as it avoids rewriting lots of leaf-nodes of B-trees. If you insert a load of these such keys, it forms a cluster on one side of the tree that can the rebalance with only the top levels needing to be rewritten.
You still have that problem from organic birthdays and also the problem of needing to change ids to correct birth dates.
Maybe the answer is to evenly spread the defaults over 365 days.
well, till you run out of numbers for the immigrants that don't have exact birth date
YY.MM.DD-AAA.BB
In either the AAA or BB component there is something about the gender.
But it does mean that there is a limit of people born per day of a certain gender.
But for a given year, using a moniker will only delay the inevitable. Sure, there are more numbers, but still limited as there are SOME parts that need to reflect reality. Year, gender (if that's still the case?) etc.
> Norwegian PNs have your birth date (in DDMMYY format) as the first six digits.
You can already feel the disaster rising because sone program expects always the latter.
And it doesn’t fix the problem, it just makes it less likely.
I don't agree with the absolute statement, though. Permanent identifiers should not generally carry data. There are situations where you want to have a way to reconciliate, you have space or speed constraints, so you may accept the trade off, md5 your data and store it in a primary index as a UUID. Your index will fragment and thus you will vacuum, but life will still be good overall.
Random vs time biased uuids are not a decision to shave off ms that you will regret.
Most likely they will be a decision that shaves off seconds (yes, really - especially when you consider locality effects) and you'll regret nothing.
You can choose to never make use of that property. But it's tempting.
I totally used uuidv7s as "inserted at" in a small project and I had methods to find records created between two timestamps that literally converted timestamps to uuidv7 values so I could do "WHERE id BETWEEN a AND b"
Hyrum's Law suggests that someone will.
So, random library: https://pkg.go.dev/github.com/google/uuid#UUID.Time
> Time returns the time in 100s of nanoseconds since 15 Oct 1582 encoded in uuid. The time is only defined for version 1, 2, 6 and 7 UUIDs.
What? The first 48 bits of an UUID7 are a UNIX timestamp.
Whether or not this is a meaningful problem or a benefit to any particular use of UUIDs requires thinking about it; in some cases it’s not to be taken lightly and in others it doesn’t matter at all.
I see what you’re getting at, that ignoring the timestamp aspect makes them “just better UUIDs,” but this ignores security implications and the temptation to partition by high bits (timestamp).
Specifically, if your database is small, the performance impact is probably not very noticeable. And if your database is large (eg. to the extent primary keys can't fit within 32-bit int), then you're actually going to have to think about sharding and making the system more distributed... and that's where UUID works better than auto-incrementing ints.
* I think there is a wide range in the middle where your database can fit on one machine if you do it well, but it's worth optimizing to use a cheaper machine and/or extend the time until you need to switch to a distributed db. You might hit this middle range soon enough (and/or it might be a painful enough transition) that it's worth thinking about it ahead of time.
* If/when you do switch to a distributed database, you don't always need to rekey everything:
** You can spread existing keys across shards via hashing on lookup or reversing bits. Some databases (e.g. DynamoDB) actually force this.
** Allocating new ids in the old way could be a big problem, but there are ways out. You might be able to switch allocation schemes entirely without clients noticing if your external keys are sufficiently opaque. If you went with UUIDv7 (which addresses some but not all of the article's points), you can just keep using it. If you want to keep using dense(-ish), (mostly-)sequential bigints, you can amortize the latency by reserving blocks at a time.
The solution is not to come up with yet another artificial identifier but to come up with better means of identification taking into account the fact that things change.
The identifier is still connected to the user's data, just through the appropriate other fields in the table as opposed to embedded into the identifier itself.
> So, what happens next is that the real world tries to adjust and the "data-less" identifier becomes a real world artifact. The situation becomes the same but worse (eg. you don't exist if you don't remember your social security id). In extreme cases people are tattooed with their numbers.
Using a random UUID as primary key does not mean users have to memorize that UUID. In fact in most cases I don't think there's much reason for it to even be exposed to the user at all.
You can still look up their data from their current email or phone number, for instance. Indexes are not limited to the primary key.
> The solution is not to come up with yet another artificial identifier but to come up with better means of identification taking into account the fact that things change.
A fully random primary key takes into account that things change - since it's not embedding any real-world information. That said I also don't think there's much issue with embedding creation time in the UUID for performance reasons, as the article is suggesting.
I think artificial and data-less identifiers are the better means of identification that takes into account that things change. They don't have to be the identifier you present to the world, but having them is very useful.
E.g. phone numbers are semi-common identifiers now, but phone numbers change owners for reasons outside of your control. If you use them as an internal identifier, changing them between accounts gets very messy because now you don't have an identifier for the person who used to have that phone number.
It's much cleaner and easier to adapt if each person gets an internal context-less identifier and you use their phone number to convert from their external ID/phone number to an internal ID. The old account still has an identifier, there's just no external identifier that translates to it. Likewise if you have to change your identifier scheme, you can have multiple external IDs that translate to the same internal ID (i.e. you can resolve both their old ID and their new ID to the same internal ID without insanity in the schema).
The surrogate key's purpose isn't to directly store the natural key's information, rather, it's to provide an index to it.
> The solution is not to come up with yet another artificial identifier but to come up with better means of identification taking into account the fact that things change.
There isn't 'another' - there's just one. The surrogate key. The other pieces of information you're describing are not the means of indexing the data. They are the pieces of data you wish to retrieve.
You need it. Because it's maybe one lone unchangeable thing. Taking person for example: * date of birth can be changed, if there was error and correction in documents * any and near all of existing physical characteristics can change over time, either due to brain things (deciding to change gender), aging, or accidents (fingerprints no longer apply if you burnt your skin enough) * DNA might be good enough, but that's one fucking long identifier to share and one hard to validate in field.
So an unique ID attached to few other parts to identify current iteration of individual is the best we have, and the best we will get.
I think IDs should not carry information. Yes, that also means I think UUIDv7 was wrong to squeeze a creation date into their ID.
Isn't that clear enough?
If you take the gender example, for 99% of people, it is male/female and it won't change, and you can use that for load balancing. But if later, you found out that the gender is not the one you expect for that bucket, no big deal, it will cause a branch misprediction, but instead of happening 50% of the times when you use a random value, it will only happen 1% of the times, significant speedup with no loss in functionality.
As long as you're not in China or India around specific years ...
GP's point stands strong.
> Norwegian PNs have your birth date (in DDMMYY format) as the first six digits
So presumably the format is DDMMYYXXXXX (for some arbitrary number of X's), where the XXX represents e.g. an automatically incrementing number of some kind?
Which means that if it's DDMMYYXXX then you can only have 1000 people born on DDMMYY, and if it's DDMMYYXXXXX then you can have 100,000 people born on DDMMYY.
So in order for there to be so many such entries in common that people are denied use of their actual birthday, then one of the following must be true:
1. The XXX counter must be extremely small, in order for it to run out as a result of people 'using up' those Jan 1 dates each year
2. The number of people born on Jan 1 or immigrating to Norway without knowledge of their birthday must be colossal
If it was just DDMMXXXXX (no year) then I can see how this system would fall apart rapidly, but when you're dealing with specifically "people born on Jan 1 2014 or who immigrated to Norway and didn't know their birthday and were born on/around 2014 so that was the year chosen" I'm not sure how that becomes a sufficiently large number to cause these issues. Perhaps this only occurs in specific years where huge numbers of poorly-documented refugees are accepted?
(Happy to be educated, as I must be missing something here)
I think it also is important to remember the purpose of specific numbers. For instance I would argue a PN without the birthday would be strictly worse. With the current system (I only know the Swedish one, but assume it's the same) I only have to remember a 4 digit (because the number is bdate + unique 4 digits). If we would instead use completely random numbers I would have to remember at least an 8 digit number (and likely to be future proof you'd want at least 9 digits). Sure that's fine for myself (although I suspect some people already struggle with it), but then I also have to remember the numbers for my 2 kids and my partner and things become quickly annoying. Especially, because one doesn't use the numbers often enough that it becomes easy, but still often enough that it becomes annoying to look up, especially when one doesn't always cary their phone with them.
Did you read the article? He doesn’t recommend natural keys, he recommends integer-based surrogates.
> A prime example of premature optimization.
Disagree. Data is sticky, and PKs especially so. Moreover, if you’re going to spend time optimizing anything early on, it should be your data model.
> Don't make decisions you will regret just to shave off a couple of milliseconds!
A bad PK in some databases (InnoDB engine, SQL Server if clustered) can cause query times to go from sub-msec to tens of msec quite easily, especially with cloud solutions where storage isn’t node-local. I don’t just mean a UUID; a BIGINT PK on a 1:M can destroy your latency for the simple reason of needing to fetch a separate page for every record. If instead the PK is a composite of (<linked_id>, id) - e.g. (user_id, id) - where id is a monotonic integer, you’ll have WAY better data locality.
Postgres suffers a different but similar problem with its visibility map lookups.
* integer keys are faster;
* uuidv7 keys are faster;
* if you want obfuscated keys, using integer and do some your own obfuscation (!!!).
I can get on-board of uuidv7 (with the trade-off, of course, on stronger guessability). The integer keys argument is strange. At that point, you need to come up with a custom-built system to avoid id collision in a distribution system and tries to achieve only 2x saving (the absolute minimal you should do is 64-bit keys). Very puzzling suggestion and to me very wrong.
Note that in this entire article, the recommendation is not about using natural keys (email address, some composite of user identification etc.), so I am skipping that whole discussion.
I am not a cryptographer, but I would want his recommendation reviewed by a cryptographer. And then I would have to implement it. UUIDs have been extensively reviewed by cryptographers, I have a variety of excellent implementations I can use, I know they solve the problem well. I know they can cause performance issues; they're a security feature that is easy to implement, and I can deal with the performance issues if and when they crop up. (Which, in my experience, it's unusual. Even at a large company, most databases I encounter do not have enough data. I will err on the side of security until it becomes a problem, which is a good problem to have.)
Like, it's a valid complaint.. just not for discussion at hand.
Also, we do live in reality and while having entirely random one might be perfect from theory of data, in reality having it be prefixed by date have many advantages performance wise.
> Permanent identifiers should not carry data. This is like the cardinal sin of data management
As long as you don't use the data and have actual fields for what's also encoded in UUID, there is absolutely nothing wrong with it, provided there is enough of the random part to get around artifacts in real life data.
Same with Austrian social security numbers, which, in somes cases, don't contain the persons birth date and in some cases don't contain any existing date at all.
Yet many websites enforce a valid date and pull the persons birthdate from it...
I guess that Norway has solved it in the same or similar way as Sweden? So a person is identified by the PNR and for those systems that need to track a person over several PNR (government agencies) use PRI. And a PRI is just the first PNR assigned to a person with a 1 inserted in the middle. If that PRI is occupied, use a 2,and so on.
PRI could of course have been a UUID instead.
Someone should have told Julius Caesar and Gregory XIII that :-p
I think you're attacking a straw man. The article doesn't say "instead of UUIDv4 primary keys, use keys such as birthdays with exposed semantic meaning". On the contrary, they have a section about how to use sequence numbers internally but obfuscated keys externally. (Although I agree with dfox's and formerly_proven's comments [1, 2] that XOR method they proposed for this is terrible. Reuse of a one-time pad is probably the most basic textbook example of bad cryptography. They referred to the values as "obfuscated" so they probably know this. They should have just gone with a better method instead.)
Do you have the same criticism for serial identifiers? How about hashes? What about the version field in UUIDs?
It's workload-specific, too. If you want to list ranges of them by PK, then of course random isn't going to work. But then you've got competing tensions: listing a range wants the things you list to be on the same shard, but focusing a workload on one shard undermines horizontal scale. So you've got to decide what you care about (or do something more elaborate).
UUIDv7 is very useful for these scenarios since
A: A hash or modulus of the key will be practically random due to the lower bits being random or pseudo-random (ie distributes well between nodes)
B: the first bits are sortable.. thus the underlying storage on each node won't go bananas.
Why?
This is just another case of keys containing information and is not smart.
The obvious solution is to have a field that drives distribution, allowing rebalancing or whatever.
As a consumer of these databases we're stuck with them as designed, which means we have to worry about key distribution.
Now this doesn't work if you actually have enough data that the randomness of the UUIDv4 keys is a practical database performance issue, but I think you really have to think long and hard about every single use of identifiers in your application before concluding that v7 is the solution. Maybe v7 works well for some things (e.g identifiers for resources where creation times are visible to all with access to the resource) but not others (such as users or orgs which are publicly visible but without publicly visible creation times).
I've read people suggest using a UUIDv7 as the primary key and a UUIDv4 as a user-visible one as a remedy.
My first thought when reading the suggestion was, "well but you'll still need an index on the v4 IDs, so what does this actually get you?" But the answer is that it makes joins less expensive; you only require the index once, when constructing the query from the user-supplied data, and everything else operates with the better-for-performance v7 IDs.
To be clear, in a practical sense, this is a bit of a micro-optimization; as far as I understand it, this really only helps you by improving the data locality of temporally-related items. So, for example, if you had an "order items" table, containing rows of a bunch of items in an order, it would speed up retrieval times because you wouldn't need to do as many index traversals to access all of the items in a particular order. But on, say, a users table (where you're unlikely to be querying for two different users who happen to have been created at approximately the same time), it's not going to help you much. Of course the exact same critique is applicable to integer IDs in those situations.
Although, come to think of it, another advantage of a user-visible v4 with v7 Pk is that you could use a different index type on the v4 ID. Specifically, I would think that a hash index for the user-visible v4 might be a halfway-decent way to go.
I'm still not sure either way if I like the idea, but it's certainly not the craziest thing I've ever heard.
See perhaps "UUIDv47 — UUIDv7-in / UUIDv4-out (SipHash‑masked timestamp)":
* https://github.com/stateless-me/uuidv47
* Sept 2025: https://news.ycombinator.com/item?id=45275973
(What do you think Youtube video IDs are?)
I shared this article a few weeks ago, discussing the problems with this kind of approach: https://notnotp.com/notes/do-not-encrypt-ids/
I believe it can make sense in some situations, but do you really want to implement such crypto-related complexity?
I actually haven no idea. What are they?
(Also what is the format of their `si=...` thing?)
Yes of course everyone should check and unit test that every object is owned by the user or account loading it, but demanding more sophistication from an attacker than taking "/my_things/23" and loading "/my_things/24" is a big win.
In our case, we don't want database IDs in an API and in URLs. When IDs are sequential, it enables things like dictionary attacks and provides estimates about how many customers we have.
Encrypting a database ID makes it very obvious when someone is trying to scan, because the UUID won't decrypt. We don't even need a database round trip.
* How do you manage the key for encrypting IDs? Injected to app environment via envvar? Just embedded in source code? I ask this because I'm curious as to how much "care" I should be putting in into managing the secret material if I were to adopt this scheme.
* Is the ID encrypted using AEAD scheme (e.g. AES-GCM)? Or does the plain AES suffice? I assume that the size of IDs would never exceed the block size of AES, but again, I'm not a cryptographer so not sure if it's safe to do so.
The same way we manage all other secrets in the application. (Summarized below)
> Is the ID encrypted using AEAD scheme (e.g. AES-GCM)? Or does the plain AES suffice? I assume that the size of IDs would never exceed the block size of AES, but again, I'm not a cryptographer so not sure if it's safe to do so.
I don't have the source handy at the moment. It's one of the easier to use symmetric algorithms available in .Net. We aren't talking military-grade security here. In general: a 32-bit int encrypts to 64-bits, so we pad it with a few unicode characters so it's 64-bits encrypted to 128 bits.
---
As far as managing secrets in the application: We have a homegrown configuration file generator that's adapted to our needs. It generates both the configuration files, and strongly-typed classes to read the files. All configuration values are loaded at startup, so we don't have to worry about runtime errors from missing configuration values.
Secrets (connection strings, encryption keys, ect,) are encrypted in the configuration file as base64 strings. The certificate to read/write secrets are stored in Azure Keyvault.
The startup logic in all applications is something like:
1: Determine the environment (production, qa, dev)
2: Get the appropriate certificate
3: Read the configuration files, including decrypting secrets (such as the primary key encryption keys) from the configuration files
4: Populate the strongly-typed objects that hold the configuration values
5: These objects are dependency-injected to runtime objects
We're not worried about key compromises.
If the key is lost, we have much bigger problems.
> Random values don’t have natural sorting like integers or lexicographic (dictionary) sorting like character strings. UUID v4s do have "byte ordering," but this has no useful meaning for how they’re accessed.
Might the author mean that random values are not sequential, so ordering them is inefficient? Of course random values can be ordered - and ordering by what he calls "byte ordering" is exactly how all integer ordering is done. And naive string ordering too, like we would do in the days before Unicode.But the author does not say timestamp ordering, he says ordering. I think he actually means and believes that there is some problem ordering UUIDv4.
Just to complement this with a point, but there isn't any mainstream database management system out there that is distributed on the sense that it requires UUIDs to generate its internal keys.
There exist some you can find on the internet, and some institutions have internal systems that behave this way. But as a near universal rule, the thing people know as a "database" isn't distributed on this sense, and if the column creation is done inside the database, you don't need them.
Auto-incrementing keys can work, but what happens when you run out of integers? Also, distributed dbs probably make this hard, and they can't generate a key on client.
There must be something in Postgres that wants to store the records in PK order, which while could be an okay default, I'm pretty sure you can this behavior, as this isn't great for write-heavy workloads.
Accessing data in totally random locations can be a performance issue.
Depends on lots of things ofc but this is the concern when people talk about UUID for primary keys being an issue.
Values of the same type can be sorted if a order is defined on the type.
It's also strange to contrast "random values" with "integers". You can generate random integers, and they have a "sorting" (depending on what that means though)
The ability to rapidly shard everything can be extremely valuable. The difference between "we can shard on a dime" and "sharding will take a bunch of careful work" can be expensive If the company has poor margins, this can be the difference between "can scale easily" and "we're not getting this investment".
I would argue that if your folks have the technical chops to be able to shard while avoiding surrogate guaranteed unique keys, great. But if they don't.... a UUID on every table can be a massive get-out-of-jail free card and for many companies this is much, much important than some minor space and time optimizations on the DB.
Worth thinking about.
Definitely an issue, rarely the main one. You can work around integer PKs with composite keys or offset-based sharding schemes. What you can't easily fix is a schema with cross-tenant foreign keys, shared lookup tables, or a tenancy model that wasn't designed for data isolation from day one. Those are architectural decisions that require months of migration work.
UUIDs buy you flexibility, sure. But if your data model assumes everything lives in one database, the PK type is a sub-probem of your problems.
Integer PKs were seen as fine for years - decades, even - before the rise of UUIDs.
I'm sure every dba has a war story that starts with similar decision in the past
If you're using latest version of PG, there is a plugin for it.
That's it.
"Recommendation: Stick with sequences, integers, and big integers"
After that then, yes, UUIDv7 over UUIDv4.
This article is a little older. PostgreSQL didn't have native support so, yeah, you needed an extension. Today, PostgreSQL 18 is released with UUIDv7 support... so the extension isn't necessary, though the extension does make the claim:
"[!NOTE] As of Postgres 18, there is a built in uuidv7() function, however it does not include all of the functionality below."
What those features are and if this extension adds more cruft in PostgreSQL 18 than value, I can't tell. But I expect that the vast majority of users just won't need it any more.
0: https://bsky.app/profile/hugotunius.se/post/3m7wvfokrus2g
1: https://github.com/postgres/postgres/blob/master/src/backend...
I have slightly different goals for my version. I want everything to fit in 128 bits so I'm sacrificing some of the random bits, I'm also making sure the representation inside Postgres is also exactly 128 bits. My initial version ended up using CBOR encoding and being 160 bits.
Mine dedicates 16 bits for the prefix allowing up to 3 characters (a-z alphabet).
id => 123, public_id => 202cb962ac59075b964b07152d234b70
There are many ways to generate the public_id. A simple MD5 with a salt works quite well for extremely low effort.
Add a unique constraint on that column (which also indexes it), and you'll be safe and performant for hundreds of millions of rows!
Why do we developers like to overcomplicate things? ;)
For Aurora MySQL, it just makes it worse either way, since there’s no change buffer.
The ability to know ahead of time what a primary key will be (in lieu of persisting it first, then returning) opened up a whole new world of architecting work in my app. It made a lot of previous awkward things feel natural.
PlanetScale wrote up a really good article on why incrementing primary keys are better for performance when compared to randomly inserted primary keys; when it comes to b-tree performance. https://planetscale.com/blog/btrees-and-database-indexes
It is not just about being hard to guess a valid individual identifier in vacuum. Random (or at least random-ish) values, be they UUIDs or undecorated integers, in this context are also about it being hard to guess one from another, or a selection of others.
Wrt: "it isn't x it is y" form: I'm not an LLM, 'onest guv!
While that is often neat solution, do not do that by simply XORing the numbers with constant. Use a block cipher in ECB mode (If you want the ID to be short then something like NSA's Speck comes handy here as it can be instantiated with 32 or 48 bit block).
And do not even think about using RC4 for that (I've seen that multiple times), because that is completely equivalent to XORing with constant.
[0] https://www.cybertec-postgresql.com/en/unexpected-downsides-...
Are there valid reasons to use UUID (assuming correctly) for primary key? I know systems have incorrectly expose primary key to the public, but assuming that's not the concern. Why use UUID over big-int?
UUIDv7 looks really promising but I'm not likely to redo all of our tables to use it.
Main reason I use it is the German Tank problem: https://en.wikipedia.org/wiki/German_tank_problem
(tl;dr; prevent someone from counting how many records you have in that table)
This way you avoid most of the issues highlighted in this article, without compromising your confidential data.
https://cloud.google.com/blog/products/databases/announcing-...
Aren't people using (big)ints are primary keys, and using UUIDs as logical keys for import/export, solving portability across different machines?
Consider say weather hardware. 5 stations all feeding into a central database. They're all creating rows and uploading them. Using sequential integers for that is unnecessarily complex (if even possible.)
Given the amount of data created on phones and tablets, this affects more situations than first assumed.
It's also very helpful in export / edit / update situations. If I export a subset of the data (let's say to Excel), the user can edit all the other columns and I can safely import the result. With integer they might change the ID field (which would be bad). With uuid they can change it, but I can ignore that row (or the whole file) because what they changed it to will be invalid.
I wrote a CRUD app for document storage. It had user id's and document id's. I wrote a method GetDocumentForUser(docID, userID) that checked permissions for that user and document and returned the document if permitted. I then, stupidly, called that method with GetDocumentForUser(userID, docID), and it took me a good half hour to work out why this never returned anything.
It never returned anything because a valid userID will never be a valid docID. If I had used integers it would have returned documents, and I probably wouldn't have spotted it while testing, and I would have shipped a change that cheerfully handed people other people's documents.
I will put up with a fairly considerable amount of performance hit to avoid having this footgun lurking. And yes, I know there are other ways around this (e.g. types) but those come with their own trade-offs too.
Of course this is not always the case that is bad, for example if you have a lot of relations you can have only one table where you have the UUID field (and thus expensive index), and then the relations could use the more efficient int key for relations (for example you have an user entity with both int and uuid keys, and user attribute references the user with the int key, of course at the expense of a join if you need to retrieve one user attribute when retrieving the user is not needed).
original answer: because if you dont come up with these ints randomly they are sequential which can cause many unwanted situations where people can guess valid IDs and deduce things from that data. See https://en.wikipedia.org/wiki/German_tank_problem
Edit: just saw your edit, sounds like we're on the same page!
One more reason to stay away from microservices, if possible.
And of course we could come up with many ways to generate our own ids and make them unique, but we have the following requirements.
- It needs to be a string (because we allow composing them to 'derive' keys) - A client must be able to create them (not just a server) without risk for collisions - The time order of keys must not be guessable easily (as the id is often leaked via references which could 'betray' not just the existence of a document, but also its relative creation time wrt others). - It should be easy to document how any client can safely generate document ids.
The lookup performance is not really such a big deal for us. Where it is we can do a projection into a more simple format where applicable.
But that also adds complexity.
- If you use uuids as foreign keys to another table, it’s obvious when you screw up a join condition by specifying the wrong indices. With int indices you can easily get plausible looking results because your join will still return a bunch of data
- if you’re debugging and need to search logs, having a simple uuid string is nice for searching
Even worse, some tools generate random strings and then ADD hyphens in them to look like UUID (even thought it's not, as the UUID version byte is filled randomly as well), cannot wrap my head why, e.g:
https://github.com/VictoriaMetrics/VictoriaLogs/blob/v1.41.0...
(in the scientific reporting world this would be the perennial "in mice")
It would be the equivalent of "if you're a middle-aged man" or "you're an American".
P.S. I think some of the considerations may be true for any system that uses B-Tree indexes, but several will be Postgres specific.
If you use UUIDv7, you can partition your table by the key prefix. Then the bulk of your data can be efficiently skipped when applying updates.
Just the other day I delivered significant performance gains to a client by converting ~150 million UUIDv4 PKs to good old BIGINT. They were using a fairly recent version of MariaDB.
I was using 64-bit snowflake pks (timestamp+sequence+random+datacenter+node) previously and made the switch to UUID7 for sortable, user-facing, pks. I'm more than fine letting the DB handle a 128-bit int vs over a 64-bit int if it means not having make sure that the latest version of my snowflake function has made it to every db or that my snowflake server never hiccups, ever.
Most of the data that's going to be keyed with a uuid7 is getting served straight out of Redis anyway.
> Misconceptions: UUIDs are secure
> One misconception about UUIDs is that they’re secure. However, the RFC describes that they shouldn’t be considered secure “capabilities.”
> From RFC 41221 Section 6 Security Considerations:
> Do not assume that UUIDs are hard to guess; they should not be used as security capabilities
This is just wrong, and the citation doesn't support it. You're not guessing a 122-bit long random identifier. What's crazy is that the article, immediately prior to this, even cites the very math involved in showing exactly how unguessable that is.
… the linked citation (to §4.4, which is different from the in-prose citation) is just about how to generate a v4, and completely unrelated to the claim. The prose citation to §6 is about UUIDs generally: the statement "Do not assume that [all] UUIDs are hard to guess" is not logically inconsistent with properly-generated UUIDv4s being hard to guess. A subset of UUIDs have security properties, if the system generating & using them implements those properties, but we should not assume all UUIDs have that property.
Moreover, replacing an unguessable UUID with an (effectively random) 32-bit integer does make it guessable, and the scheme laid out seems completely insecure if it is to be used in the contexts one finds UUIDv4s being an unguessable identifier.
The additional size argument is pretty weak too; at "millions of rows", a UUID column is consuming an additional ~24 MiB.
The UUID is 128bits. The first 64bits are a java long. The last 64bits are a java long. Let's just combine the Tenant ID long with a Resource ID long to generate a unique id for this on our platform. (worked until it didn't).
We have all faced issues where we don’t know where the data will ultimate live that’s optimal for our access patterns.
Or we have devices and services doing async operations that need to sync.
I’m not working on mission critical “if this fails there’s a catastrophic event” type shit. It’s just rent seeking SaaS type shit.
Oh no it cost $0.35 extra to make $100. Next year will make more money relative to cost increase.
Uuid4 are only 224bits is a bs argument. Such a made up problem.
But a fair point is that one should use a sequential uuid to avoid fragmentation. One that has a time part.
- A client used to run our app on-premises and now wants to migrate to the cloud.
- Support engineers want to clone a client’s account into the dev environment to debug issues without corrupting client data.
- A client wants to migrate their account to a different region (from US to EU).
Merging data using UUIDs is very easy because ID collisions are practically impossible. With integer IDs, we'd need complex and error-prone ID-rewriting scripts. UUIDs are extremely useful even when the tables are small, contrary to what the article suggests.
Also is it necessary to show uuid at all to customers of an API? Or could it be a valid pattern to hide all the querying complexity behind named identifiers, even if it could cost a bit in terms of joining and indexing?
The context is the classic B2B SaaS, but feel free to share your experiences even if it comes from other scenarios!
Wrong. Don't use B-Tree for random indexes, there's HASH index exactly for this:
CREATE INDEX [index_name] ON [table_name] USING HASH ([column_name]);The power and main purpose of UUIDs is to act as easy to produce, non-conflicting references in distributed settings. Since the scope of TFA is explicitly set to be "monolithic web apps", nothing stops you from having everything work with bigint PKs internally, and just add the UUIDs where you need to provide external references to rows/objects.
I would like to hear from others using, for example, Google Spanner, do you have issues with UUID. I don't for now, most optimizations happen at the Controller level, data transformation can be slow due to validations. Try to keep service logic as straightforward as possible.
Hash index is ideally suited for UUIDs but for some reason Postgres hash indexes cannot be unique.
So far, people have talked a lot about UUIDs, so I'm genuinely curious about what's in-between.
Sometimes I have to talk to legacy systems, all my APIs have str IDs, and I encode int IDs as just decimal left padded with leading zeros up to 26 chars. Technically not a compliant ULID but practically speaking, if I see leading `00` I know it's not an actual ULID, since that would be before Nov-2004, and ULID was invented in 2017. The ORM automatically strips the zeros and the query just works.
I'm just kind of over using sequential int IDs for anything bigger than hobby level stuff. Testing/fixturing/QA are just so much easier when you do not have to care about whether an ID happens to already exist.
> the impact to inserts and retrieval of individual items or ranges of values from the index.
Classic OLTP vs OLAP.
Yes there are different flavors of generating them with their own pros and cons, but at the end of the day it’s just so much more elegant than some auto incrementing crap your database creates. But that is just semantic, you can always change the uuid algorithm for future keys. And honestly if you treat the uuid as some opaque entity (which you should), why not just pick the random one?
And I just thought of the argument that “but what if you want to sort the uuid…” say it’s used for a list of stories or something? Well, again… if you treat the uuid as opaque why would you sort it? You should be sorting on some other field like the date field or title or something. UUIDs are opaque, damn it. You don’t sort opaque data. “Well they get clustered weird” say people. Why are you clustering on a random opaque key? If you need certain data to be clustered, then do it on the right key (user_id field did your data was to be clustered by user, say)
Letting the client generate the primary keys is really liberating. Not having to care about PK collisions or leaking information via auto incrementing numbers is great!
In my opinion uuid isn’t used enough!
Balanced and uniformly scattered. A random index means fetching a random page for every item. Fine if your access patterns are truly random, but that's rarely the case.
> Why are you clustering on a random opaque key?
InnoDB clusters by the PK if there is one, and that can't be changed (if you don't have a PK, you have some options, but let's assume you have one). MSSQL behaves similarly, but you can override it. If your PK is random, your clustering will be too. In Postgres, you'll just get fragmented indexes, which isn't quite as bad, but still slows down vacuum. Whether that actually becomes a problem is also going to depend on access patterns.
One shouldn't immediately freak out over having a random PK, but should definitely at least be aware of the potential degradation they might cause.
> We need to read blocks from the disk when they are not in the PostgreSQL buffer cache. Conveniently, PostgreSQL makes it very easy to inspect the contents of the buffer cache. This is where the big difference between uuidv4 and uuidv7 becomes clear. Because of the lack of data locality in uuidv4 data, the primary key index is consuming a huge amount of the buffer cache in order to support new data being inserted – and this cache space is no longer available for other indexes and tables, and this significantly slows down the entire workload.
Guess how much I have to defend against attackers trying 538-62-xxxx
And I'd likely want a unique constraint on that?
The issue is that is true for more or less all capability URLs. I wouldn't recommend UUIDs per se here, probably better to just use a random number. I have seen UUIDs for this in practice though and these systems weren't compromised because of that.
I hate the tendency that password recovery flows for example leave the URL valid for 5 minutes. Of course these URLs need to have a limited life time, but mail isn't a real time communication medium. There is very little security benefit from reducing it from 30 minutes to 5 minutes for example. You are not getting "securer" this way.
I'm tired of midwit arguments like "Tech X is N% faster than tech Y at performing operation Z. Since your system (sometimes) performs operation Z, it implies that Tech X is the only logical choice in all situations!"
It's an infuriatingly silly argument because operation Z may only represent about 10% of the total CPU usage of the whole system (averaged out)... So what is promoted as a 50% gain may in fact be a 5% gain when you consider it in the grand scheme of things... Negligible. If everyone was looking at this performance 'advantage' rationally; nobody would think it's worth sacrificing important security or operational properties.
I don't know what happened to our industry; we're supposed to be intelligent people but I see developers falling for these obvious logical fallacies over and over.
I remember back in my day, one of the senior engineers was discussing upgrading a python system and stated openly that the new version of the engine was something like 40% slower than the old version but he didn't even have to explain himself why upgrading was still a good decision; everybody in the company knew he was only talking about the code execution speed and everybody knew that this was a small fraction of the total.
Not saying UUIDv7 was a bad choice for Postgres. I'm sure it's fine for a lot of situations but you don't have to start a cult preaching the gospel of The One True UUID to justify your favorite project's decisions.
I do find it kind of sly though how the community decided to make this UUIDv7 instead of creating a new standard for it.
The whole point of UUID was to leverage the properties of randomness to generate unique IDs without requiring coordination. UUIDv7 seems to take things in a philosophically different path. People chose UUID for scalability and simplicity (both of which you get as a result of doing away with the coordination overhead), not for raw performance...
That's the other thing which drives me nuts; people who don't understand the difference between performance and scalability. People foolishly equate scalability with parallelism or concurrency; whereas that's just one aspect of it; scalability is a much broader topic. It's the difference between a theoretical system which is fast given a certain artificially small input size and one which actually performs better as the input size grows.
Lastly; no mention is made about the complex logic which has to take place behind the scenes to generate UUIDv7 IDs... People take it for granted that all computers have a clock which can produce accurate timestamps where all computers in the world are magically in-sync... UUIDv7 is not simple; it's very complicated. It has a lot of additional complexity and dependencies compared to UUIDv4. Just because that complexity is very well hidden from most developers, doesn't mean it's not there and that it's not a dependency... This may become especially obvious as we move to a world of robotics and embedded systems where cheap microchips may not have enough Flash memory to hold the code for the kinds of programs required to compute such elaborate IDs.