Avoid UUID Version 4 Primary Keys in Postgres (opens in new tab)

(andyatkinson.com)

377 pointspil0u5mo ago449 comments

449 comments

A prime example of premature optimization.

Permanent identifiers should not carry data. This is like the cardinal sin of data management. You always run into situations where the thing you thought, "surely this never changes, so it's safe to squeeze into the ID to save a lookup". Then people suddenly find out they have a new gender identity, and they need a last final digit in their ID numbers too.

Even if nothing changes, you can run into trouble. Norwegian PNs have your birth date (in DDMMYY format) as the first six digits. Surely that doesn't change, right? Well, wrong, since although the date doesn't change, your knowledge of it might. Immigrants who didn't know their exact date of birth got assigned 1. Jan by default... And then people with actual birthdays on 1 Jan got told, "sorry, you can't have that as birth date, we've run out of numbers in that series!"

Librarians in the analog age can be forgiven for cramming data into their identifiers, to save a lookup. When the lookup is in a physical card catalog, that's somewhat understandable (although you bet they could run into trouble over it too). But when you have a powerful database at your fingertips, use it! Don't make decisions you will regret just to shave off a couple of milliseconds!

ralferoo5mo ago

> Norwegian PNs have your birth date (in DDMMYY format) as the first six digits. Surely that doesn't change, right? Well, wrong, since although the date doesn't change, your knowledge of it might. Immigrants who didn't know their exact date of birth got assigned 1. Jan by default... And then people with actual birthdays on 1 Jan got told, "sorry, you can't have that as birth date, we've run out of numbers in that series!"

To me, what your example really shows is the problem with incorrect default values, not a problem with encoding data into a key per se. If they'd chosen a non-date for unknown values, maybe 00 or 99 for day or month components, then the issue you described would disappear.

But in case, the intention for encoding a timestamp into a UUID isn't for any implied meaning. It's both to guarantee uniqueness with a side effect that IDs are more or less monotonically increasing. Whether this is actually desirable depends on your application, but generally if the application is as a indexed key for insertion into a database, it's usually more useful for performance than a fully random ID as it avoids rewriting lots of leaf-nodes of B-trees. If you insert a load of these such keys, it forms a cluster on one side of the tree that can the rebalance with only the top levels needing to be rewritten.

nonethewiser5mo ago

>To me, what your example really shows is the problem with incorrect default values, not a problem with encoding data into a key per se. If they'd chosen a non-date for unknown values, maybe 00 or 99 for day or month components, then the issue you described would disappear.

You still have that problem from organic birthdays and also the problem of needing to change ids to correct birth dates.

jonny_eh5mo ago

To add to that, birthdays can clump, just like any seemingly "random" data.

2 more replies

tyre5mo ago

And then have to enter/handle a non-date through all systems? How do you know if this non-dated person is over the age of minority? Eligible for a pension?

Maybe the answer is to evenly spread the defaults over 365 days.

ralferoo5mo ago

If you don't know their birthday, you can presumably never answer that question in any case.

If you only know the birth year and keyed 99 as the month for unknown, then your algorithm would determine they were of a correct age on the start of the year after that was true, which I guess would be what you want for legal compliance.

If you don't even know if the birth year is correct, then the correct process depends on policy. Maybe they choose any year, maybe they choose the oldest/youngest year they might be, maybe they just encode that as 0000/9999.

Again, if you don't know the birth year of someone, you would have no way of knowing their age. I'm not sure that means that the general policy of putting a birthday into their ID number is flawed.

Many governments re-issue national IDs to the same person with different numbers, which is far less problematic that the many governments who choose to issue the same national ID (looking at you USA with your SSN) to multiple individuals. It doesn't seem like a massive imposition on a person who was originally issued an ID based on not knowing when their birthday to be re-issued a new ID when their birthday was ascertained. Perhaps even give them a choice of keeping the old one knowing it will cause problems, or take the new one instead and having the responsibility to tell people their number had changed.

Presumably the governments that choose to embed the date into a national ID number do so because it's more useful for their purposes to do so than just assigning everyone a random number.

1 more reply

PunchyHamster5mo ago

> To me, what your example really shows is the problem with incorrect default values, not a problem with encoding data into a key per se. If they'd chosen a non-date for unknown values, maybe 00 or 99 for day or month components, then the issue you described would disappear.

well, till you run out of numbers for the immigrants that don't have exact birth date

OptionOfT5mo ago

Belgium's national register number is similar:

YY.MM.DD-AAA.BB

In either the AAA or BB component there is something about the gender.

But it does mean that there is a limit of people born per day of a certain gender.

But for a given year, using a moniker will only delay the inevitable. Sure, there are more numbers, but still limited as there are SOME parts that need to reflect reality. Year, gender (if that's still the case?) etc.

hyperman15mo ago

BB is a mod-97 checksum. The first A of AAA encodes your gender in an even/odd fashion, I forgot if its the first or last A doing that. MM or DD can be 00 if unknown. Also MM has +20 or +40 in some cases.

If you know someones birth date and gender, the INSZ is almost certainly 1 in 500 numbers, with a heavy skew to the lower AAA. Luckily, you can't do much damage with someones number,unlike an USA SSN (but I'd still treat it confidential).

1 more reply

croes5mo ago

> If they'd chosen a non-date for unknown values, maybe 00 or 99 for day or month components, then the issue you described would disappear

> Norwegian PNs have your birth date (in DDMMYY format) as the first six digits.

You can already feel the disaster rising because sone program expects always the latter.

And it doesn’t fix the problem, it just makes it less likely.

tacone5mo ago

Fantastic real life example. Italian PNs carry also the gender, which something you can change surgically, and you'll eventually run into the issue when operating at scale.

I don't agree with the absolute statement, though. Permanent identifiers should not generally carry data. There are situations where you want to have a way to reconciliate, you have space or speed constraints, so you may accept the trade off, md5 your data and store it in a primary index as a UUID. Your index will fragment and thus you will vacuum, but life will still be good overall.

mckirk5mo ago

I'm not sure whether that was intended, but 'operating at scale' actually made me laugh out loud :D

benterix5mo ago

I have to admit an unintended chuckle, too.

cozyman5mo ago

how does one change their gender surgically?

delichon5mo ago

You can't, but since gender isn't defined by anything physical, there's no need.

1 more reply

WalterSlovotsky5mo ago

The preferred method would be gender affirming surgery.

1 more reply

pelheit5mo ago

I think he means attempting to mimic some features of the opposite sex, with procedures like penile inversion and so on.

barrkel5mo ago

Uuid v7 just has a bias in its generation; it isn't carrying information. You're not going to try and extract a timestamp from a uuid.

Random vs time biased uuids are not a decision to shave off ms that you will regret.

Most likely they will be a decision that shaves off seconds (yes, really - especially when you consider locality effects) and you'll regret nothing.

duckerude5mo ago

I've worked on a system where ULIDs (not UUIDv7, but similar) were used with a cursor to fetch data in chronological order and then—surprise!—one day records had to be backdated, meaning that either the IDs for those records had to be counterfeited (potentially violating invariants elsewhere) or the fetching had to be made smarter.

You can choose to never make use of that property. But it's tempting.

voidnap5mo ago

I made a service using something like a 64 bit wide ULID but there was never a presumption that data is be inserted or updated earlier than the most recent record.

If the domain is modeling something like external events (in my case), and that external timestamp is packed into your primary key, and you support receiving events out of chronological order, then it just follows that you might insert stuff ealrier than you latest record.

You're gonna have problems "backdating" if you mix up time of insertion with when the event you model actually ocurred. Like id you treat those as the same thing when they aren't.

tobyhinloopen5mo ago

> You're not going to try and extract a timestamp from a uuid.

I totally used uuidv7s as "inserted at" in a small project and I had methods to find records created between two timestamps that literally converted timestamps to uuidv7 values so I could do "WHERE id BETWEEN a AND b"

jandrewrogers5mo ago

> You're not going to try and extract a timestamp from a uuid.

Hyrum's Law suggests that someone will.

ncruces5mo ago

> You're not going to try and extract a timestamp from a uuid.

So, random library: https://pkg.go.dev/github.com/google/uuid#UUID.Time

> Time returns the time in 100s of nanoseconds since 15 Oct 1582 encoded in uuid. The time is only defined for version 1, 2, 6 and 7 UUIDs.

bri3d5mo ago

> You're not going to try and extract a timestamp from a uuid.

What? The first 48 bits of an UUID7 are a UNIX timestamp.

Whether or not this is a meaningful problem or a benefit to any particular use of UUIDs requires thinking about it; in some cases it’s not to be taken lightly and in others it doesn’t matter at all.

I see what you’re getting at, that ignoring the timestamp aspect makes them “just better UUIDs,” but this ignores security implications and the temptation to partition by high bits (timestamp).

nine_k5mo ago

Nobody forces you to use a real Unix timestamp. BTW the original Unix timestamp is 32 bits (expiring in 2038), and now everyone is switching to 64-bit time_t. What 48 bits?

All you need is a guaranteed non-decreasing 48-bit number. A clock is one way to generate it, but I don't see why a UUIDv7 would become invalid if your clock is biased, runs too fast, too slow, or whatever. I would not count on the first 48 bits being a "real" timestamp.

1 more reply

hnfong5mo ago

The curious thing about the article is that, it's definitely premature optimization for smaller databases, but when the database gets to the scale where these optimizations start to matter, you actually don't want to do what they suggest.

Specifically, if your database is small, the performance impact is probably not very noticeable. And if your database is large (eg. to the extent primary keys can't fit within 32-bit int), then you're actually going to have to think about sharding and making the system more distributed... and that's where UUID works better than auto-incrementing ints.

scottlamb5mo ago

I agree there's a scale below which this (or any) optimization matters and a scale above which you want your primary key to have locality (in terms of which shard/tablet/... is responsible for the record). But...

* I think there is a wide range in the middle where your database can fit on one machine if you do it well, but it's worth optimizing to use a cheaper machine and/or extend the time until you need to switch to a distributed db. You might hit this middle range soon enough (and/or it might be a painful enough transition) that it's worth thinking about it ahead of time.

* If/when you do switch to a distributed database, you don't always need to rekey everything:

** You can spread existing keys across shards via hashing on lookup or reversing bits. Some databases (e.g. DynamoDB) actually force this.

** Allocating new ids in the old way could be a big problem, but there are ways out. You might be able to switch allocation schemes entirely without clients noticing if your external keys are sufficiently opaque. If you went with UUIDv7 (which addresses some but not all of the article's points), you can just keep using it. If you want to keep using dense(-ish), (mostly-)sequential bigints, you can amortize the latency by reserving blocks at a time.

mkleczek5mo ago

This is actually a very deep and interesting topic. Stripping information from an identifier disconnects a piece of data from the real world which means we no longer can match them. But such connection is the sole purpose of keeping the data in the first place. So, what happens next is that the real world tries to adjust and the "data-less" identifier becomes a real world artifact. The situation becomes the same but worse (eg. you don't exist if you don't remember your social security id). In extreme cases people are tattooed with their numbers.

The solution is not to come up with yet another artificial identifier but to come up with better means of identification taking into account the fact that things change.

Ukv5mo ago

> Stripping information from an identifier disconnects a piece of data from the real world which means we no longer can match them. But such connection is the sole purpose of keeping the data in the first place.

The identifier is still connected to the user's data, just through the appropriate other fields in the table as opposed to embedded into the identifier itself.

> So, what happens next is that the real world tries to adjust and the "data-less" identifier becomes a real world artifact. The situation becomes the same but worse (eg. you don't exist if you don't remember your social security id). In extreme cases people are tattooed with their numbers.

Using a random UUID as primary key does not mean users have to memorize that UUID. In fact in most cases I don't think there's much reason for it to even be exposed to the user at all.

You can still look up their data from their current email or phone number, for instance. Indexes are not limited to the primary key.

> The solution is not to come up with yet another artificial identifier but to come up with better means of identification taking into account the fact that things change.

A fully random primary key takes into account that things change - since it's not embedding any real-world information. That said I also don't think there's much issue with embedding creation time in the UUID for performance reasons, as the article is suggesting.

marcus_holmes5mo ago

> You can still look up their data from their current email or phone number, for instance. Indexes are not limited to the primary key.

This is the key point, I think. Searching is not the same as identifying.

mkleczek5mo ago

> Using a random UUID as primary key does not mean users have to memorize that UUID. In fact in most cases I don't think there's much reason for it to even be exposed to the user at all.

So what is such an identifier for? Is it only for some technical purposes (like replication etc.)?

Why bother with UUID at all then for internal identifiers? Sequence number should be enough.

2 more replies

everforward5mo ago

> The solution is not to come up with yet another artificial identifier but to come up with better means of identification taking into account the fact that things change.

I think artificial and data-less identifiers are the better means of identification that takes into account that things change. They don't have to be the identifier you present to the world, but having them is very useful.

E.g. phone numbers are semi-common identifiers now, but phone numbers change owners for reasons outside of your control. If you use them as an internal identifier, changing them between accounts gets very messy because now you don't have an identifier for the person who used to have that phone number.

It's much cleaner and easier to adapt if each person gets an internal context-less identifier and you use their phone number to convert from their external ID/phone number to an internal ID. The old account still has an identifier, there's just no external identifier that translates to it. Likewise if you have to change your identifier scheme, you can have multiple external IDs that translate to the same internal ID (i.e. you can resolve both their old ID and their new ID to the same internal ID without insanity in the schema).

mkleczek5mo ago

> I think artificial and data-less identifiers are the better means of identification that takes into account that things change. They don't have to be the identifier you present to the world, but having them is very useful.

If the only reason you need a surrogate key is to introduce indirection in your internal database design then sequence numbers are enough. There is no need to use UUIDs.

The whole discussion is about externally visible identifiers (ie. identifiers visible to external software, potentially used as a persistent long-term reference to your data).

> E.g. phone numbers are semi-common identifiers now, but phone numbers change owners for reasons outside of your control. If you use them as an internal identifier, changing them between accounts gets very messy because now you don't have an identifier for the person who used to have that phone number.

Introducing surrogate keys (regardless of whether UUIDs or anything else) does not solve any problem in reality. When I come to you and say "My name is X, this is my phone number, this is my e-mail, I want my GDPR records deleted", you still need to be able to find all data that is related to me. Surrogate keys don't help here at all. You either have to be able to solve this issue in the database or you need to have an oracle (ie. a person) that must decide ad-hoc what piece of data is identified by the information I provided.

The key issue here is that you try to model identifiable "entities" in your data model, while it is much better to model "captured information".

So in your example there is no "person" identified by "phone number" but rather "at timestamp X we captured information about a person at the time named Y and using phone number Z". Once you start thinking about your database as structured storage of facts that you can use to infer conclusions, there is much less need for surrogate keys.

2 more replies

brettgriffin5mo ago

The surrogate key's purpose isn't to directly store the natural key's information, rather, it's to provide an index to it.

> The solution is not to come up with yet another artificial identifier but to come up with better means of identification taking into account the fact that things change.

There isn't 'another' - there's just one. The surrogate key. The other pieces of information you're describing are not the means of indexing the data. They are the pieces of data you wish to retrieve.

mkleczek5mo ago

Any piece of information that can be used to retrieve something using this index has to be available "outside" your database - ie. to issue a query "give me piece of information identified by X" you have to know X first. If X is only available in your index then you must have another index to retrieve X based on some externally available piece of information Y. And then X becomes useless as an identifier - it just adds a level of indirection that does not solve any information retrieval problem.

That's my whole point: either X becomes a "real world artifact" or it is useless as identifier.

1 more reply

PunchyHamster5mo ago

Identifier is just "a piece of common token system can use to operate on same entity.

You need it. Because it's maybe one lone unchangeable thing. Taking person for example: * date of birth can be changed, if there was error and correction in documents * any and near all of existing physical characteristics can change over time, either due to brain things (deciding to change gender), aging, or accidents (fingerprints no longer apply if you burnt your skin enough) * DNA might be good enough, but that's one fucking long identifier to share and one hard to validate in field.

So an unique ID attached to few other parts to identify current iteration of individual is the best we have, and the best we will get.

vrighter5mo ago

You can't take into account the fact that things change when you don't know what those changes might be. You might end up needing to either rebuild a new database, have some painful migration, or support two codepaths to work with both types of keys.

mkleczek5mo ago

Network protocol designers know better and by default embed protocol version number in message format spec.

I guess you can assign 3-4 bits for identifier version number as well.

And yes - for long living data dealing with compatibility issues is inevitable so you have to take that into account from the very beginning.

1 more reply

groundzeros20155mo ago

You can’t design something by trying to anticipate all future changes. things will change and break.

In my personal design sense, I have found keeping away generality actually helps my code last longer (based on more concrete ideas) and easier to change when those days come.

1 more reply

hyperpape5mo ago

Your comment is sufficiently generic that it’s impossible to tell what specific part of the article you’re agreeing with, disagreeing with, or expanding upon.

vintermann5mo ago

I disagree that performance should be a reason to choose running numbers over guids until you absolutely have to.

I think IDs should not carry information. Yes, that also means I think UUIDv7 was wrong to squeeze a creation date into their ID.

Isn't that clear enough?

mcny5mo ago

That's the creation date of that guid though. It doesn't say anything about the entity in question. For example, you might be born in 1987 and yet only get a social security number in 2007 for whatever reason.

So, the fact that there is a date in the uuidv7 does not extend any meaning or significance to the record outside of the database. To infer such a relationship where none exists is the error.

3 more replies

hyperpape5mo ago

Those are two unrelated points and the connection between them was unclear in the original post.

hxtk5mo ago

When I think "premature optimization," I think of things like making a tradeoff in favor of performance without justification. It could be a sacrifice of readability by writing uglier but more optimized code that's difficult to understand, or spending time researching the optimal write pattern for a database that I could spend developing other things.

I don't think I should ignore what I already know and intentionally pessimize the first draft in the name of avoiding premature optimization.

barrkel5mo ago

UUID v7 doesn't squeeze creation date in. If you treat it as anything other than a random sequence in your applications, you're just wrong.

1 more reply

anamexis5mo ago

For what it’s worth, it was also completely unclear to me how you were responding to the article itself. It does not discuss natural keys at all.

GuB-425mo ago

I don't think the timestamped UUIDs are "carrying data", it is just a heuristic to improve lookup performance. If the timestamp is wrong, it will just run as slow as the non-timestamped UUID.

If you take the gender example, for 99% of people, it is male/female and it won't change, and you can use that for load balancing. But if later, you found out that the gender is not the one you expect for that bucket, no big deal, it will cause a branch misprediction, but instead of happening 50% of the times when you use a random value, it will only happen 1% of the times, significant speedup with no loss in functionality.

moralestapia5mo ago

>and you can use that for load balancing

As long as you're not in China or India around specific years ...

GP's point stands strong.

delecti5mo ago

As soon as you encode imperfect data in an immutable key, you always have to check when you retrieve it. If that piece of data isn't absolutely 100% guaranteed to be perfect, then you have to query both halves of the load balanced DB anyway.

benterix5mo ago

Your comment is valid but is not related to the article.

spoiler5mo ago

More broadly, this is the ages old surrogate vs natural key discussion, but yes the comment completely misses the point of the article. I can only assume they didn't read it in full!

vintermann5mo ago

The article explicitly argues against the use of GUIDs as primary keys, and I'm arguing for it.

A running number also carries data. Before you know it, someone's relying on the ordering or counting on there not being gaps - or counting the gaps to figure out something they shouldn't.

3 more replies

oncallthrow5mo ago

It sounds to me like you’re just arguing for premature optimization of another kind (specifically, prematurely changing your entire architecture for edge cases that probably won’t ever happen to you).

vintermann5mo ago

If you have an architecture already, obviously it's hard to change and you may want to postpone it until those edge cases which probably won't ever happen to you, happen. But for new architectures, value your own grey hairs over small performance improvements.

danudey5mo ago

Perhaps you can clarify something for me, because I think I'm missing it.

> Norwegian PNs have your birth date (in DDMMYY format) as the first six digits

So presumably the format is DDMMYYXXXXX (for some arbitrary number of X's), where the XXX represents e.g. an automatically incrementing number of some kind?

Which means that if it's DDMMYYXXX then you can only have 1000 people born on DDMMYY, and if it's DDMMYYXXXXX then you can have 100,000 people born on DDMMYY.

So in order for there to be so many such entries in common that people are denied use of their actual birthday, then one of the following must be true:

1. The XXX counter must be extremely small, in order for it to run out as a result of people 'using up' those Jan 1 dates each year

2. The number of people born on Jan 1 or immigrating to Norway without knowledge of their birthday must be colossal

If it was just DDMMXXXXX (no year) then I can see how this system would fall apart rapidly, but when you're dealing with specifically "people born on Jan 1 2014 or who immigrated to Norway and didn't know their birthday and were born on/around 2014 so that was the year chosen" I'm not sure how that becomes a sufficiently large number to cause these issues. Perhaps this only occurs in specific years where huge numbers of poorly-documented refugees are accepted?

(Happy to be educated, as I must be missing something here)

cycomanic5mo ago

Like the other poster said, this is a problem with default values not encoding the birthday into the personnummer.

I think it also is important to remember the purpose of specific numbers. For instance I would argue a PN without the birthday would be strictly worse. With the current system (I only know the Swedish one, but assume it's the same) I only have to remember a 4 digit (because the number is bdate + unique 4 digits). If we would instead use completely random numbers I would have to remember at least an 8 digit number (and likely to be future proof you'd want at least 9 digits). Sure that's fine for myself (although I suspect some people already struggle with it), but then I also have to remember the numbers for my 2 kids and my partner and things become quickly annoying. Especially, because one doesn't use the numbers often enough that it becomes easy, but still often enough that it becomes annoying to look up, especially when one doesn't always cary their phone with them.

guhcampos5mo ago

It's not that bad. Brazilian CPF are 11 numbers and everyone remembers them. You just get use to it =)

sgarland5mo ago

> Permanent identifiers should not carry data.

Did you read the article? He doesn’t recommend natural keys, he recommends integer-based surrogates.

> A prime example of premature optimization.

Disagree. Data is sticky, and PKs especially so. Moreover, if you’re going to spend time optimizing anything early on, it should be your data model.

> Don't make decisions you will regret just to shave off a couple of milliseconds!

A bad PK in some databases (InnoDB engine, SQL Server if clustered) can cause query times to go from sub-msec to tens of msec quite easily, especially with cloud solutions where storage isn’t node-local. I don’t just mean a UUID; a BIGINT PK on a 1:M can destroy your latency for the simple reason of needing to fetch a separate page for every record. If instead the PK is a composite of (<linked_id>, id) - e.g. (user_id, id) - where id is a monotonic integer, you’ll have WAY better data locality.

Postgres suffers a different but similar problem with its visibility map lookups.

liuliu5mo ago

I read it (and regret it is a waste of my time). Their arguments are:

* integer keys are faster;

* uuidv7 keys are faster;

* if you want obfuscated keys, using integer and do some your own obfuscation (!!!).

I can get on-board of uuidv7 (with the trade-off, of course, on stronger guessability). The integer keys argument is strange. At that point, you need to come up with a custom-built system to avoid id collision in a distribution system and tries to achieve only 2x saving (the absolute minimal you should do is 64-bit keys). Very puzzling suggestion and to me very wrong.

Note that in this entire article, the recommendation is not about using natural keys (email address, some composite of user identification etc.), so I am skipping that whole discussion.

sgarland5mo ago

You can hand out chunks of sequential ids from a central coordinator to avoid collision; this is a well-established pattern.

Re: natural keys (or something like it), I was using it as an example of how badly PK choice can impact performance at scale.

1 more reply

PunchyHamster5mo ago

> (with the trade-off, of course, on stronger guessability).

you're not guessing 2^72 bit random number. And if guessing UUID does something in your app, you already fucked up

lukeschlather5mo ago

> Did you read the article? He doesn’t recommend natural keys, he recommends integer-based surrogates.

I am not a cryptographer, but I would want his recommendation reviewed by a cryptographer. And then I would have to implement it. UUIDs have been extensively reviewed by cryptographers, I have a variety of excellent implementations I can use, I know they solve the problem well. I know they can cause performance issues; they're a security feature that is easy to implement, and I can deal with the performance issues if and when they crop up. (Which, in my experience, it's unusual. Even at a large company, most databases I encounter do not have enough data. I will err on the side of security until it becomes a problem, which is a good problem to have.)

alerighi5mo ago

Why they are a security feature? They are not, the article even says it. Even if UUID4 are random, nobody guarantees that they are generated with a cryptographically secure random number generator, and in fact most implementations don't!

The reason why in a lot of context you use UUID is when you have a distributed system where you want your client to decide the ID that is then stored in multiple systems that not communicate. This is surely a valid scenario for random UUID.

To me the rule is use UUID as a customer-facing ID for things that has to have an identity (e.g. a user, an order, etc) and expose it publicly through APIs, use integer ID as internal identifier that are used to create relations between entities, and interal IDs are always kept private. That way numeric ID that are more efficient remain inside the database and are used for joining data, UUID is used only for accessing the object from an API (for example) but then internally when joining (where you have to deal with a lot of rows) you can use the more efficient numeric ID.

By the way, I think that the thing of "using UUID" came from NoSQL databases, where surely you use an UUID, but also you don't have to join data. People than transposed a best practice in one scenario to SQL, where its not really that best practice...

1 more reply

PunchyHamster5mo ago

The cause is more just "not having enough bits". UUID is 128 bit. You're not running out even if you use part for timestamp, the random part will be big enough.

Like, it's a valid complaint.. just not for discussion at hand.

Also, we do live in reality and while having entirely random one might be perfect from theory of data, in reality having it be prefixed by date have many advantages performance wise.

> Permanent identifiers should not carry data. This is like the cardinal sin of data management

As long as you don't use the data and have actual fields for what's also encoded in UUID, there is absolutely nothing wrong with it, provided there is enough of the random part to get around artifacts in real life data.

Traubenfuchs5mo ago

> Norwegian PNs have your birth date

Same with Austrian social security numbers, which, in somes cases, don't contain the persons birth date and in some cases don't contain any existing date at all.

Yet many websites enforce a valid date and pull the persons birthdate from it...

mzi5mo ago

> Even if nothing changes, you can run into trouble. Norwegian PNs have your birth date (in DDMMYY format) as the first six digits. Surely that doesn't change, right?

I guess that Norway has solved it in the same or similar way as Sweden? So a person is identified by the PNR and for those systems that need to track a person over several PNR (government agencies) use PRI. And a PRI is just the first PNR assigned to a person with a 1 inserted in the middle. If that PRI is occupied, use a 2,and so on.

PRI could of course have been a UUID instead.

oblio5mo ago

> Well, wrong, since although the date doesn't change.

Someone should have told Julius Caesar and Gregory XIII that :-p

scottlamb5mo ago

> Permanent identifiers should not carry data.

I think you're attacking a straw man. The article doesn't say "instead of UUIDv4 primary keys, use keys such as birthdays with exposed semantic meaning". On the contrary, they have a section about how to use sequence numbers internally but obfuscated keys externally. (Although I agree with dfox's and formerly_proven's comments [1, 2] that XOR method they proposed for this is terrible. Reuse of a one-time pad is probably the most basic textbook example of bad cryptography. They referred to the values as "obfuscated" so they probably know this. They should have just gone with a better method instead.)

[1] https://news.ycombinator.com/item?id=46272985

[2] https://news.ycombinator.com/item?id=46273325

naasking5mo ago

I don't think the objection is that it exposes semantic meaning, but that any meaningful information is contained within the key at all, eg. even a UUID that includes timestamp information about when it was generated is "bad" in a sense, as it leaks information. Unique identifiers should be opaque and inherently meaningless.

scottlamb5mo ago

Your understanding is inconsistent with the examples in vintermann's comment. Using a sequence number as an internal-only surrogate key (deliberately opaqued when sent outside the bounds of the database) is not the same as sticking gender identity, birth date, or any natural properties of a book into a broadly shared identifier.

1 more reply

patmorgan235mo ago

Insert order or time is information. And if you depend on that information you are going to be really disappointed when back dated records have to be inserted.

scottlamb5mo ago

Right, to ensure your clients don't depend on that information, make the key opaque outside the database through methods such as the ones dfox and formerly_proven suggested, as I said.

maxbond5mo ago

> Permanent identifiers should not carry data.

Do you have the same criticism for serial identifiers? How about hashes? What about the version field in UUIDs?

asah5mo ago

counterpoint: IRL, data values in a system like PostgreSQL are padded to word boundaries so either you're wasting bits or "carrying data."

stickfigure5mo ago

This is incredibly database-specific. In Postgres random PKs are bad. But in distributed databases like Cockroach, Google Cloud Datastore, and Spanner it is the opposite - monotonic PKs are bad. You want to distribute load across the keyspace so you avoid hot shards.

HendrikHensen5mo ago

In Google Cloud Bigtable we had the issue that our domain's primary key was a sequential integer autogenerated by another app. So we just reversed it, and it distributed automatically quite nicely.

bruckie5mo ago

Doing this (reversing the bits of sequential ids) is common guidance inside Google for its distributed datastores like Spanner and Bigtable.

dap5mo ago

It is, although you can have sharded PostgreSQL, in which case I agree with your assessment that you want random PKs to distribute them.

It's workload-specific, too. If you want to list ranges of them by PK, then of course random isn't going to work. But then you've got competing tensions: listing a range wants the things you list to be on the same shard, but focusing a workload on one shard undermines horizontal scale. So you've got to decide what you care about (or do something more elaborate).

hedora5mo ago

It's also application specific. If you have workload that's write heavy, has temporal skew and is highly concurrent, but rarely creates new records, you're probably better off with a random PK, even in PG.

whizzter5mo ago

Even in a distributed database you want increasing (even if not monotonic) keys since the underlying b-tree or whatever will very likely behave badly for entirely random data.

UUIDv7 is very useful for these scenarios since

A: A hash or modulus of the key will be practically random due to the lower bits being random or pseudo-random (ie distributes well between nodes)

B: the first bits are sortable.. thus the underlying storage on each node won't go bananas.

dspillett5mo ago

I wouldn't say it is incredibly database specific, it is more database type specific. For most general, non-sharded, databases, random key values can be a problem as they lead to excess fragmentation in b-trees and similar structures.

kortex5mo ago

As long as the key has sufficient entropy (i.e. not monotonic sequential ints), that ensures the keyspace is evenly distributed, correct? So UUID>=v4, ULID, KSUID, possibly snowflake, should be fine for the sake of even distribution of the hashes.

jakeydus5mo ago

I think they address this in the article when they say that this advice is specific to monolithic applications, but I may be misremembering (I skimmed).

andy_ppp5mo ago

Are you saying a monolith cannot use a distributed database?

jakeydus5mo ago

I'm not making any claims at all, I was just adding context from my recollection of the article that appeared to be missing from the conversation.

Edit: What the article said: > The kinds of web applications I’m thinking of with this post are monolithic web apps, with Postgres as their primary OLTP database.

So you are correct that this does not disqualify distributed databases.

PunchyHamster5mo ago

UUIDv7 is not monotonic tho

nonethewiser5mo ago

>In Postgres random PKs are bad

Why?

nutjob25mo ago

> You want to distribute load across the keyspace so you avoid hot shards.

This is just another case of keys containing information and is not smart.

The obvious solution is to have a field that drives distribution, allowing rebalancing or whatever.

stickfigure5mo ago

This is something you should discuss with the developers at Cockroach Labs, Google Cloud, et al.

As a consumer of these databases we're stuck with them as designed, which means we have to worry about key distribution.

conradludgate5mo ago

100%. You can use rendezvous hashing to determine the shard(s). The hash of a sequence should be randomly distributed as changing the LSB should propagate to 50% change in the output bits.

benterix5mo ago

The article sums up some valid arguments against UUIDv4 as PKs but the solution the author provides on how to obfuscate integers is probably not something I'd use in production. UUIDv7 still seems like a reasonable compromise for small-to-medium databases.

mort965mo ago

I tend to avoid UUIDv7 and use UUIDv4 because I don't want to leak the creation times of everything.

Now this doesn't work if you actually have enough data that the randomness of the UUIDv4 keys is a practical database performance issue, but I think you really have to think long and hard about every single use of identifiers in your application before concluding that v7 is the solution. Maybe v7 works well for some things (e.g identifiers for resources where creation times are visible to all with access to the resource) but not others (such as users or orgs which are publicly visible but without publicly visible creation times).

nbadg5mo ago

I'm also not a huge fan of leaking server-side information; I suspect UUIDv7 could still be used in statistical analysis of the keyspace (in a similar fashion to the german tank problem for integer IDs). Also, leaking data about user activity times (from your other comment) is a *really* good point that I hadn't considered.

I've read people suggest using a UUIDv7 as the primary key and a UUIDv4 as a user-visible one as a remedy.

My first thought when reading the suggestion was, "well but you'll still need an index on the v4 IDs, so what does this actually get you?" But the answer is that it makes joins less expensive; you only require the index once, when constructing the query from the user-supplied data, and everything else operates with the better-for-performance v7 IDs.

To be clear, in a practical sense, this is a bit of a micro-optimization; as far as I understand it, this really only helps you by improving the data locality of temporally-related items. So, for example, if you had an "order items" table, containing rows of a bunch of items in an order, it would speed up retrieval times because you wouldn't need to do as many index traversals to access all of the items in a particular order. But on, say, a users table (where you're unlikely to be querying for two different users who happen to have been created at approximately the same time), it's not going to help you much. Of course the exact same critique is applicable to integer IDs in those situations.

Although, come to think of it, another advantage of a user-visible v4 with v7 Pk is that you could use a different index type on the v4 ID. Specifically, I would think that a hash index for the user-visible v4 might be a halfway-decent way to go.

I'm still not sure either way if I like the idea, but it's certainly not the craziest thing I've ever heard.

thayne5mo ago

I think a bigger benefit from doing that would be that inserts would be cheaper. Instead of an expensive insert into the middle of an index for every table that needs an index on that key, you can do a cheaper insert at the end of the index for all of them except for the one that uses uuid4.

But if you are doing that, why not just use an incrementing integer instead of a uuidv7?

1 more reply

throw0101a5mo ago

> I tend to avoid UUIDv7 and use UUIDv4 because I don't want to leak the creation times of everything.

See perhaps "UUIDv47 — UUIDv7-in / UUIDv4-out (SipHash‑masked timestamp)":

* https://github.com/stateless-me/uuidv47

* Sept 2025: https://news.ycombinator.com/item?id=45275973

wongarsu5mo ago

If that kind of stuff is on the able you can also use boring 64bit integer keys and encrypt those (e.g. [1]). Which in the end is just a better thought out version of what the article author did.

UUIDv47 might have a space if you need keys generated on multiple backend servers without synchronization. But it feels very niche to me.

1: https://wiki.postgresql.org/wiki/XTEA_(crypt_64_bits)

cdmckay5mo ago

Out of curiosity, why is it an issue if you leak creation time?

bruce5115mo ago

The issue will be very context specific. In other words to (reasonably) answer the question, we'd have to judge each application individually.

For one example, say you were making voting-booth software. You really don't want a (hidden) timestamp attached to each vote (much less an incrementing id) because that would break voter confidentiality.

More generally, it's more a underlying principle of data management. Not leaking ancillary data is easier to justify than "sure we leak the date and time of the record creation, but we can't think of a reason why that matters."

Personally I think the biggest issue are "clever" programmers who treat the uuid as data and start displaying the date and time. This leads to complications ("that which is displayed, the customer wants to change"). It's only a matter of time before someone declares the date "wrong" and it must be "fixed". Not to mention time zone or daylight savings conversions.

1 more reply

mort965mo ago

Well you're leaking user data. I'm sure you can imagine situations where "the defendant created an account on this site on this date" could come up. And the user could have created that account not knowing that the creation date is public, because it's not listed anywhere in the publicly viewable part of the profile other than the UUID in the URL.

2 more replies

natch5mo ago

If your system (pseudo-) random number generator (RNG) is compromised to derive a portion of its entropy from things that are knowable by knowing the time when the function ran, then the search space for cracking keys created around the same time can be shrunken considerably.

This doesn’t even rely on your system’s built-in RNG being low quality. It could be audited and known to avoid such issues but you could have a compromised compiler or OS that injects a doctored RNG.

kreetx5mo ago

E.g, if your service users have timestamp as part of the key and this data is visible to other users, you would know when that account was created. This could be an issue.

saaspirant5mo ago

There was a HN comment about competitors tracking how many new signups are happening and increasing the discounts/sales push based on that. Something like this.

2 more replies

dboreham5mo ago

Apart from all the other answers here: an external entity knowing the relative creation time for two different accounts, or just that the two accounts were created close in time to each other can represent a meaningful information leak.

robertlagrant5mo ago

Depends on the data. If you use a primary key in data about a person that shouldn't include their age (e.g. to remove age-based discrimination) then you are leaking an imperfect proxy to their age.

1 more reply

Bombthecat5mo ago

Admins, early users, founders, CEOs etc etc would have althe lowest creation time...

barrkel5mo ago

You shouldn't generally use PKs as public identifiers, least of all UUIDs, which are pretty user hostile.

mort965mo ago

I really don't see the issue with having a UUID in a URL.

formerly_proven5mo ago

If all you want is to obfuscate the fact that your social media site only has 200 users and 80 posts, simply use a permutation over the autoincrement primary key. E.g. IDEA or CAST-128, then encode in base64. If someone steps on your toes because somewhere in your codebase you're using a forbidden legacy cipher, just use AES-128. (This is sort of the degenerate/tautological base case of format-preserving encryption)

(What do you think Youtube video IDs are?)

enz5mo ago

The problem with this approach is that you now have to manage a secret key/secret for a (maybe) a very long time.

I shared this article a few weeks ago, discussing the problems with this kind of approach: https://notnotp.com/notes/do-not-encrypt-ids/

I believe it can make sense in some situations, but do you really want to implement such crypto-related complexity?

formerly_proven5mo ago

The article is self-contradictory in that it acts like that key is super-important ("Operations becomes a nightmare. You now have a cryptographic secret to manage. Where does this key live? Protected by a wrapping key living in a KMS or HSM? Do you use the same key across prod, staging, and dev? If dev needs to test with prod data, does it need access to prod encryption keys? What about CI pipelines? Local developer machines?") but then also acknowledges that we're talking about an obfuscation layer of stuff which is not actually sensitive ("to hide timestamps that aren't sensitive"). Don't get me wrong, it's a definitive drawback for scaling the approach, but most applications have to manage various secrets, most of which are actually important. E.g. session signing keys, API keys etc. It's still common for applications to use signed session with RCE data formats. The language from that article, while not wrong, is much more apt for those keys.

That being said, while fine for obfuscation, it should not be used for security for this purpose, e.g. hidden/unlisted links, confirmation links and so on. Those should use actual, long-ish random keys for access, because the inability to enumerate them is a security feature.

benterix5mo ago

I always thought they are used and stored as they are because the kind of transformation you mention seems terribly expensive given the YT's scale, and I don't see a clear benefit of adding any kind of obfuscation here.

pdimitar5mo ago

> What do you think Youtube video IDs are?

I actually haven no idea. What are they?

(Also what is the format of their `si=...` thing?)

intalentive5mo ago

Can’t recall where I heard this, but I’m pretty sure the si=… is tracking information that associates the link with the user who shared it.

1 more reply

wpollock5mo ago

YouTube video ids are just integers in a base-64 encoding, modified to be URL safe.

1 more reply

Retr0id5mo ago

Why not use AES-128 by default? Your CPU has instructions to accelerate AES-128.

conradfr5mo ago

Can't you just change the starting value of your sequence?

stickfigure5mo ago

In Postgres I often like to use a single sequence for everything. It leaks some information yes but in a busy system it tends to be "obscure enough".

x0x05mo ago

It's not leaking that's the concern. It's that not having the names of objects be easily enumerable is a strongly security-enhancing feature of a system.

Yes of course everyone should check and unit test that every object is owned by the user or account loading it, but demanding more sophistication from an attacker than taking "/my_things/23" and loading "/my_things/24" is a big win.

stickfigure5mo ago

With a single sequence and a busy system, the ids for most high-level tables/collection are extremely sparse. This doesn't mean they can't be enumerated, but you will probably notice if you suddenly start getting hammered with 404s or 410s or whatever your system generates on "not found".

Also, if most of your endpoints require auth, this is not typically a problem.

It really depends on your application. But yes, that's something to be aware of. If you need some ids to be unguessable, make sure they are not predictable :-)

2 more replies

gwbas1c5mo ago

I work on an application where we encrypt the integer primary key and then use the bytes to generate something that looks like a UUID.

In our case, we don't want database IDs in an API and in URLs. When IDs are sequential, it enables things like dictionary attacks and provides estimates about how many customers we have.

Encrypting a database ID makes it very obvious when someone is trying to scan, because the UUID won't decrypt. We don't even need a database round trip.

puilp05025mo ago

Few questions:

* How do you manage the key for encrypting IDs? Injected to app environment via envvar? Just embedded in source code? I ask this because I'm curious as to how much "care" I should be putting in into managing the secret material if I were to adopt this scheme.

* Is the ID encrypted using AEAD scheme (e.g. AES-GCM)? Or does the plain AES suffice? I assume that the size of IDs would never exceed the block size of AES, but again, I'm not a cryptographer so not sure if it's safe to do so.

gwbas1c5mo ago

> How do you manage the key for encrypting IDs?

The same way we manage all other secrets in the application. (Summarized below)

> Is the ID encrypted using AEAD scheme (e.g. AES-GCM)? Or does the plain AES suffice? I assume that the size of IDs would never exceed the block size of AES, but again, I'm not a cryptographer so not sure if it's safe to do so.

I don't have the source handy at the moment. It's one of the easier to use symmetric algorithms available in .Net. We aren't talking military-grade security here. In general: a 32-bit int encrypts to 64-bits, so we pad it with a few unicode characters so it's 64-bits encrypted to 128 bits.

---

As far as managing secrets in the application: We have a homegrown configuration file generator that's adapted to our needs. It generates both the configuration files, and strongly-typed classes to read the files. All configuration values are loaded at startup, so we don't have to worry about runtime errors from missing configuration values.

Secrets (connection strings, encryption keys, ect,) are encrypted in the configuration file as base64 strings. The certificate to read/write secrets are stored in Azure Keyvault.

The startup logic in all applications is something like:

1: Determine the environment (production, qa, dev)

2: Get the appropriate certificate

3: Read the configuration files, including decrypting secrets (such as the primary key encryption keys) from the configuration files

4: Populate the strongly-typed objects that hold the configuration values

5: These objects are dependency-injected to runtime objects

sedatk5mo ago

That sounds quite troublesome if the encryption key is lost, compromised, or rotated for any other reason.

cowboylowrez5mo ago

Key management seems to be as important as backups but I understand that something so small (an encryption key) could seem unimportant because database backups are so big lol but they really do share important attributes (do not lose your keys, do not lose your data, do not expose your keys, do not expose your data, etc etc)

gwbas1c5mo ago

The key will never be rotated. (There's no reason to.)

We're not worried about key compromises.

If the key is lost, we have much bigger problems.

dotancohen5mo ago

From the fine article:

  > Random values don’t have natural sorting like integers or lexicographic (dictionary) sorting like character strings. UUID v4s do have "byte ordering," but this has no useful meaning for how they’re accessed.

Might the author mean that random values are not sequential, so ordering them is inefficient? Of course random values can be ordered - and ordering by what he calls "byte ordering" is exactly how all integer ordering is done. And naive string ordering too, like we would do in the days before Unicode.

kreetx5mo ago

Using an UUIDv4 as primary key is a trade-off: you use it when you need to generate unique keys in a distributed manner. Yes, these are not datetime ordered and yes, they take 128 bits of space. If you can't live with this, then sure, you need to consider alternatives. I wonder if "Avoid UUIDv4 Primary Keys" is a rule of thumb though.

dotancohen5mo ago

If one needs timestamp ordering, then UUIDv7 is a good alternative.

But the author does not say timestamp ordering, he says ordering. I think he actually means and believes that there is some problem ordering UUIDv4.

kreetx5mo ago

Yup. There are alternatives depending on what the situation is: with non-distributed, you could just use a sufficiently sized int (which can be rather small when the table is for e.g humans). You could add a separate timestamp column if that is important.

But if you need UUID-based lookup, then you might as well have it as a primary key, as that will save you an extra index on the actual primary key. If you also need a date and the remaining bits in UUIDv7 suffice for randomness, then that is a good option too (though this does essentially amount to having a composite column made up of datetime and randomness).

marcosdumay5mo ago

> you use it when you need to generate unique keys in a distributed manner

Just to complement this with a point, but there isn't any mainstream database management system out there that is distributed on the sense that it requires UUIDs to generate its internal keys.

There exist some you can find on the internet, and some institutions have internal systems that behave this way. But as a near universal rule, the thing people know as a "database" isn't distributed on this sense, and if the column creation is done inside the database, you don't need them.

torginus5mo ago

I do not understand why 128 bits is considered too big - you clearly can't have less, as on 64 bits the collision probability on real world workloads is just too high, for all but the smallest databases.

Auto-incrementing keys can work, but what happens when you run out of integers? Also, distributed dbs probably make this hard, and they can't generate a key on client.

There must be something in Postgres that wants to store the records in PK order, which while could be an okay default, I'm pretty sure you can this behavior, as this isn't great for write-heavy workloads.

anarazel5mo ago

The issue is more fundamental - if you have purely random keys, there's basically no spatial locality for the index data. Which means that for decent performance your entire index needs to be in memory, rather than just recent data. And it means that you have much bigger write amplification, since it's rare that the same index page is modified multiple times close-enough in time to avoid a second write.

vbezhenar5mo ago

You won't run out of 64-bit integer. IMO, 64-bit integer (and even less for some tables that's not expected to grow much) it the best approach for internal database ID. If you want to expose ID, it might make sense to introduce second UUID for selected tables, if you want to hide internal ID.

p1necone5mo ago

I doubt many real world use cases would run out of incrementing 64 bit ids - collisions if they were random sure, but i64 max is 9,223,372,036,854,775,807 - if each row took only 1 bit of space, that would be slightly more than an exabyte of data.

andatki5mo ago

Hi there. Thanks for the feedback. I updated that section to hopefully convey the intent more. The type of ordering we care about for this topic is really B-Tree index traversal when inserting new entries and finding existing entries (single and multiple values i.e. an IN clause, updates, deletes etc). There's a compelling example I re-created from Cybertec showing the pages needed and accessed for equivalent user-facing results, comparing storing PKs as big integers vs. UUID v4s, and how many more pages were needed for v4 UUIDs. I found that to be helpful to support my real world experience as a consultant on various "medium sized" Postgres databases (e.g. single to 10s of millions of records) where clients were experiencing excessive latency for queries, and the UUID v4 PK/FKs selection made for reasons earlier was one of the main culprits. The indexes wouldn’t fit into memory resulting in a lot of sequential scans. I’d confirm this by showing an alternative schema design and set of queries where everything was the same except integer PKs/FKs were used. Smaller indexes (fit in memory), reliable index scans, less latency, faster execution time.

torginus5mo ago

To be polite, I don't think this article rests on sound technical foundations.

mrinterweb5mo ago

How so?

K0nserv5mo ago

Isn't part of this that inserting into a btree index is more performant when the keys are increasing rather than being random? A random id will cause more re-balancing operations than always inserting at the end. Increasing ids are also more cache friendly

sgarland5mo ago

Yes, and for Postgres, it also causes WAL bloat due to the high likelihood of full page writes.

dotancohen5mo ago

Could you expand on this? I use postgres often and though I could have an LLM explain what you mean, I think I'd learn more to hear it from you. Thank you.

dagss5mo ago

The point is how closely located data you access often is. If data is roughly sorted by creation time then data you access close to one another in time is stored close to one another on disk. And typically access to data is correlated with creation time. Not for all tables but for many.

Accessing data in totally random locations can be a performance issue.

Depends on lots of things ofc but this is the concern when people talk about UUID for primary keys being an issue.

hashmush5mo ago

Agree, I did a double take on this too.

Values of the same type can be sorted if a order is defined on the type.

It's also strange to contrast "random values" with "integers". You can generate random integers, and they have a "sorting" (depending on what that means though)

dev_l1x_be5mo ago

Why would you need to order by UUID? I am missing something here. Most of the time we use UUID keys for being able to create a new key without coordination and most of the time we do not want to order by primary key.

sagarm5mo ago

Most common database indexes are ordered, so if you are using UUIDv4 you will not only bloat the index you will also have poor locality. If you try to use composite keys to fix locality, you'll end up with an even more bloated index.

hans_castorp5mo ago

I have seen a lot of people sort by (generated) integer values to return the rows "in creation order" assuming that sorting by an integer is somehow magically faster than sorting by a proper timestamp value (which give a more robust "creation order" sorting than a generated integer value).

sgarland5mo ago

Assuming the integer value is the PK, it can in fact be much faster for MySQL / MariaDB due to InnoDB’s clustering index. If it can do a range scan over the PK, and that’s also the ORDER BY (with matching direction), congratulations, the rows are already ordered, no sort required. If it has to do a secondary index lookup to find the rows, this is not guaranteed.

crest5mo ago

Any fixed sized bitstring has an obvious natural ordering, but since they're allocated randomly they lack the density and locality of sequential allocation.

iainctduncan5mo ago

Counterargument... I do technical diligence so I talk to a lot of companies at points of inflection, and I also talk to lots who are stuck.

The ability to rapidly shard everything can be extremely valuable. The difference between "we can shard on a dime" and "sharding will take a bunch of careful work" can be expensive If the company has poor margins, this can be the difference between "can scale easily" and "we're not getting this investment".

I would argue that if your folks have the technical chops to be able to shard while avoiding surrogate guaranteed unique keys, great. But if they don't.... a UUID on every table can be a massive get-out-of-jail free card and for many companies this is much, much important than some minor space and time optimizations on the DB.

Worth thinking about.

jessep5mo ago

Sort of related, but we had to shard as usage grew and didn’t have uuids and it was annoying. Wasn’t the most annoying bit though. Whole thing is pretty complex regardless of uuid, if you have a highly interconnected data model that needs to stay online while migrating.

iainctduncan5mo ago

Right, but if you start off with uuids and the expectation that you might use them to shard, you'll wind up factoring that into the data model. Retrofitting, as you rightly say, can be much harder.

jryio5mo ago

I also do technical diligence, often times the blocker to sharding is the target company's tenancy model and schema. PK data type is certainly a blocker.

Definitely an issue, rarely the main one. You can work around integer PKs with composite keys or offset-based sharding schemes. What you can't easily fix is a schema with cross-tenant foreign keys, shared lookup tables, or a tenancy model that wasn't designed for data isolation from day one. Those are architectural decisions that require months of migration work.

UUIDs buy you flexibility, sure. But if your data model assumes everything lives in one database, the PK type is a sub-probem of your problems.

mrinterweb5mo ago

I can see how sharding could be difficult with a bigint FK, but UUIDv7 would still play nice, if I understand your point correctly. Monotonically increasing foreign keys have performance benefits over random UUIDv4 FKs in postgresql is the point of the article.

ivan_gammel5mo ago

The is article is about a solution in search of a problem, a classic premature optimization issue. UUIDv4 is perfectly fine for many use cases, including small databases. Performance argument must be considered when there’s a problem with performance on the horizon. Other considerations may be and very often superior to that.

sagarm5mo ago

It's not really feasible to rekey your UUIDv4 keyed database to int64s after the fact, imo. Sure your new tables could be integer-keyed, but the bulk of your storage will be UUID (and UUIDv4, if that's what you started with) for a very long time

ivan_gammel5mo ago

Yes, sure. My point is, it may never be necessary.

sagarm5mo ago

I think you're right that it won't matter for most companies. But having been at a company with persistent DB performance issues with UUIDv4 keys as a contributing factor, it sucks.

mrkeen5mo ago

If you have too many UUIDs, throw more DBs at the problem. Hopefully you haven't tied yourself to any architectural decisions that would limit you to only using a single database.

sgarland5mo ago

IME, when performance issues become obvious, the devs are in growth mode and have no desire / time to revisit PK choice.

Integer PKs were seen as fine for years - decades, even - before the rise of UUIDs.

can3p5mo ago

> For many business apps, they will never reach 2 billion unique values per table, so this will be adequate for their entire life. I’ve also recommended always using bigint/int8 in other contexts.

I'm sure every dba has a war story that starts with similar decision in the past

xandrius5mo ago

To summarise the article: in PG, prefer using UUIDv7 over UUIDv4 as they have slightly better performance.

If you're using latest version of PG, there is a plugin for it.

That's it.

sbuttgereit5mo ago

You might have missed the big H2 section in the article:

"Recommendation: Stick with sequences, integers, and big integers"

After that then, yes, UUIDv7 over UUIDv4.

This article is a little older. PostgreSQL didn't have native support so, yeah, you needed an extension. Today, PostgreSQL 18 is released with UUIDv7 support... so the extension isn't necessary, though the extension does make the claim:

"[!NOTE] As of Postgres 18, there is a built in uuidv7() function, however it does not include all of the functionality below."

What those features are and if this extension adds more cruft in PostgreSQL 18 than value, I can't tell. But I expect that the vast majority of users just won't need it any more.

tmountain5mo ago

Sticking with sequences and other integer types will cause problems if you need to shard later.

zwnow5mo ago

Especially in larger systems, how does one solve the issue of reaching the max value of an integer in their database? Sure for unsigned bigint thats hard to achieve but regular ints? Apps quickly outgrow that.

1 more reply

sgarland5mo ago

There are plenty of ways to deal with that. You can shard by some other identifier (though I then question your table design), you can assign ranges to each shard, etc.

bigmadshoe5mo ago

I’m really no expert on sharding but if you’re using increasing ints why can’t you just shard on (id % n) or something?

1 more reply

SoftTalker5mo ago

This is mentioned, and in many applications you can safely say you will never need to shard.

1 more reply

hans_castorp5mo ago

With the latest Postgres version (>= 18) you do NOT need a plugin

K0nserv5mo ago

An additional thing I learned when I worked on a ulid alternative over the weekend[0] is: Postgres's internal Datum type is at most 64 bits which means every uuid requires heap allocation[1] (at least until we get 128 bit machines).

0: https://bsky.app/profile/hugotunius.se/post/3m7wvfokrus2g

1: https://github.com/postgres/postgres/blob/master/src/backend...

qntmfred5mo ago

you may be interested in this postgres extension as well

https://github.com/blitss/typeid-postgres

K0nserv5mo ago

Finished an initial version of mine https://github.com/k0nserv/plid

K0nserv5mo ago

That is indeed interesting.

I have slightly different goals for my version. I want everything to fit in 128 bits so I'm sacrificing some of the random bits, I'm also making sure the representation inside Postgres is also exactly 128 bits. My initial version ended up using CBOR encoding and being 160 bits.

Mine dedicates 16 bits for the prefix allowing up to 3 characters (a-z alphabet).

grugdev425mo ago

A much simpler solution is to keep your tables as they are (with an integer primary key), but add a non sequential public identifier too.

id => 123, public_id => 202cb962ac59075b964b07152d234b70

There are many ways to generate the public_id. A simple MD5 with a salt works quite well for extremely low effort.

Add a unique constraint on that column (which also indexes it), and you'll be safe and performant for hundreds of millions of rows!

Why do we developers like to overcomplicate things? ;)

Retr0id5mo ago

Per https://news.ycombinator.com/item?id=46273325, if you use a block cipher rather than a hash then you don't even need to store it anywhere.

Denvercoder95mo ago

This misses the point. The reason not to use UUIDv4 is that having an index on random values is slow(er), because sequential inserts into the underlying B-tree are faster than random inserts. You're hitting the same problem with your `public_id` column, that it's not the primary key doesn't change that.

sgarland5mo ago

For InnoDB-based DBs that are not Aurora, and if the secondary index isn’t UNIQUE, it solves the problem, because secondary non-unique index changes are buffered and written in batches to amortize the random cost. If you’re hashing a guaranteed unique entity, I’d argue you can skip the unique constraint on this index.

For Aurora MySQL, it just makes it worse either way, since there’s no change buffer.

hnfong5mo ago

Ints as pk would be quicker for joins etc though.

grugdev425mo ago

Exactly

mcny5mo ago

Postgresql 18 released in September and has uuidv7

https://www.postgresql.org/docs/current/functions-uuid.html

mexicocitinluez5mo ago

You'll have to rip the ability to generate unique numbers from quite literally anywhere in my app and save them without conflict from my cold, dead hands.

The ability to know ahead of time what a primary key will be (in lieu of persisting it first, then returning) opened up a whole new world of architecting work in my app. It made a lot of previous awkward things feel natural.

sgarland5mo ago

Sounds like a lot of referential integrity violations.

mexicocitinluez5mo ago

Why would generating a PK ahead of time cause referential integrity violations? Super curious to find out.

sgarland5mo ago

The implication is that you need to know the PK ahead of time so that you can insert it into other tables which reference it as an FK without waiting for it to be returned, which further implies that you don’t have FK constraints, because the DB would disallow this.

Tbf in Postgres, you can declare FKs to be deferrable, so their existence is checked at transaction commit, rather than at insertion time.

If you don’t have the DB enforcing referential integrity, you need to be extremely careful in your application logic; IME, this inevitably fails. At some point, someone writes bad code, and you get data anomalies.

1 more reply

FancyFane5mo ago

Even MySQL benefits from these changes as well. What we're really discussing is random primary key inserts (UUIDv4) vs incrementing primary key inserts (UUIDv6 or v7).

PlanetScale wrote up a really good article on why incrementing primary keys are better for performance when compared to randomly inserted primary keys; when it comes to b-tree performance. https://planetscale.com/blog/btrees-and-database-indexes

stevefan19995mo ago

Try Snowflake ID then: https://en.wikipedia.org/wiki/Snowflake_ID

gizzlon5mo ago

Leaks time and order, just like uuid v7.

0x4575mo ago

Kinda the point? Ordered PKs give you a better index.

dspillett5mo ago

> Do not assume that UUIDs are hard to guess; they should not be used as security capabilities

It is not just about being hard to guess a valid individual identifier in vacuum. Random (or at least random-ish) values, be they UUIDs or undecorated integers, in this context are also about it being hard to guess one from another, or a selection of others.

Wrt: "it isn't x it is y" form: I'm not an LLM, 'onest guv!

dfox5mo ago

> Creating obfuscated values using integers

While that is often neat solution, do not do that by simply XORing the numbers with constant. Use a block cipher in ECB mode (If you want the ID to be short then something like NSA's Speck comes handy here as it can be instantiated with 32 or 48 bit block).

And do not even think about using RC4 for that (I've seen that multiple times), because that is completely equivalent to XORing with constant.

hippo225mo ago

The author should include benchmarks otherwise, saying that UUIDs “increase latency” is meaningless. For instance, how much longer does it take to insert a UUID vs. an integer? How much longer does scanning an index take?

jakeydus5mo ago

The author doesn't reference any tests that they themselves ran, but they did link a cybertec article [0] with some benchmarks.

[0] https://www.cybertec-postgresql.com/en/unexpected-downsides-...

aynyc5mo ago

I've seen this type of advice a few times now. Now I'm not a database expert by any stretch of imagination, but I have yet to see UUID as primary key in any of the systems I've touched.

Are there valid reasons to use UUID (assuming correctly) for primary key? I know systems have incorrectly expose primary key to the public, but assuming that's not the concern. Why use UUID over big-int?

rcxdude5mo ago

Uuids also allow the generation of the ID to seperate from the insertion into the database, which can be useful in distributed systems.

cruffle_duffle5mo ago

I mean this is the primary reason right here! You can pre-create an entire tree of relationships client side and ship it off to the database with everything all nice and linked up. And since by design each PK is globally unique you’ll never need to worry about constraint violations. It’s pretty damn nice.

ellisv5mo ago

About 10 years ago I remember seeing a number of posts saying "don't use int for ids!". Typically the reasons were things like "the id exposes the number of things in the database" and "if you have bad security then users can increment/decrement the id to get more data!". What I then observed was a bunch of developers rushing to use UUIDs for everything.

UUIDv7 looks really promising but I'm not likely to redo all of our tables to use it.

dangets5mo ago

You can use the same techniques except with the smaller int64 space - see e.g. Snowflake ID - https://en.wikipedia.org/wiki/Snowflake_ID

andatki5mo ago

Note that if you’re using UUID v4 now, switching to v7 does not require a schema migration. You’d get the benefits when working with new records, for example reduced insert latency. The uuid data type supports both.

reffaelwallen5mo ago

At my company we only use UUIDs as PKs.

Main reason I use it is the German Tank problem: https://en.wikipedia.org/wiki/German_tank_problem

(tl;dr; prevent someone from counting how many records you have in that table)

littlestymaar5mo ago

What stops you from having another uuid field as publicly visible identifier (which is only a concern for a minority of your tables).

This way you avoid most of the issues highlighted in this article, without compromising your confidential data.

jakeydus5mo ago

I'm new to the security side of things; I can understand that leaking any information about the backend is no bueno, but why specifically is table size an issue?

boruto5mo ago

In my old company new joiners are assigned an monotonic number as id in tech. GitHub profile url reflected that.

Someone may or may not have used the pattern to get to know the attrition rate through running a simple script every month))

infragreen5mo ago

This was a great read, thank you for sharing!

andatki5mo ago

Appreciate it!

assbuttbuttass5mo ago

At least for the Spanner DB, it's good to have a randomly-distributed primary key since it allows better sharding of the data and avoids "hot shards" when doing a lot of inserts. UUIDv4 is the typical solution, although a bit-reversed incrementing integer would work too

https://cloud.google.com/blog/products/databases/announcing-...

bux935mo ago

Long article about why not to use UUIDv4 as Primary Keys, but.. Who is doing so? And why are they doing that? How would you solve their requirements? Just throwing out "you can use UUIDv7" doesn't help with, e.g., the size they take up.

Aren't people using (big)ints are primary keys, and using UUIDs as logical keys for import/export, solving portability across different machines?

Sayrus5mo ago

UUIDs are usually the go-to solution to enumeration problems. The space is large enough that an attacker cannot guess how many X you have (invoices, users, accounts, organizations, ...). When people replace the ints by UUIDv4, they keep them as primary keys.

bruce5115mo ago

I'd add that it's also used when data is created in multiple places.

Consider say weather hardware. 5 stations all feeding into a central database. They're all creating rows and uploading them. Using sequential integers for that is unnecessarily complex (if even possible.)

Given the amount of data created on phones and tablets, this affects more situations than first assumed.

It's also very helpful in export / edit / update situations. If I export a subset of the data (let's say to Excel), the user can edit all the other columns and I can safely import the result. With integer they might change the ID field (which would be bad). With uuid they can change it, but I can ignore that row (or the whole file) because what they changed it to will be invalid.

nrhrjrjrjtntbt5mo ago

Yes and the DB might be columnular or a distributed KV, sidestepping the index problem.

andatki5mo ago

This was written based on working on several Postgres databases at different companies of “medium” size as a consultant, that had excessive IO and latency and used UUID v4 PKs/FKs. They’re definitely out there. We could transform the schema for some key tables as a demonstration with big int equivalents and show the IO latency reduction. With that said, the real world PK data type migration is costly but becomes a business decision of whether to do or not.

marcus_holmes5mo ago

I didn't see my primary use case for UUID's covered: sharing identifiers across entities is dangerous.

I wrote a CRUD app for document storage. It had user id's and document id's. I wrote a method GetDocumentForUser(docID, userID) that checked permissions for that user and document and returned the document if permitted. I then, stupidly, called that method with GetDocumentForUser(userID, docID), and it took me a good half hour to work out why this never returned anything.

It never returned anything because a valid userID will never be a valid docID. If I had used integers it would have returned documents, and I probably wouldn't have spotted it while testing, and I would have shipped a change that cheerfully handed people other people's documents.

I will put up with a fairly considerable amount of performance hit to avoid having this footgun lurking. And yes, I know there are other ways around this (e.g. types) but those come with their own trade-offs too.

dimitrisnl5mo ago

Noob question, but why no use ints for PK, and UUIDs for a public_id field?

alerighi5mo ago

If you put an index on the UUID field (because you have an API where you can retrieve objects with UUID) you have kind of the same problem, at least in Postgres where a primary key index or a secondary index are more or less the same (to the point is perfectly valid in pgsql to not have any primary key defined for the table, because storage on disk is done trough an internal ID and the indexes, being primary or not, just reference to the rowId in memory). Plus the waste of space of having 2 indexes for the same table.

Of course this is not always the case that is bad, for example if you have a lot of relations you can have only one table where you have the UUID field (and thus expensive index), and then the relations could use the more efficient int key for relations (for example you have an user entity with both int and uuid keys, and user attribute references the user with the int key, of course at the expense of a join if you need to retrieve one user attribute when retrieving the user is not needed).

torginus5mo ago

You can create hash indexes in Postgres, so the secondary index uuid seems workable:

https://www.postgresql.org/docs/current/hash-index.html

edding45005mo ago

*edit: sorry, misread that. My answer is not valid to your question.

original answer: because if you dont come up with these ints randomly they are sequential which can cause many unwanted situations where people can guess valid IDs and deduce things from that data. See https://en.wikipedia.org/wiki/German_tank_problem

javawizard5mo ago

Hence the presumed implication behind the public_id field in GP's comment: anywhere identifiers are exposed, you use the public_id field, thereby preventing ID guessing while still retaining the benefits of ordered IDs where internal lookups are concerned.

Edit: just saw your edit, sounds like we're on the same page!

javaunsafe20195mo ago

So We make things hard in the backend because of leaky abstractions? Doesn't make sense imo.

jcims5mo ago

Decades of security vulnerabilities and compromises because of sequential/guessable PKs is (only!) part of the reason we're here. Miss an authorization check anywhere in the application and you're spoon-feeding entire tables to anyone with the inclination to ask for it.

dsego5mo ago

I also think we can use a combination of a PID - persistent ID (I always thought it was public) and an auto-increment integer ID. Having a unique key helps when migrating data between systems or referencing a piece of data in a different system. Also, using serial IDs in URLs and APIs can reveal sensitive information, e.g. how many items there are in the database.

scotty795mo ago

One of the benefits of UUIDs is that you can easily merge data coming from multiple databases. Auto-increments cause collisions.

grim_io5mo ago

The article mentions microservices, which can increase the likelihood of collisions in sequential incremental keys.

One more reason to stay away from microservices, if possible.

bardsore5mo ago

Always try to avoid having two services using the same DB. Only way I'd ever consider sharing a DB is if only one service will ever modify it and all others only read.

grim_io5mo ago

Good luck enforcing that :)

mrkeen5mo ago

The 'collision' is two service classes both trying to use one db.

If you separate them (i.e. microservices) the they no longer try to use one db.

grim_io5mo ago

There is nothing stopping multiple microservices from using the same DB, so of course this will happen in practice.

Sometimes it might even be for a good reason.

impoppy5mo ago

This is such a mediocre article. It provides plenty of valid reasons to consider avoiding UUID in databases, however, it doesn’t say what should be used should one want primary keys that are not easy to predict. The XOR alternative is too primitive and, well, whereas I get why should I consider avoiding UUID, then what should I use instead?

mgoetzke5mo ago

That depends a lot on many factors and thus I dont like generic statements like that which tend to be more focused on a specific database pattern. That said everyone should indeed be aware of the potential tradeoffs.

And of course we could come up with many ways to generate our own ids and make them unique, but we have the following requirements.

- It needs to be a string (because we allow composing them to 'derive' keys) - A client must be able to create them (not just a server) without risk for collisions - The time order of keys must not be guessable easily (as the id is often leaked via references which could 'betray' not just the existence of a document, but also its relative creation time wrt others). - It should be easy to document how any client can safely generate document ids.

The lookup performance is not really such a big deal for us. Where it is we can do a projection into a more simple format where applicable.

BartjeD5mo ago

Personally my approach has been to start with big-ints and add a GUID code field if it becomes necessary. And then provide imports where you can match objects based on their code, if you ever need to import/export between tenants, with complex object relationships.

But that also adds complexity.

parpfish5mo ago

Two things I don’t like about big-int indexes:

- If you use uuids as foreign keys to another table, it’s obvious when you screw up a join condition by specifying the wrong indices. With int indices you can easily get plausible looking results because your join will still return a bunch of data

- if you’re debugging and need to search logs, having a simple uuid string is nice for searching

deepsun5mo ago

I still don't understand why people don't remove the hyphens from UUIDs. Hyphens makes it harder to copy-paste IDs. The only reason to keep them is to make it explicit "hey this is an UUID", otherwise it's a completely internal affair.

Even worse, some tools generate random strings and then ADD hyphens in them to look like UUID (even thought it's not, as the UUID version byte is filled randomly as well), cannot wrap my head why, e.g:

https://github.com/VictoriaMetrics/VictoriaLogs/blob/v1.41.0...

jwr5mo ago

"if you use PostgreSQL"

(in the scientific reporting world this would be the perennial "in mice")

hyperpape5mo ago

The thing is, none of us are mice, but many of us use Postgres.

It would be the equivalent of "if you're a middle-aged man" or "you're an American".

P.S. I think some of the considerations may be true for any system that uses B-Tree indexes, but several will be Postgres specific.

orthoxerox5mo ago

It's not just Postgres or even OLTP. For example, if you have an Iceberg table with SCD2 records, you need to regularly locate and update existing records. The more recent a record is, the more likely it is to be updated.

If you use UUIDv7, you can partition your table by the key prefix. Then the bulk of your data can be efficiently skipped when applying updates.

andatki5mo ago

Good addition!

kijin5mo ago

The space requirement and index fragmentation issue is nearly the same no matter what kind of relational database you use. Math is math.

Just the other day I delivered significant performance gains to a client by converting ~150 million UUIDv4 PKs to good old BIGINT. They were using a fairly recent version of MariaDB.

splix5mo ago

I think the author means all dbs that fit a single server. Because in distributed dbs you often want to spread the load evenly over multiple servers.

esafak5mo ago

To spell it out: it improves performance by avoiding hot spots.

zelphirkalt5mo ago

If they can live with making keys only in one place, then sure, this can work. If however they need something that is very highly likely unique, across machines, without the need to sync, then using a big integer is no good.

if they can live with MariaDB, OK, but I wouldn't choose that in the first place these days. Likely Postgres will also perform better in most scenarios.

kijin5mo ago

Yeah, they had relatively simple requirements so BIGINT was a quick optimization. MariaDB can guarantee uniqueness of auto-incrementing integers across a cluster of several servers, but that's about the limit.

Had the requirements been different, UUIDv7 would have worked well, too, because fragmentation is the biggest problem here.

miiiiiike5mo ago

The article is muddled, I wish he'd split it into two. One for UUID4 and another for UUID7.

I was using 64-bit snowflake pks (timestamp+sequence+random+datacenter+node) previously and made the switch to UUID7 for sortable, user-facing, pks. I'm more than fine letting the DB handle a 128-bit int vs over a 64-bit int if it means not having make sure that the latest version of my snowflake function has made it to every db or that my snowflake server never hiccups, ever.

Most of the data that's going to be keyed with a uuid7 is getting served straight out of Redis anyway.

deathanatos5mo ago

> Are UUIDs secure?

> Misconceptions: UUIDs are secure

> One misconception about UUIDs is that they’re secure. However, the RFC describes that they shouldn’t be considered secure “capabilities.”

> From RFC 41221 Section 6 Security Considerations:

> Do not assume that UUIDs are hard to guess; they should not be used as security capabilities

This is just wrong, and the citation doesn't support it. You're not guessing a 122-bit long random identifier. What's crazy is that the article, immediately prior to this, even cites the very math involved in showing exactly how unguessable that is.

… the linked citation (to §4.4, which is different from the in-prose citation) is just about how to generate a v4, and completely unrelated to the claim. The prose citation to §6 is about UUIDs generally: the statement "Do not assume that [all] UUIDs are hard to guess" is not logically inconsistent with properly-generated UUIDv4s being hard to guess. A subset of UUIDs have security properties, if the system generating & using them implements those properties, but we should not assume all UUIDs have that property.

Moreover, replacing an unguessable UUID with an (effectively random) 32-bit integer does make it guessable, and the scheme laid out seems completely insecure if it is to be used in the contexts one finds UUIDv4s being an unguessable identifier.

The additional size argument is pretty weak too; at "millions of rows", a UUID column is consuming an additional ~24 MiB.

hk13375mo ago

My biggest thing for UUIDs is don’t UUID everything. Most things should be okay with just regular integers as PKs.

p1necone5mo ago

Being able to create something and know the id of it before waiting for an http round trip simplifies enough code that I think UUIDs are worth it for me. I hadn't really considered the potential perf optimization from orderable ids before though - I will consider UUID v7 in future.

andatki5mo ago

Great!

reactordev5mo ago

I fun trick I did was generate UUID-like ids. We all can identify a UUIDv4 most of the time by looking at one. "Ah, a uuid" we say to ourselves. A little over a decade ago I was working on a massive cloud platform and rather than generate string keys like the author above suggested (int -> binary -> base62 str) we opted for a more "clever" approach.

The UUID is 128bits. The first 64bits are a java long. The last 64bits are a java long. Let's just combine the Tenant ID long with a Resource ID long to generate a unique id for this on our platform. (worked until it didn't).

nrhrjrjrjtntbt5mo ago

Atlassian settles for longer "ARIs" for this (e.g.https://developer.atlassian.com/cloud/guard-detect/developer...) composed of guids which allow a scheme like the Amazon ARN to pass around.

reactordev5mo ago

yeah, the problem for us was the resource id. What id was it? Was it a post? an upload? a workspace? it wasn't nearly as descriptive as we needed it to be.

moomoo115mo ago

I’ll do it regardless because any time I’ve tried to chase optimization early like this, hardware has always evolved faster.

We have all faced issues where we don’t know where the data will ultimate live that’s optimal for our access patterns.

Or we have devices and services doing async operations that need to sync.

I’m not working on mission critical “if this fails there’s a catastrophic event” type shit. It’s just rent seeking SaaS type shit.

Oh no it cost $0.35 extra to make $100. Next year will make more money relative to cost increase.

AtNightWeCode5mo ago

The counter argument I would say is that having all these integer ids comes with many problems. You can't make em public cause they leak info. They are not unique across environments. Meaning you have to spin up a lot of bs envs to just run it. But retros are for complaining about test envs, right?

Uuid4 are only 224bits is a bs argument. Such a made up problem.

But a fair point is that one should use a sequential uuid to avoid fragmentation. One that has a time part.

kgeist5mo ago

Some additional cases we encounter quite often where UUIDs help:

- A client used to run our app on-premises and now wants to migrate to the cloud.

- Support engineers want to clone a client’s account into the dev environment to debug issues without corrupting client data.

- A client wants to migrate their account to a different region (from US to EU).

Merging data using UUIDs is very easy because ID collisions are practically impossible. With integer IDs, we'd need complex and error-prone ID-rewriting scripts. UUIDs are extremely useful even when the tables are small, contrary to what the article suggests.

andatki5mo ago

If merging or moving data between environments is a regular occurrence, I agree it would be best to have non-colliding primary keys. I have done an environment move (new DB in different AWS region) with integers and sequences for maybe a 100 table DB and it’s do-able but a high cost task. At that company we also had the demo/customer preview environment concept where we needed to keep the data but move it.

Lucasoato5mo ago

Hi, a question for you folks. What if I don’t like to embed timestamp in uuid as v7 do? This could expose to timing attacks in specific scenarios.

Also is it necessary to show uuid at all to customers of an API? Or could it be a valid pattern to hide all the querying complexity behind named identifiers, even if it could cost a bit in terms of joining and indexing?

The context is the classic B2B SaaS, but feel free to share your experiences even if it comes from other scenarios!

lwhi5mo ago

Wouldn't you need to expose UUID if you want to make use of optimistic locking?

Lucasoato5mo ago

I feel that this is among the good reasons to keep exposing UUID in the API.

deepsun5mo ago

> For UUID v4s, primary key values in B-Tree indexes are problematic.

Wrong. Don't use B-Tree for random indexes, there's HASH index exactly for this:

    CREATE INDEX [index_name] ON [table_name] USING HASH ([column_name]);

henning5mo ago

UUIDs make enumeration attacks harder and also prevent situations where seeing a high valid ID value lets you estimate how much money a private company is earning if they charge based on the object the ID is associated with. If you can sample enough object ID values and see when the IDs were created, you could reverse engineer their ARR chart and see whether they're growing or not which many companies want to avoid.

m0005mo ago

Why not just use UUIDs as a unique column next to a bigint PK?

The power and main purpose of UUIDs is to act as easy to produce, non-conflicting references in distributed settings. Since the scope of TFA is explicitly set to be "monolithic web apps", nothing stops you from having everything work with bigint PKs internally, and just add the UUIDs where you need to provide external references to rows/objects.

mrkeen5mo ago

Yes, if you're in the group of developers who are passionate about db performance, but have ruled out the idea of spreading work out to multiple DBs, then continuing to use sequential IDs is fine.

kaladin_15mo ago

I really hoped the author would discuss alternatives for distributed databases that writes in parallel. Sequential key would be atrocious in such circumstance this could kill the whole gain of distributed database as hotspots would inevitably appear.

I would like to hear from others using, for example, Google Spanner, do you have issues with UUID. I don't for now, most optimizations happen at the Controller level, data transformation can be slow due to validations. Try to keep service logic as straightforward as possible.

burnt-resistor5mo ago

UUID PKs are trying to solve the wrong problem. Integer/serial primary keys are not the problem so long as they're never exposed or usable externally. A critical failure of nearly every RESTful framework is exposing internal database identifiers rather than using encrypted ones preserving relative performance, creation order-preservation, and eliminating unowned key probing.

mkleczek5mo ago

That's really an important deficiency of Postgres.

Hash index is ideally suited for UUIDs but for some reason Postgres hash indexes cannot be unique.

mcsoft5mo ago

Snowflake or sonyflake ids work:

https://en.wikipedia.org/wiki/Snowflake_ID

https://github.com/sony/sonyflake?tab=readme-ov-file

phendrenad25mo ago

You probably don't want integer primary keys, and you probably don't want UUID primary keys. You probably want something in-between, depending on your use case. UUID is one extreme on this spectrum, which tries to solve all of the problems, including ones you might not have.

pil0uOP5mo ago

What's in-between? I posted the article because I'm in the middle of that choice and wanted to generate discussion/contradiction.

So far, people have talked a lot about UUIDs, so I'm genuinely curious about what's in-between.

phendrenad25mo ago

An example would be YouTube's video IDs. It's custom-fit for a purpose (security: no, avoiding the problem where people fish for auspicious YouTube video numbers or something: yes).

Another example would be a function that sorts the numbers 0 through 999 in a seemingly random order (but's actually deterministic), and then repeat that for each block of 1000 with a slight shift. Discourages casual numeric iteration but isn't as complex or cryptographically secure as UUID.

kortex5mo ago

I've been using ULIDs [0] in prod for many years now, and I love them. I just use string encoding, though if I really wanted to squeeze out every last MB, I could do some conversion so it is stored as 16 bytes instead of 26 chars. In practice it's never mattered, and the simplicity of just string IDs everywhere is nice.

Sometimes I have to talk to legacy systems, all my APIs have str IDs, and I encode int IDs as just decimal left padded with leading zeros up to 26 chars. Technically not a compliant ULID but practically speaking, if I see leading `00` I know it's not an actual ULID, since that would be before Nov-2004, and ULID was invented in 2017. The ORM automatically strips the zeros and the query just works.

I'm just kind of over using sequential int IDs for anything bigger than hobby level stuff. Testing/fixturing/QA are just so much easier when you do not have to care about whether an ID happens to already exist.

[0] https://github.com/ulid/spec

sedatk5mo ago

See https://news.ycombinator.com/item?id=46211578

kortex5mo ago

The python implementation I use doesn't do this quirk. It's just timestamp + randomness in Crockford Base32. That's all I need. Sure it doesn't fully "comply with the spec" but frankly the sequence sub-millis quirk was a complete mistake.

scary-size5mo ago

This reminds me about this old gist for generating Firebase-like "push IDs" [1]. Those have some nicer properties.

[1] https://gist.github.com/mikelehen/3596a30bd69384624c11

whateveracct5mo ago

Sometimes its nice for your PK to be uniformly distributed. As a reader, even if it hurts as a writer. For instance, you can easily shard queries and workloads.

> the impact to inserts and retrieval of individual items or ranges of values from the index.

Classic OLTP vs OLAP.

Jeff_Brown5mo ago

If you're really ambitious you'll use two UUIDs for the ID, because for an app in which at least a billion people have at least 327 million random v4 UUIDs, the probability of a collision will be greater than 1%.

p0w3n3d5mo ago

What about newest postgresql support for uuidv7? Anybody did tests? This is what we're heading towards at the moment of writing so I'd like to ask to eventually roll back the decision

cruffle_duffle5mo ago

I never understood the arguments against using using globally unique ids. For example how it somehow messes up indexes. I’m not a CS major but those are typically b-trees are they not? If you have a primary key whose generation is truly random such that each number is equally likely, then that b-tree is always going to be balanced.

Yes there are different flavors of generating them with their own pros and cons, but at the end of the day it’s just so much more elegant than some auto incrementing crap your database creates. But that is just semantic, you can always change the uuid algorithm for future keys. And honestly if you treat the uuid as some opaque entity (which you should), why not just pick the random one?

And I just thought of the argument that “but what if you want to sort the uuid…” say it’s used for a list of stories or something? Well, again… if you treat the uuid as opaque why would you sort it? You should be sorting on some other field like the date field or title or something. UUIDs are opaque, damn it. You don’t sort opaque data. “Well they get clustered weird” say people. Why are you clustering on a random opaque key? If you need certain data to be clustered, then do it on the right key (user_id field did your data was to be clustered by user, say)

Letting the client generate the primary keys is really liberating. Not having to care about PK collisions or leaking information via auto incrementing numbers is great!

In my opinion uuid isn’t used enough!

chuckadams5mo ago

> If you have a primary key whose generation is truly random such that each number is equally likely, then that b-tree is always going to be balanced.

Balanced and uniformly scattered. A random index means fetching a random page for every item. Fine if your access patterns are truly random, but that's rarely the case.

> Why are you clustering on a random opaque key?

InnoDB clusters by the PK if there is one, and that can't be changed (if you don't have a PK, you have some options, but let's assume you have one). MSSQL behaves similarly, but you can override it. If your PK is random, your clustering will be too. In Postgres, you'll just get fragmented indexes, which isn't quite as bad, but still slows down vacuum. Whether that actually becomes a problem is also going to depend on access patterns.

One shouldn't immediately freak out over having a random PK, but should definitely at least be aware of the potential degradation they might cause.

cruffle_duffle5mo ago

I feel, honestly, like while you are indeed correct for most cases it’s absolutely fine to use some flavor of uuid. I feel like the benefits outweighs the cost in most cases.

chuckadams5mo ago

Sure, and for many cases, uuidv7 is that flavor. It just comes with a timestamp, which may or may not be an issue. It isn't an issue for me, which is why I use it myself.

old8man5mo ago

Very useful article, thank you! Many people suggest CUID2, but it is less efficient and is better used for frontend/url encoding. For backend/db, only UUID v7 should be used.

p2detar5mo ago

Another interesting article from Feb-2024 [0] where the cost of inserting a uuid7() and a bigint is basically the same. To me it wasn't quite clear what the problem with the buffer cache is but the author makes it much more clear than OP's article:

> We need to read blocks from the disk when they are not in the PostgreSQL buffer cache. Conveniently, PostgreSQL makes it very easy to inspect the contents of the buffer cache. This is where the big difference between uuidv4 and uuidv7 becomes clear. Because of the lack of data locality in uuidv4 data, the primary key index is consuming a huge amount of the buffer cache in order to support new data being inserted – and this cache space is no longer available for other indexes and tables, and this significantly slows down the entire workload.

0 - https://ardentperf.com/2024/02/03/uuid-benchmark-war

cebert5mo ago

Using UUIDs as primary keys in non-relational databases like DynamoDB is valid and doesn’t raise the concerns mentioned in the article.

andatki5mo ago

Good point that the post should be made clear it’s referring only to my experience with Postgres.

Xorakios5mo ago

I was born in the Vancouver Social Security office #538 area in 1962.

Guess how much I have to defend against attackers trying 538-62-xxxx

Waterluvian5mo ago

Any decent resources with benchmark data on Postgres insertion, indexing, retrieve, etc. for UUID vs. integer based PKs?

maxnoe5mo ago

Would this argument still apply if I need to store a uuidv4 anyway in the table?

And I'd likely want a unique constraint on that?

nesarkvechnep5mo ago

If we embraced REST, as Roy Fielding envisioned it, we wouldn't have this, and all similar, conversations. REST doesn't expose identifier, it only exposes relationships. Identifiers are an implementation details.

raxxorraxor5mo ago

> Do not assume that UUIDs are hard to guess; they should not be used as security capabilities

The issue is that is true for more or less all capability URLs. I wouldn't recommend UUIDs per se here, probably better to just use a random number. I have seen UUIDs for this in practice though and these systems weren't compromised because of that.

I hate the tendency that password recovery flows for example leave the URL valid for 5 minutes. Of course these URLs need to have a limited life time, but mail isn't a real time communication medium. There is very little security benefit from reducing it from 30 minutes to 5 minutes for example. You are not getting "securer" this way.

waynenilsen5mo ago

What kills me is I can’t double click the thing to select it.

mrits5mo ago

This application specific. iTerm2 doesn't break up by - why firefox does.

socketcluster5mo ago

My advice is: Avoid Blanket Statements About Any Technology.

I'm tired of midwit arguments like "Tech X is N% faster than tech Y at performing operation Z. Since your system (sometimes) performs operation Z, it implies that Tech X is the only logical choice in all situations!"

It's an infuriatingly silly argument because operation Z may only represent about 10% of the total CPU usage of the whole system (averaged out)... So what is promoted as a 50% gain may in fact be a 5% gain when you consider it in the grand scheme of things... Negligible. If everyone was looking at this performance 'advantage' rationally; nobody would think it's worth sacrificing important security or operational properties.

I don't know what happened to our industry; we're supposed to be intelligent people but I see developers falling for these obvious logical fallacies over and over.

I remember back in my day, one of the senior engineers was discussing upgrading a python system and stated openly that the new version of the engine was something like 40% slower than the old version but he didn't even have to explain himself why upgrading was still a good decision; everybody in the company knew he was only talking about the code execution speed and everybody knew that this was a small fraction of the total.

Not saying UUIDv7 was a bad choice for Postgres. I'm sure it's fine for a lot of situations but you don't have to start a cult preaching the gospel of The One True UUID to justify your favorite project's decisions.

I do find it kind of sly though how the community decided to make this UUIDv7 instead of creating a new standard for it.

The whole point of UUID was to leverage the properties of randomness to generate unique IDs without requiring coordination. UUIDv7 seems to take things in a philosophically different path. People chose UUID for scalability and simplicity (both of which you get as a result of doing away with the coordination overhead), not for raw performance...

That's the other thing which drives me nuts; people who don't understand the difference between performance and scalability. People foolishly equate scalability with parallelism or concurrency; whereas that's just one aspect of it; scalability is a much broader topic. It's the difference between a theoretical system which is fast given a certain artificially small input size and one which actually performs better as the input size grows.

Lastly; no mention is made about the complex logic which has to take place behind the scenes to generate UUIDv7 IDs... People take it for granted that all computers have a clock which can produce accurate timestamps where all computers in the world are magically in-sync... UUIDv7 is not simple; it's very complicated. It has a lot of additional complexity and dependencies compared to UUIDv4. Just because that complexity is very well hidden from most developers, doesn't mean it's not there and that it's not a dependency... This may become especially obvious as we move to a world of robotics and embedded systems where cheap microchips may not have enough Flash memory to hold the code for the kinds of programs required to compute such elaborate IDs.

thraxil5mo ago

Yep. We have tables that use UUIDv4 that have 60M+ rows and don't have any performance problems with them. Would some queries be faster using something else? Probably, but again, for us it's not close to being a bottleneck. If it becomes a problem at 600M or 6B rows, we'll deal with it then. We'll probably switch to UUIDv7 at some point, but it's not a priority and we'll do some tests on our data first. Does my experience mean you should use UUIDv4? No. Understand your own system and evaluate how the tradeoffs apply to you.

ezekg5mo ago

I have tables that have billions of rows that use UUIDv4 primary keys and I haven't encountered any issues either. I do use UUIDv7 for write-heavy tables, but even then, I got a way bigger performance boost from batching inserts than switching from UUIDv4 to UUIDv7. Issue is way overblown.

p2detar5mo ago

Nice feedback. Out of curiosity, have you made any fine-tuning to psql that greatly improved performance?

thraxil5mo ago

Nope. Out of the box GCP Cloud SQL instance.

kunley5mo ago

Wasn't choosing uuids as ids falling for the deceptive argument in the first place?

christophilus5mo ago

Not really, no. They’re very convenient for certain problems and work really well in general. I’ve never had a performance issue where the problem boiled down to my use of UUID.

danparsonson5mo ago

You never having seen the problem doesn't mean it never happens; I have dealt with a serious performance problem in the past that was due to excessive page fragmentation due to a GUID PK.

To your original point, these are heuristics; there isn't always time to dig into every little architectural decision, so having a set of rules of thumb on hand helps to preempt problems at minimal cognitive cost. "Avoid using a GUID as a primary key if you can" is one of mine.

kunley5mo ago

What are these certain problems, if I may ask?

1 more reply

tom_m5mo ago

It depends. What a narrow minded article, sigh. Bring on the vibe coding! Let's just be mindless.

dizlexic5mo ago

Meh, uuid go brrrrrrrrrrrrrrrrrrrrrrrrr

sneak5mo ago

This is why ULID exists and why I use them in my ext_id columns. For the actual relational IDs internal to the db I use smaller/faster data types.

j / k navigate · click thread line to collapse

449 comments

vintermann5mo ago

A prime example of premature optimization.

ralferoo5mo ago

nonethewiser5mo ago

You still have that problem from organic birthdays and also the problem of needing to change ids to correct birth dates.

jonny_eh5mo ago

To add to that, birthdays can clump, just like any seemingly "random" data.

2 more replies

tyre5mo ago

And then have to enter/handle a non-date through all systems? How do you know if this non-dated person is over the age of minority? Eligible for a pension?

Maybe the answer is to evenly spread the defaults over 365 days.

ralferoo5mo ago

If you don't know their birthday, you can presumably never answer that question in any case.

Again, if you don't know the birth year of someone, you would have no way of knowing their age. I'm not sure that means that the general policy of putting a birthday into their ID number is flawed.

Presumably the governments that choose to embed the date into a national ID number do so because it's more useful for their purposes to do so than just assigning everyone a random number.

1 more reply

PunchyHamster5mo ago

> To me, what your example really shows is the problem with incorrect default values, not a problem with encoding data into a key per se. If they'd chosen a non-date for unknown values, maybe 00 or 99 for day or month components, then the issue you described would disappear.

well, till you run out of numbers for the immigrants that don't have exact birth date

OptionOfT5mo ago

Belgium's national register number is similar:

YY.MM.DD-AAA.BB

In either the AAA or BB component there is something about the gender.

But it does mean that there is a limit of people born per day of a certain gender.

hyperman15mo ago

1 more reply

croes5mo ago

> If they'd chosen a non-date for unknown values, maybe 00 or 99 for day or month components, then the issue you described would disappear

> Norwegian PNs have your birth date (in DDMMYY format) as the first six digits.

You can already feel the disaster rising because sone program expects always the latter.

And it doesn’t fix the problem, it just makes it less likely.

tacone5mo ago

Fantastic real life example. Italian PNs carry also the gender, which something you can change surgically, and you'll eventually run into the issue when operating at scale.

mckirk5mo ago

I'm not sure whether that was intended, but 'operating at scale' actually made me laugh out loud :D

benterix5mo ago

I have to admit an unintended chuckle, too.

cozyman5mo ago

how does one change their gender surgically?

delichon5mo ago

You can't, but since gender isn't defined by anything physical, there's no need.

1 more reply

WalterSlovotsky5mo ago

The preferred method would be gender affirming surgery.

1 more reply

pelheit5mo ago

I think he means attempting to mimic some features of the opposite sex, with procedures like penile inversion and so on.

barrkel5mo ago

Uuid v7 just has a bias in its generation; it isn't carrying information. You're not going to try and extract a timestamp from a uuid.

Random vs time biased uuids are not a decision to shave off ms that you will regret.

Most likely they will be a decision that shaves off seconds (yes, really - especially when you consider locality effects) and you'll regret nothing.

duckerude5mo ago

You can choose to never make use of that property. But it's tempting.

voidnap5mo ago

I made a service using something like a 64 bit wide ULID but there was never a presumption that data is be inserted or updated earlier than the most recent record.

You're gonna have problems "backdating" if you mix up time of insertion with when the event you model actually ocurred. Like id you treat those as the same thing when they aren't.

tobyhinloopen5mo ago

> You're not going to try and extract a timestamp from a uuid.

jandrewrogers5mo ago

> You're not going to try and extract a timestamp from a uuid.

Hyrum's Law suggests that someone will.

ncruces5mo ago

> You're not going to try and extract a timestamp from a uuid.

So, random library: https://pkg.go.dev/github.com/google/uuid#UUID.Time

> Time returns the time in 100s of nanoseconds since 15 Oct 1582 encoded in uuid. The time is only defined for version 1, 2, 6 and 7 UUIDs.

bri3d5mo ago

> You're not going to try and extract a timestamp from a uuid.

What? The first 48 bits of an UUID7 are a UNIX timestamp.

I see what you’re getting at, that ignoring the timestamp aspect makes them “just better UUIDs,” but this ignores security implications and the temptation to partition by high bits (timestamp).

nine_k5mo ago

Nobody forces you to use a real Unix timestamp. BTW the original Unix timestamp is 32 bits (expiring in 2038), and now everyone is switching to 64-bit time_t. What 48 bits?

1 more reply

hnfong5mo ago

scottlamb5mo ago

* If/when you do switch to a distributed database, you don't always need to rekey everything:

** You can spread existing keys across shards via hashing on lookup or reversing bits. Some databases (e.g. DynamoDB) actually force this.

mkleczek5mo ago

The solution is not to come up with yet another artificial identifier but to come up with better means of identification taking into account the fact that things change.

Ukv5mo ago

The identifier is still connected to the user's data, just through the appropriate other fields in the table as opposed to embedded into the identifier itself.

Using a random UUID as primary key does not mean users have to memorize that UUID. In fact in most cases I don't think there's much reason for it to even be exposed to the user at all.

You can still look up their data from their current email or phone number, for instance. Indexes are not limited to the primary key.

> The solution is not to come up with yet another artificial identifier but to come up with better means of identification taking into account the fact that things change.

marcus_holmes5mo ago

> You can still look up their data from their current email or phone number, for instance. Indexes are not limited to the primary key.

This is the key point, I think. Searching is not the same as identifying.

mkleczek5mo ago

> Using a random UUID as primary key does not mean users have to memorize that UUID. In fact in most cases I don't think there's much reason for it to even be exposed to the user at all.

So what is such an identifier for? Is it only for some technical purposes (like replication etc.)?

Why bother with UUID at all then for internal identifiers? Sequence number should be enough.

2 more replies

everforward5mo ago

> The solution is not to come up with yet another artificial identifier but to come up with better means of identification taking into account the fact that things change.

mkleczek5mo ago

If the only reason you need a surrogate key is to introduce indirection in your internal database design then sequence numbers are enough. There is no need to use UUIDs.

The whole discussion is about externally visible identifiers (ie. identifiers visible to external software, potentially used as a persistent long-term reference to your data).

The key issue here is that you try to model identifiable "entities" in your data model, while it is much better to model "captured information".

2 more replies

brettgriffin5mo ago

The surrogate key's purpose isn't to directly store the natural key's information, rather, it's to provide an index to it.

> The solution is not to come up with yet another artificial identifier but to come up with better means of identification taking into account the fact that things change.

mkleczek5mo ago

That's my whole point: either X becomes a "real world artifact" or it is useless as identifier.

1 more reply

PunchyHamster5mo ago

Identifier is just "a piece of common token system can use to operate on same entity.

So an unique ID attached to few other parts to identify current iteration of individual is the best we have, and the best we will get.

vrighter5mo ago

mkleczek5mo ago

Network protocol designers know better and by default embed protocol version number in message format spec.

I guess you can assign 3-4 bits for identifier version number as well.

And yes - for long living data dealing with compatibility issues is inevitable so you have to take that into account from the very beginning.

1 more reply

groundzeros20155mo ago

You can’t design something by trying to anticipate all future changes. things will change and break.

In my personal design sense, I have found keeping away generality actually helps my code last longer (based on more concrete ideas) and easier to change when those days come.

1 more reply

hyperpape5mo ago

Your comment is sufficiently generic that it’s impossible to tell what specific part of the article you’re agreeing with, disagreeing with, or expanding upon.

vintermann5mo ago

I disagree that performance should be a reason to choose running numbers over guids until you absolutely have to.

I think IDs should not carry information. Yes, that also means I think UUIDv7 was wrong to squeeze a creation date into their ID.

Isn't that clear enough?

mcny5mo ago

So, the fact that there is a date in the uuidv7 does not extend any meaning or significance to the record outside of the database. To infer such a relationship where none exists is the error.

3 more replies

hyperpape5mo ago

Those are two unrelated points and the connection between them was unclear in the original post.

hxtk5mo ago

I don't think I should ignore what I already know and intentionally pessimize the first draft in the name of avoiding premature optimization.

barrkel5mo ago

UUID v7 doesn't squeeze creation date in. If you treat it as anything other than a random sequence in your applications, you're just wrong.

1 more reply

anamexis5mo ago

For what it’s worth, it was also completely unclear to me how you were responding to the article itself. It does not discuss natural keys at all.

GuB-425mo ago

I don't think the timestamped UUIDs are "carrying data", it is just a heuristic to improve lookup performance. If the timestamp is wrong, it will just run as slow as the non-timestamped UUID.

moralestapia5mo ago

>and you can use that for load balancing

As long as you're not in China or India around specific years ...

GP's point stands strong.

delecti5mo ago

benterix5mo ago

Your comment is valid but is not related to the article.

spoiler5mo ago

More broadly, this is the ages old surrogate vs natural key discussion, but yes the comment completely misses the point of the article. I can only assume they didn't read it in full!

vintermann5mo ago

The article explicitly argues against the use of GUIDs as primary keys, and I'm arguing for it.

A running number also carries data. Before you know it, someone's relying on the ordering or counting on there not being gaps - or counting the gaps to figure out something they shouldn't.

3 more replies

oncallthrow5mo ago

vintermann5mo ago

danudey5mo ago

Perhaps you can clarify something for me, because I think I'm missing it.

> Norwegian PNs have your birth date (in DDMMYY format) as the first six digits

So presumably the format is DDMMYYXXXXX (for some arbitrary number of X's), where the XXX represents e.g. an automatically incrementing number of some kind?

Which means that if it's DDMMYYXXX then you can only have 1000 people born on DDMMYY, and if it's DDMMYYXXXXX then you can have 100,000 people born on DDMMYY.

So in order for there to be so many such entries in common that people are denied use of their actual birthday, then one of the following must be true:

1. The XXX counter must be extremely small, in order for it to run out as a result of people 'using up' those Jan 1 dates each year

2. The number of people born on Jan 1 or immigrating to Norway without knowledge of their birthday must be colossal

(Happy to be educated, as I must be missing something here)

cycomanic5mo ago

Like the other poster said, this is a problem with default values not encoding the birthday into the personnummer.

guhcampos5mo ago

It's not that bad. Brazilian CPF are 11 numbers and everyone remembers them. You just get use to it =)

sgarland5mo ago

> Permanent identifiers should not carry data.

Did you read the article? He doesn’t recommend natural keys, he recommends integer-based surrogates.

> A prime example of premature optimization.

Disagree. Data is sticky, and PKs especially so. Moreover, if you’re going to spend time optimizing anything early on, it should be your data model.

> Don't make decisions you will regret just to shave off a couple of milliseconds!

Postgres suffers a different but similar problem with its visibility map lookups.

liuliu5mo ago

I read it (and regret it is a waste of my time). Their arguments are:

* integer keys are faster;

* uuidv7 keys are faster;

* if you want obfuscated keys, using integer and do some your own obfuscation (!!!).

Note that in this entire article, the recommendation is not about using natural keys (email address, some composite of user identification etc.), so I am skipping that whole discussion.

sgarland5mo ago

You can hand out chunks of sequential ids from a central coordinator to avoid collision; this is a well-established pattern.

Re: natural keys (or something like it), I was using it as an example of how badly PK choice can impact performance at scale.

1 more reply

PunchyHamster5mo ago

> (with the trade-off, of course, on stronger guessability).

you're not guessing 2^72 bit random number. And if guessing UUID does something in your app, you already fucked up

lukeschlather5mo ago

> Did you read the article? He doesn’t recommend natural keys, he recommends integer-based surrogates.

alerighi5mo ago

1 more reply

PunchyHamster5mo ago

The cause is more just "not having enough bits". UUID is 128 bit. You're not running out even if you use part for timestamp, the random part will be big enough.

Like, it's a valid complaint.. just not for discussion at hand.

Also, we do live in reality and while having entirely random one might be perfect from theory of data, in reality having it be prefixed by date have many advantages performance wise.

> Permanent identifiers should not carry data. This is like the cardinal sin of data management

Traubenfuchs5mo ago

> Norwegian PNs have your birth date

Same with Austrian social security numbers, which, in somes cases, don't contain the persons birth date and in some cases don't contain any existing date at all.

Yet many websites enforce a valid date and pull the persons birthdate from it...

mzi5mo ago

> Even if nothing changes, you can run into trouble. Norwegian PNs have your birth date (in DDMMYY format) as the first six digits. Surely that doesn't change, right?

PRI could of course have been a UUID instead.

oblio5mo ago

> Well, wrong, since although the date doesn't change.

Someone should have told Julius Caesar and Gregory XIII that :-p

scottlamb5mo ago

> Permanent identifiers should not carry data.

[1] https://news.ycombinator.com/item?id=46272985

[2] https://news.ycombinator.com/item?id=46273325

naasking5mo ago

scottlamb5mo ago

1 more reply

patmorgan235mo ago

Insert order or time is information. And if you depend on that information you are going to be really disappointed when back dated records have to be inserted.

scottlamb5mo ago

Right, to ensure your clients don't depend on that information, make the key opaque outside the database through methods such as the ones dfox and formerly_proven suggested, as I said.

maxbond5mo ago

> Permanent identifiers should not carry data.

Do you have the same criticism for serial identifiers? How about hashes? What about the version field in UUIDs?

asah5mo ago

counterpoint: IRL, data values in a system like PostgreSQL are padded to word boundaries so either you're wasting bits or "carrying data."

stickfigure5mo ago

HendrikHensen5mo ago

In Google Cloud Bigtable we had the issue that our domain's primary key was a sequential integer autogenerated by another app. So we just reversed it, and it distributed automatically quite nicely.

bruckie5mo ago

Doing this (reversing the bits of sequential ids) is common guidance inside Google for its distributed datastores like Spanner and Bigtable.

dap5mo ago

It is, although you can have sharded PostgreSQL, in which case I agree with your assessment that you want random PKs to distribute them.

hedora5mo ago

whizzter5mo ago

Even in a distributed database you want increasing (even if not monotonic) keys since the underlying b-tree or whatever will very likely behave badly for entirely random data.

UUIDv7 is very useful for these scenarios since

A: A hash or modulus of the key will be practically random due to the lower bits being random or pseudo-random (ie distributes well between nodes)

B: the first bits are sortable.. thus the underlying storage on each node won't go bananas.

dspillett5mo ago

kortex5mo ago

jakeydus5mo ago

I think they address this in the article when they say that this advice is specific to monolithic applications, but I may be misremembering (I skimmed).

andy_ppp5mo ago

Are you saying a monolith cannot use a distributed database?

jakeydus5mo ago

I'm not making any claims at all, I was just adding context from my recollection of the article that appeared to be missing from the conversation.

Edit: What the article said: > The kinds of web applications I’m thinking of with this post are monolithic web apps, with Postgres as their primary OLTP database.

So you are correct that this does not disqualify distributed databases.

PunchyHamster5mo ago

UUIDv7 is not monotonic tho

nonethewiser5mo ago

>In Postgres random PKs are bad

Why?

nutjob25mo ago

> You want to distribute load across the keyspace so you avoid hot shards.

This is just another case of keys containing information and is not smart.

The obvious solution is to have a field that drives distribution, allowing rebalancing or whatever.

stickfigure5mo ago

This is something you should discuss with the developers at Cockroach Labs, Google Cloud, et al.

As a consumer of these databases we're stuck with them as designed, which means we have to worry about key distribution.

conradludgate5mo ago

100%. You can use rendezvous hashing to determine the shard(s). The hash of a sequence should be randomly distributed as changing the LSB should propagate to 50% change in the output bits.

benterix5mo ago

mort965mo ago

I tend to avoid UUIDv7 and use UUIDv4 because I don't want to leak the creation times of everything.

nbadg5mo ago

I've read people suggest using a UUIDv7 as the primary key and a UUIDv4 as a user-visible one as a remedy.

I'm still not sure either way if I like the idea, but it's certainly not the craziest thing I've ever heard.

thayne5mo ago

But if you are doing that, why not just use an incrementing integer instead of a uuidv7?

1 more reply

throw0101a5mo ago

> I tend to avoid UUIDv7 and use UUIDv4 because I don't want to leak the creation times of everything.

See perhaps "UUIDv47 — UUIDv7-in / UUIDv4-out (SipHash‑masked timestamp)":

* https://github.com/stateless-me/uuidv47

* Sept 2025: https://news.ycombinator.com/item?id=45275973

wongarsu5mo ago

If that kind of stuff is on the able you can also use boring 64bit integer keys and encrypt those (e.g. [1]). Which in the end is just a better thought out version of what the article author did.

UUIDv47 might have a space if you need keys generated on multiple backend servers without synchronization. But it feels very niche to me.

1: https://wiki.postgresql.org/wiki/XTEA_(crypt_64_bits)

cdmckay5mo ago

Out of curiosity, why is it an issue if you leak creation time?

bruce5115mo ago

The issue will be very context specific. In other words to (reasonably) answer the question, we'd have to judge each application individually.

1 more reply

mort965mo ago

2 more replies

natch5mo ago

kreetx5mo ago

E.g, if your service users have timestamp as part of the key and this data is visible to other users, you would know when that account was created. This could be an issue.

saaspirant5mo ago

There was a HN comment about competitors tracking how many new signups are happening and increasing the discounts/sales push based on that. Something like this.

2 more replies

dboreham5mo ago

robertlagrant5mo ago

Depends on the data. If you use a primary key in data about a person that shouldn't include their age (e.g. to remove age-based discrimination) then you are leaking an imperfect proxy to their age.

1 more reply

Bombthecat5mo ago

Admins, early users, founders, CEOs etc etc would have althe lowest creation time...

barrkel5mo ago

You shouldn't generally use PKs as public identifiers, least of all UUIDs, which are pretty user hostile.

mort965mo ago

I really don't see the issue with having a UUID in a URL.

formerly_proven5mo ago

(What do you think Youtube video IDs are?)

enz5mo ago

The problem with this approach is that you now have to manage a secret key/secret for a (maybe) a very long time.

I shared this article a few weeks ago, discussing the problems with this kind of approach: https://notnotp.com/notes/do-not-encrypt-ids/

I believe it can make sense in some situations, but do you really want to implement such crypto-related complexity?

formerly_proven5mo ago

benterix5mo ago

pdimitar5mo ago

> What do you think Youtube video IDs are?

I actually haven no idea. What are they?

(Also what is the format of their `si=...` thing?)

intalentive5mo ago

Can’t recall where I heard this, but I’m pretty sure the si=… is tracking information that associates the link with the user who shared it.

1 more reply

wpollock5mo ago

YouTube video ids are just integers in a base-64 encoding, modified to be URL safe.

1 more reply

Retr0id5mo ago

Why not use AES-128 by default? Your CPU has instructions to accelerate AES-128.

conradfr5mo ago

Can't you just change the starting value of your sequence?

stickfigure5mo ago

In Postgres I often like to use a single sequence for everything. It leaks some information yes but in a busy system it tends to be "obscure enough".

x0x05mo ago

It's not leaking that's the concern. It's that not having the names of objects be easily enumerable is a strongly security-enhancing feature of a system.

stickfigure5mo ago

Also, if most of your endpoints require auth, this is not typically a problem.

It really depends on your application. But yes, that's something to be aware of. If you need some ids to be unguessable, make sure they are not predictable :-)

2 more replies

gwbas1c5mo ago

I work on an application where we encrypt the integer primary key and then use the bytes to generate something that looks like a UUID.

In our case, we don't want database IDs in an API and in URLs. When IDs are sequential, it enables things like dictionary attacks and provides estimates about how many customers we have.

Encrypting a database ID makes it very obvious when someone is trying to scan, because the UUID won't decrypt. We don't even need a database round trip.

puilp05025mo ago

Few questions:

gwbas1c5mo ago

> How do you manage the key for encrypting IDs?

The same way we manage all other secrets in the application. (Summarized below)

---

Secrets (connection strings, encryption keys, ect,) are encrypted in the configuration file as base64 strings. The certificate to read/write secrets are stored in Azure Keyvault.

The startup logic in all applications is something like:

1: Determine the environment (production, qa, dev)

2: Get the appropriate certificate

3: Read the configuration files, including decrypting secrets (such as the primary key encryption keys) from the configuration files

4: Populate the strongly-typed objects that hold the configuration values

5: These objects are dependency-injected to runtime objects

sedatk5mo ago

That sounds quite troublesome if the encryption key is lost, compromised, or rotated for any other reason.

cowboylowrez5mo ago

gwbas1c5mo ago

The key will never be rotated. (There's no reason to.)

We're not worried about key compromises.

If the key is lost, we have much bigger problems.

dotancohen5mo ago

From the fine article:

  > Random values don’t have natural sorting like integers or lexicographic (dictionary) sorting like character strings. UUID v4s do have "byte ordering," but this has no useful meaning for how they’re accessed.

kreetx5mo ago

dotancohen5mo ago

If one needs timestamp ordering, then UUIDv7 is a good alternative.

But the author does not say timestamp ordering, he says ordering. I think he actually means and believes that there is some problem ordering UUIDv4.

kreetx5mo ago

marcosdumay5mo ago

> you use it when you need to generate unique keys in a distributed manner

Just to complement this with a point, but there isn't any mainstream database management system out there that is distributed on the sense that it requires UUIDs to generate its internal keys.

torginus5mo ago

Auto-incrementing keys can work, but what happens when you run out of integers? Also, distributed dbs probably make this hard, and they can't generate a key on client.

anarazel5mo ago

vbezhenar5mo ago

p1necone5mo ago

andatki5mo ago

torginus5mo ago

To be polite, I don't think this article rests on sound technical foundations.

mrinterweb5mo ago

How so?

K0nserv5mo ago

sgarland5mo ago

Yes, and for Postgres, it also causes WAL bloat due to the high likelihood of full page writes.

dotancohen5mo ago

Could you expand on this? I use postgres often and though I could have an LLM explain what you mean, I think I'd learn more to hear it from you. Thank you.

dagss5mo ago

Accessing data in totally random locations can be a performance issue.

Depends on lots of things ofc but this is the concern when people talk about UUID for primary keys being an issue.

hashmush5mo ago

Agree, I did a double take on this too.

Values of the same type can be sorted if a order is defined on the type.

It's also strange to contrast "random values" with "integers". You can generate random integers, and they have a "sorting" (depending on what that means though)

dev_l1x_be5mo ago

sagarm5mo ago

hans_castorp5mo ago

sgarland5mo ago

crest5mo ago

Any fixed sized bitstring has an obvious natural ordering, but since they're allocated randomly they lack the density and locality of sequential allocation.

iainctduncan5mo ago

Counterargument... I do technical diligence so I talk to a lot of companies at points of inflection, and I also talk to lots who are stuck.

Worth thinking about.

jessep5mo ago

iainctduncan5mo ago

Right, but if you start off with uuids and the expectation that you might use them to shard, you'll wind up factoring that into the data model. Retrofitting, as you rightly say, can be much harder.

jryio5mo ago

I also do technical diligence, often times the blocker to sharding is the target company's tenancy model and schema. PK data type is certainly a blocker.

UUIDs buy you flexibility, sure. But if your data model assumes everything lives in one database, the PK type is a sub-probem of your problems.

mrinterweb5mo ago

ivan_gammel5mo ago

sagarm5mo ago

ivan_gammel5mo ago

Yes, sure. My point is, it may never be necessary.

sagarm5mo ago

I think you're right that it won't matter for most companies. But having been at a company with persistent DB performance issues with UUIDv4 keys as a contributing factor, it sucks.

mrkeen5mo ago

If you have too many UUIDs, throw more DBs at the problem. Hopefully you haven't tied yourself to any architectural decisions that would limit you to only using a single database.

sgarland5mo ago

IME, when performance issues become obvious, the devs are in growth mode and have no desire / time to revisit PK choice.

Integer PKs were seen as fine for years - decades, even - before the rise of UUIDs.

can3p5mo ago

> For many business apps, they will never reach 2 billion unique values per table, so this will be adequate for their entire life. I’ve also recommended always using bigint/int8 in other contexts.

I'm sure every dba has a war story that starts with similar decision in the past

xandrius5mo ago

To summarise the article: in PG, prefer using UUIDv7 over UUIDv4 as they have slightly better performance.

If you're using latest version of PG, there is a plugin for it.

That's it.

sbuttgereit5mo ago

You might have missed the big H2 section in the article:

"Recommendation: Stick with sequences, integers, and big integers"

After that then, yes, UUIDv7 over UUIDv4.

"[!NOTE] As of Postgres 18, there is a built in uuidv7() function, however it does not include all of the functionality below."

What those features are and if this extension adds more cruft in PostgreSQL 18 than value, I can't tell. But I expect that the vast majority of users just won't need it any more.

tmountain5mo ago

Sticking with sequences and other integer types will cause problems if you need to shard later.

zwnow5mo ago

1 more reply

sgarland5mo ago

There are plenty of ways to deal with that. You can shard by some other identifier (though I then question your table design), you can assign ranges to each shard, etc.

bigmadshoe5mo ago

I’m really no expert on sharding but if you’re using increasing ints why can’t you just shard on (id % n) or something?

1 more reply

SoftTalker5mo ago

This is mentioned, and in many applications you can safely say you will never need to shard.

1 more reply

hans_castorp5mo ago

With the latest Postgres version (>= 18) you do NOT need a plugin

K0nserv5mo ago

0: https://bsky.app/profile/hugotunius.se/post/3m7wvfokrus2g

1: https://github.com/postgres/postgres/blob/master/src/backend...

qntmfred5mo ago

you may be interested in this postgres extension as well

https://github.com/blitss/typeid-postgres

K0nserv5mo ago

Finished an initial version of mine https://github.com/k0nserv/plid

K0nserv5mo ago

That is indeed interesting.

Mine dedicates 16 bits for the prefix allowing up to 3 characters (a-z alphabet).

grugdev425mo ago

A much simpler solution is to keep your tables as they are (with an integer primary key), but add a non sequential public identifier too.

id => 123, public_id => 202cb962ac59075b964b07152d234b70

There are many ways to generate the public_id. A simple MD5 with a salt works quite well for extremely low effort.

Add a unique constraint on that column (which also indexes it), and you'll be safe and performant for hundreds of millions of rows!

Why do we developers like to overcomplicate things? ;)

Retr0id5mo ago

Per https://news.ycombinator.com/item?id=46273325, if you use a block cipher rather than a hash then you don't even need to store it anywhere.

Denvercoder95mo ago

sgarland5mo ago

For Aurora MySQL, it just makes it worse either way, since there’s no change buffer.

hnfong5mo ago

Ints as pk would be quicker for joins etc though.

grugdev425mo ago

Exactly

mcny5mo ago

Postgresql 18 released in September and has uuidv7

https://www.postgresql.org/docs/current/functions-uuid.html

mexicocitinluez5mo ago

You'll have to rip the ability to generate unique numbers from quite literally anywhere in my app and save them without conflict from my cold, dead hands.

sgarland5mo ago

Sounds like a lot of referential integrity violations.

mexicocitinluez5mo ago

Why would generating a PK ahead of time cause referential integrity violations? Super curious to find out.

sgarland5mo ago

Tbf in Postgres, you can declare FKs to be deferrable, so their existence is checked at transaction commit, rather than at insertion time.

1 more reply

FancyFane5mo ago

Even MySQL benefits from these changes as well. What we're really discussing is random primary key inserts (UUIDv4) vs incrementing primary key inserts (UUIDv6 or v7).

stevefan19995mo ago

Try Snowflake ID then: https://en.wikipedia.org/wiki/Snowflake_ID

gizzlon5mo ago

Leaks time and order, just like uuid v7.

0x4575mo ago

Kinda the point? Ordered PKs give you a better index.

dspillett5mo ago

> Do not assume that UUIDs are hard to guess; they should not be used as security capabilities

Wrt: "it isn't x it is y" form: I'm not an LLM, 'onest guv!

dfox5mo ago

> Creating obfuscated values using integers

And do not even think about using RC4 for that (I've seen that multiple times), because that is completely equivalent to XORing with constant.

hippo225mo ago

jakeydus5mo ago

The author doesn't reference any tests that they themselves ran, but they did link a cybertec article [0] with some benchmarks.

[0] https://www.cybertec-postgresql.com/en/unexpected-downsides-...

aynyc5mo ago

I've seen this type of advice a few times now. Now I'm not a database expert by any stretch of imagination, but I have yet to see UUID as primary key in any of the systems I've touched.

rcxdude5mo ago

Uuids also allow the generation of the ID to seperate from the insertion into the database, which can be useful in distributed systems.

cruffle_duffle5mo ago

ellisv5mo ago

UUIDv7 looks really promising but I'm not likely to redo all of our tables to use it.

dangets5mo ago

You can use the same techniques except with the smaller int64 space - see e.g. Snowflake ID - https://en.wikipedia.org/wiki/Snowflake_ID

andatki5mo ago

reffaelwallen5mo ago

At my company we only use UUIDs as PKs.

Main reason I use it is the German Tank problem: https://en.wikipedia.org/wiki/German_tank_problem

(tl;dr; prevent someone from counting how many records you have in that table)

littlestymaar5mo ago

What stops you from having another uuid field as publicly visible identifier (which is only a concern for a minority of your tables).

This way you avoid most of the issues highlighted in this article, without compromising your confidential data.

jakeydus5mo ago

I'm new to the security side of things; I can understand that leaking any information about the backend is no bueno, but why specifically is table size an issue?

boruto5mo ago

In my old company new joiners are assigned an monotonic number as id in tech. GitHub profile url reflected that.

Someone may or may not have used the pattern to get to know the attrition rate through running a simple script every month))

infragreen5mo ago

This was a great read, thank you for sharing!

andatki5mo ago

Appreciate it!

assbuttbuttass5mo ago

https://cloud.google.com/blog/products/databases/announcing-...

bux935mo ago

Aren't people using (big)ints are primary keys, and using UUIDs as logical keys for import/export, solving portability across different machines?

Sayrus5mo ago

bruce5115mo ago

I'd add that it's also used when data is created in multiple places.

Given the amount of data created on phones and tablets, this affects more situations than first assumed.

nrhrjrjrjtntbt5mo ago

Yes and the DB might be columnular or a distributed KV, sidestepping the index problem.

andatki5mo ago

marcus_holmes5mo ago

I didn't see my primary use case for UUID's covered: sharing identifiers across entities is dangerous.

dimitrisnl5mo ago

Noob question, but why no use ints for PK, and UUIDs for a public_id field?

alerighi5mo ago

torginus5mo ago

You can create hash indexes in Postgres, so the secondary index uuid seems workable:

https://www.postgresql.org/docs/current/hash-index.html

edding45005mo ago

*edit: sorry, misread that. My answer is not valid to your question.

javawizard5mo ago

Edit: just saw your edit, sounds like we're on the same page!

javaunsafe20195mo ago

So We make things hard in the backend because of leaky abstractions? Doesn't make sense imo.

jcims5mo ago

dsego5mo ago

scotty795mo ago

One of the benefits of UUIDs is that you can easily merge data coming from multiple databases. Auto-increments cause collisions.

grim_io5mo ago

The article mentions microservices, which can increase the likelihood of collisions in sequential incremental keys.

One more reason to stay away from microservices, if possible.

bardsore5mo ago

Always try to avoid having two services using the same DB. Only way I'd ever consider sharing a DB is if only one service will ever modify it and all others only read.

grim_io5mo ago

Good luck enforcing that :)

mrkeen5mo ago

The 'collision' is two service classes both trying to use one db.

If you separate them (i.e. microservices) the they no longer try to use one db.

grim_io5mo ago

There is nothing stopping multiple microservices from using the same DB, so of course this will happen in practice.

Sometimes it might even be for a good reason.

impoppy5mo ago

mgoetzke5mo ago

And of course we could come up with many ways to generate our own ids and make them unique, but we have the following requirements.

The lookup performance is not really such a big deal for us. Where it is we can do a projection into a more simple format where applicable.

BartjeD5mo ago

But that also adds complexity.

parpfish5mo ago

Two things I don’t like about big-int indexes:

- if you’re debugging and need to search logs, having a simple uuid string is nice for searching

deepsun5mo ago

https://github.com/VictoriaMetrics/VictoriaLogs/blob/v1.41.0...

jwr5mo ago

"if you use PostgreSQL"

(in the scientific reporting world this would be the perennial "in mice")

hyperpape5mo ago

The thing is, none of us are mice, but many of us use Postgres.

It would be the equivalent of "if you're a middle-aged man" or "you're an American".

P.S. I think some of the considerations may be true for any system that uses B-Tree indexes, but several will be Postgres specific.

orthoxerox5mo ago

If you use UUIDv7, you can partition your table by the key prefix. Then the bulk of your data can be efficiently skipped when applying updates.

andatki5mo ago

Good addition!

kijin5mo ago

The space requirement and index fragmentation issue is nearly the same no matter what kind of relational database you use. Math is math.

Just the other day I delivered significant performance gains to a client by converting ~150 million UUIDv4 PKs to good old BIGINT. They were using a fairly recent version of MariaDB.

splix5mo ago

I think the author means all dbs that fit a single server. Because in distributed dbs you often want to spread the load evenly over multiple servers.

esafak5mo ago

To spell it out: it improves performance by avoiding hot spots.

zelphirkalt5mo ago

if they can live with MariaDB, OK, but I wouldn't choose that in the first place these days. Likely Postgres will also perform better in most scenarios.

kijin5mo ago

Had the requirements been different, UUIDv7 would have worked well, too, because fragmentation is the biggest problem here.

miiiiiike5mo ago

The article is muddled, I wish he'd split it into two. One for UUID4 and another for UUID7.

Most of the data that's going to be keyed with a uuid7 is getting served straight out of Redis anyway.

deathanatos5mo ago

> Are UUIDs secure?

> Misconceptions: UUIDs are secure

> One misconception about UUIDs is that they’re secure. However, the RFC describes that they shouldn’t be considered secure “capabilities.”

> From RFC 41221 Section 6 Security Considerations:

> Do not assume that UUIDs are hard to guess; they should not be used as security capabilities

The additional size argument is pretty weak too; at "millions of rows", a UUID column is consuming an additional ~24 MiB.

hk13375mo ago

My biggest thing for UUIDs is don’t UUID everything. Most things should be okay with just regular integers as PKs.

p1necone5mo ago

andatki5mo ago

Great!

reactordev5mo ago

nrhrjrjrjtntbt5mo ago

Atlassian settles for longer "ARIs" for this (e.g.https://developer.atlassian.com/cloud/guard-detect/developer...) composed of guids which allow a scheme like the Amazon ARN to pass around.

reactordev5mo ago

yeah, the problem for us was the resource id. What id was it? Was it a post? an upload? a workspace? it wasn't nearly as descriptive as we needed it to be.

moomoo115mo ago

I’ll do it regardless because any time I’ve tried to chase optimization early like this, hardware has always evolved faster.

We have all faced issues where we don’t know where the data will ultimate live that’s optimal for our access patterns.

Or we have devices and services doing async operations that need to sync.

I’m not working on mission critical “if this fails there’s a catastrophic event” type shit. It’s just rent seeking SaaS type shit.

Oh no it cost $0.35 extra to make $100. Next year will make more money relative to cost increase.

AtNightWeCode5mo ago

Uuid4 are only 224bits is a bs argument. Such a made up problem.

But a fair point is that one should use a sequential uuid to avoid fragmentation. One that has a time part.

kgeist5mo ago

Some additional cases we encounter quite often where UUIDs help:

- A client used to run our app on-premises and now wants to migrate to the cloud.

- Support engineers want to clone a client’s account into the dev environment to debug issues without corrupting client data.

- A client wants to migrate their account to a different region (from US to EU).

andatki5mo ago

Lucasoato5mo ago

Hi, a question for you folks. What if I don’t like to embed timestamp in uuid as v7 do? This could expose to timing attacks in specific scenarios.

The context is the classic B2B SaaS, but feel free to share your experiences even if it comes from other scenarios!

lwhi5mo ago

Wouldn't you need to expose UUID if you want to make use of optimistic locking?

Lucasoato5mo ago

I feel that this is among the good reasons to keep exposing UUID in the API.

deepsun5mo ago

> For UUID v4s, primary key values in B-Tree indexes are problematic.

Wrong. Don't use B-Tree for random indexes, there's HASH index exactly for this:

    CREATE INDEX [index_name] ON [table_name] USING HASH ([column_name]);

henning5mo ago

m0005mo ago

Why not just use UUIDs as a unique column next to a bigint PK?

mrkeen5mo ago

Yes, if you're in the group of developers who are passionate about db performance, but have ruled out the idea of spreading work out to multiple DBs, then continuing to use sequential IDs is fine.

kaladin_15mo ago

burnt-resistor5mo ago

mkleczek5mo ago

That's really an important deficiency of Postgres.

Hash index is ideally suited for UUIDs but for some reason Postgres hash indexes cannot be unique.

mcsoft5mo ago

Snowflake or sonyflake ids work:

https://en.wikipedia.org/wiki/Snowflake_ID

https://github.com/sony/sonyflake?tab=readme-ov-file

phendrenad25mo ago

pil0uOP5mo ago

What's in-between? I posted the article because I'm in the middle of that choice and wanted to generate discussion/contradiction.

So far, people have talked a lot about UUIDs, so I'm genuinely curious about what's in-between.

phendrenad25mo ago

An example would be YouTube's video IDs. It's custom-fit for a purpose (security: no, avoiding the problem where people fish for auspicious YouTube video numbers or something: yes).

kortex5mo ago

[0] https://github.com/ulid/spec

sedatk5mo ago

See https://news.ycombinator.com/item?id=46211578

kortex5mo ago

scary-size5mo ago

This reminds me about this old gist for generating Firebase-like "push IDs" [1]. Those have some nicer properties.

[1] https://gist.github.com/mikelehen/3596a30bd69384624c11

whateveracct5mo ago

Sometimes its nice for your PK to be uniformly distributed. As a reader, even if it hurts as a writer. For instance, you can easily shard queries and workloads.

> the impact to inserts and retrieval of individual items or ranges of values from the index.

Classic OLTP vs OLAP.

Jeff_Brown5mo ago

p0w3n3d5mo ago

What about newest postgresql support for uuidv7? Anybody did tests? This is what we're heading towards at the moment of writing so I'd like to ask to eventually roll back the decision

cruffle_duffle5mo ago

Letting the client generate the primary keys is really liberating. Not having to care about PK collisions or leaking information via auto incrementing numbers is great!

In my opinion uuid isn’t used enough!

chuckadams5mo ago

> If you have a primary key whose generation is truly random such that each number is equally likely, then that b-tree is always going to be balanced.

Balanced and uniformly scattered. A random index means fetching a random page for every item. Fine if your access patterns are truly random, but that's rarely the case.

> Why are you clustering on a random opaque key?

One shouldn't immediately freak out over having a random PK, but should definitely at least be aware of the potential degradation they might cause.

cruffle_duffle5mo ago

I feel, honestly, like while you are indeed correct for most cases it’s absolutely fine to use some flavor of uuid. I feel like the benefits outweighs the cost in most cases.

chuckadams5mo ago

Sure, and for many cases, uuidv7 is that flavor. It just comes with a timestamp, which may or may not be an issue. It isn't an issue for me, which is why I use it myself.

old8man5mo ago

Very useful article, thank you! Many people suggest CUID2, but it is less efficient and is better used for frontend/url encoding. For backend/db, only UUID v7 should be used.

p2detar5mo ago

0 - https://ardentperf.com/2024/02/03/uuid-benchmark-war

cebert5mo ago

Using UUIDs as primary keys in non-relational databases like DynamoDB is valid and doesn’t raise the concerns mentioned in the article.

andatki5mo ago

Good point that the post should be made clear it’s referring only to my experience with Postgres.

Xorakios5mo ago

I was born in the Vancouver Social Security office #538 area in 1962.

Guess how much I have to defend against attackers trying 538-62-xxxx

Waterluvian5mo ago

Any decent resources with benchmark data on Postgres insertion, indexing, retrieve, etc. for UUID vs. integer based PKs?

maxnoe5mo ago

Would this argument still apply if I need to store a uuidv4 anyway in the table?

And I'd likely want a unique constraint on that?

nesarkvechnep5mo ago

raxxorraxor5mo ago

> Do not assume that UUIDs are hard to guess; they should not be used as security capabilities

waynenilsen5mo ago

What kills me is I can’t double click the thing to select it.

mrits5mo ago

This application specific. iTerm2 doesn't break up by - why firefox does.

socketcluster5mo ago

My advice is: Avoid Blanket Statements About Any Technology.

I don't know what happened to our industry; we're supposed to be intelligent people but I see developers falling for these obvious logical fallacies over and over.

I do find it kind of sly though how the community decided to make this UUIDv7 instead of creating a new standard for it.

thraxil5mo ago

ezekg5mo ago

p2detar5mo ago

Nice feedback. Out of curiosity, have you made any fine-tuning to psql that greatly improved performance?

thraxil5mo ago

Nope. Out of the box GCP Cloud SQL instance.

kunley5mo ago

Wasn't choosing uuids as ids falling for the deceptive argument in the first place?

christophilus5mo ago

Not really, no. They’re very convenient for certain problems and work really well in general. I’ve never had a performance issue where the problem boiled down to my use of UUID.

danparsonson5mo ago

You never having seen the problem doesn't mean it never happens; I have dealt with a serious performance problem in the past that was due to excessive page fragmentation due to a GUID PK.

kunley5mo ago

What are these certain problems, if I may ask?

1 more reply

tom_m5mo ago

It depends. What a narrow minded article, sigh. Bring on the vibe coding! Let's just be mindless.

dizlexic5mo ago

Meh, uuid go brrrrrrrrrrrrrrrrrrrrrrrrr

sneak5mo ago

This is why ULID exists and why I use them in my ext_id columns. For the actual relational IDs internal to the db I use smaller/faster data types.

j / k navigate · click thread line to collapse