Understanding How UUIDs Are Generated

(digitalbunker.dev)

169 points | aryamansharda | 5y ago | 60 comments

Vanderson5y ago

After discovering ULIDs [0] I can't see myself ever using UUIDs again.

ULIDs are sortable (time component), short (26 chars) and nearly human readable, and good enough entropy/randomness for everything I'd ever be working on.

Does anyone have any criticisms of ULIDs? I can't see how they don't take over general-purpose use of unique IDs in the future, except where a stronger guarantee of uniqueness is needed (i.e., a bajillion unique records a second...).

[0] https://github.com/ulid/spec
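
For readers who haven't seen the layout, here is a minimal sketch of what the properties above come from: a 48-bit millisecond timestamp followed by 80 random bits, encoded as 26 Crockford base32 characters. The function names `encode_ulid` and `new_ulid` are made up for illustration and this is not the reference implementation; it also ignores the monotonic mode discussed further down.

```python
import os
import time

# Crockford base32 alphabet as used for ULID output (no I, L, O, U).
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def encode_ulid(timestamp_ms: int, randomness: int) -> str:
    """Pack a 48-bit timestamp and an 80-bit random value into 26 characters."""
    value = (timestamp_ms << 80) | randomness
    chars = []
    for _ in range(26):  # 26 * 5 = 130 bits, enough for the 128-bit value
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))


def new_ulid() -> str:
    """Hypothetical helper: current time in ms plus 10 bytes from the OS RNG."""
    timestamp_ms = time.time_ns() // 1_000_000
    randomness = int.from_bytes(os.urandom(10), "big")
    return encode_ulid(timestamp_ms, randomness)
```

Because the timestamp occupies the most significant bits and the encoding is fixed-width with an alphabet in ascending ASCII order, plain string comparison orders IDs by creation time (to millisecond resolution).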

Lazare5y ago

The concept of ULID is interesting, but the spec is a bit weird[1]. If you want the benefits of ULID, I'd highly suggest checking out KSUIDs:

https://github.com/segmentio/ksuid

https://segment.com/blog/a-brief-history-of-the-uuid/

Same advantages of ULIDs, but I prefer the base62 to the base32 encoding (more compact; no need to bikeshed about upper versus lower case), it's been tested at scale, and the decisions made are sensible.

[1]: Specifically, they try and guarantee absolute monotonicity. The way they do this is that if you ever try and generate more than one ULID per millisecond, you increment the least significant bit of the random component. In other words, we have a key that's basically <timestamp>-<random-int>, and if you generate more than one key per timestamp, you just increment the random number. If the random number would overflow, by the spec, you have to just throw an exception; no wraparound.

There are a lot of issues here. For one thing, none of this can possibly work if you're generating your IDs in a distributed fashion; it assumes a single, central, consistent key generator. For another, our key generator now has state, since it needs to know if any keys have been generated earlier, and if so, what they were. Doable, but...potentially a lot of work depending on your environment. Also, why are we even trying to force strict monotonicity? What does that possibly gain you? Why would we want a spec that, by design, has a chance of sometimes not letting you generate a key?

The whole thing feels like the result of someone really wanting an auto-incrementing primary key, hearing that UUIDs were cool, and trying to make an auto-incrementing primary key that looks like a UUID, ending up without the advantages of either. Of course, you could ignore the spec (and several implementations do), but at this point it's worth asking what you're gaining from ULID. It's a weird feature that basically only works if you don't need it (since realistically, anyone generating many keys per millisecond would of course need to generate the keys in a distributed fashion).
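
To make the statefulness concrete, here is a rough sketch of the monotonic rule as described in this footnote. The class name `MonotonicUlidState` and its injectable `clock` and `entropy` parameters are hypothetical, chosen so the behavior can be exercised deterministically; real libraries may differ, for example in how they treat a clock that moves backwards (this sketch simply draws fresh randomness).

```python
import os
import time

MAX_RANDOM = (1 << 80) - 1  # largest value of the 80-bit random component


class MonotonicUlidState:
    """Remembers the last (timestamp, random) pair, which the rule requires."""

    def __init__(self, clock=None, entropy=None):
        # Both callables are illustrative hooks, not part of any real API.
        self._clock = clock or (lambda: time.time_ns() // 1_000_000)
        self._entropy = entropy or (lambda: int.from_bytes(os.urandom(10), "big"))
        self._last_ms = -1
        self._last_random = 0

    def next(self):
        """Return the next (timestamp_ms, random) pair."""
        now_ms = self._clock()
        if now_ms == self._last_ms:
            # Same millisecond: increment instead of drawing new randomness.
            if self._last_random == MAX_RANDOM:
                raise OverflowError("random component exhausted for this millisecond")
            self._last_random += 1
        else:
            self._last_ms = now_ms
            self._last_random = self._entropy()
        return (self._last_ms, self._last_random)
```

Two processes each holding their own instance get no ordering guarantee relative to each other, which is the distributed-generation objection above.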

Vanderson5y ago

Thanks for the alternative (KSUIDs). I took a quick look at it, and from what I can tell, if you aren't generating thousands of IDs per second (or more) they achieve the exact same result. [0]

I find the argument that ULID won't work under extreme and harsh conditions proof that it's just fine for many of us who simply do not work on systems with that kind of load/requirements.

I appreciate seeing the weaknesses of ULID, as this helps me choose whether or not I can live with them. (which I can)

Again, thanks for the detailed reply, it was very helpful.

[0] https://github.com/segmentio/ksuid/issues/8

Lazare5y ago

Yeah. ULID will work. It's just that there's a lot of small annoyances and quirks, and no real advantages. If you've already adopted it, it's probably not worth changing, but I can't think why you'd select it given a choice for greenfield development.

j-pb5y ago

I don't see how a distributed system is an "extreme condition". If you don't have distributed ID generation, why not use an auto incrementing u64 and call it a day?

inopinatus5y ago

Sadly they wrap around in about 130 years (a 32-bit count of seconds from a 2014 epoch). Maybe Segment don't plan on still existing in 2150? Also, base62 really should give way to base58 for any identifier that an ops person might have to type in a hurry.

Thus the quest for the perfect identifier continues.

NB: generating many unique IDs per millisecond for a long period may be a hallmark of a large distributed system, but even a small application may want this for a brief time e.g. importing bulk customer data.

shezi5y ago

I do not understand your criticism. ULID tries to guarantee monotonicity with any single generator, which is nice if you need it and irrelevant if you don't. The state that needs to be saved for that feature is exactly the last generated ULID (if it's still the same millisecond, increment it; otherwise regenerate). And if you are generating on the order of 2^40 ULIDs per millisecond, you'll have to have a larger id space anyway.

So, for me, your criticism boils down to "this is weird because it has a feature I don't need". Why would you care?

Lazare5y ago

> ULID tries to guarantee monotonicity with any single generator, which is nice if you need it and irrelevant if you don't

The thing is, you don't need it. Nobody needs it. And I know this is mean, but...if you think people need it, you probably shouldn't be writing specs for things like this, because it suggests you haven't really thought about the problem.

Keep in mind:

1. If you want guaranteed monotonicity within a single generator, just increment an integer in a DB column. This is a solved problem!

2. ULID cannot guarantee monotonicity. Instead, the spec does something that will guarantee it if you use the library in an extremely specific way. The second you generate two ULIDs on different systems (or even in different processes on the same system), all bets are off. Which means you really shouldn't rely on that strict monotonicity!

3. But if you can't rely on it, then it serves no purpose. And if it serves no purpose, then you could remove it, and at a stroke it becomes much simpler to implement the spec. As you note, you don't need much state, but any state greatly complicates something like this.

> your criticism boils down to "this is weird because it has a feature I don't need"

It's a feature that literally nobody needs, implemented in a way that doesn't work. Its presence is so bizarre, it raises questions about the entire spec.

pmoriarty5y ago

Speaking of human-readable, I really like ssh's "randomart" visualizations of ssh fingerprints.[1][2][3]

They're much easier for humans to differentiate than the usual long string of hex characters (even 26 characters is too long to reliably compare when a single mismatched character might make all the difference).

Examples of randomart:

  Generating public/private rsa key pair.
  The key fingerprint is:
  05:1e:1e:c1:ac:b9:d1:1c:6a:60:ce:0f:77:6c:78:47 you@i
  The key's randomart image is:
  +--[ RSA 2048]----+
  |       o=.       |
  |    o  o++E      |
  |   + . Ooo.      |
  |    + O B..      |
  |     = *S.       |
  |      o          |
  |                 |
  |                 |
  |                 |
  +-----------------+
  
  Generating public/private dsa key pair.
  The key fingerprint is:
  b6:dd:b7:1f:bc:25:31:d3:12:f4:92:1c:0b:93:5f:4b you@i
  The key's randomart image is:
  +--[ DSA 1024]----+
  |            o.o  |
  |            .= E.|
  |             .B.o|
  |              .= |
  |        S     = .|
  |       . o .  .= |
  |        . . . oo.|
  |             . o+|
  |              .o.|
  +-----------------+

[1] - http://www.dirk-loss.de/sshvis/drunken_bishop.pdf

[2] - https://www.man7.org/linux/man-pages/man1/ssh.1.html

[3] - https://superuser.com/questions/22535/what-is-randomart-prod...

npteljes5y ago

I love randomart. To see the randomart of the host you're connecting to, append this to the ssh command:

  ssh user@host -o VisualHostKey=yes

To see the randomart of your own key, or your known hosts:

  ssh-keygen -lv -f ~/.ssh/mykey
  ssh-keygen -lv -f ~/.ssh/known_hosts

ShorsHammer5y ago

> To see the randomart of the host you're connecting to

put it in a ~/.ssh/config or /etc/ssh/ssh_config

why type this stuff over and over?

inopinatus5y ago

> Does anyone have any criticisms of ULIDs

I'm not a fan of the Crockford encoding, since it supports noncanonical forms that'll silently trash the lexicographic sorting assertion when present, and the exclusion of "U" as some kind of profanity filter is both prissy and ineffective.

base58 seems better, vs my own fat fingers at any rate.
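
A small sketch of the noncanonical-form problem mentioned above, assuming the usual Crockford decoding aliases (O decodes as 0; I and L decode as 1; case is ignored). The `decode` helper is illustrative only and leaves out hyphens and check symbols.

```python
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

# Canonical symbols plus the aliases accepted when decoding.
DECODE = {ch: i for i, ch in enumerate(CROCKFORD)}
DECODE.update({"O": 0, "I": 1, "L": 1})


def decode(text: str) -> int:
    """Decode a Crockford base32 string to an integer (illustrative helper)."""
    value = 0
    for ch in text.upper():
        value = value * 32 + DECODE[ch]  # KeyError for 'U' and other invalid symbols
    return value


# "OLAB" is a noncanonical spelling of "01AB": same value, different string.
ids = ["OLAB", "02AB"]
by_string = sorted(ids)            # compares raw characters
by_value = sorted(ids, key=decode)  # compares decoded numbers
```

Here `by_string` and `by_value` disagree, so lexicographic order only matches numeric order if every stored ID has been normalized to the canonical alphabet first.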

arethuza5y ago

"It excludes the letters I, L, and O to avoid confusion with digits. It also excludes the letter U to reduce the likelihood of accidental obscenity."

Excluding ILO doesn't seem a bad idea - but leaving out U for that particular reason seems downright weird:

https://en.wikipedia.org/wiki/Base32#Crockford's_Base32

inopinatus5y ago

It doesn’t exclude them. Ignore the Wikipedia text, and look at the decoding table instead.

munawwar5y ago

base58 avoids 0 as it looks similar to capital O in some fonts. Seems more reasonable than U.

minitoar5y ago

Maybe you don’t want to leak the time component.

mianas5y ago

This occurred to me after looking at the Wikipedia article [1], and seeing the different "versions" and the rationale about not leaking data in the ID.

In general, if you want to encode info in your IDs, you can do that. Just make sure you want to do that, and that you don't run out of entropy.

I view random UUIDs as a silver-bullet type solution to assigning IDs, without overlap. (8 bytes should be enough to just assign them at random)

1: https://enwp.org/UUID

jepler5y ago

8 bytes will have a non-negligible (say, 1 in 1 million) chance of generating the same identifier twice even if you only generate a few million IDs -- see https://en.wikipedia.org/wiki/Birthday_problem#Probability_t... and look at the "16 hex digits" row. Depending on your use case, this may or may not be a problem.
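
The birthday estimate can be checked numerically. This is a sketch using the standard approximation p ≈ 1 - exp(-n(n-1) / (2 * 2^bits)); the function names are made up for illustration.

```python
import math


def collision_probability(n_ids: int, bits: int) -> float:
    """Approximate chance that at least two of n_ids random bits-wide IDs collide."""
    space = 2.0 ** bits
    return -math.expm1(-n_ids * (n_ids - 1) / (2.0 * space))


def ids_for_probability(p: float, bits: int) -> float:
    """Approximate number of IDs at which the collision chance reaches p."""
    return math.sqrt(2.0 * (2.0 ** bits) * -math.log1p(-p))


# 8 random bytes = 64 bits; how many IDs until a 1-in-a-million collision chance?
n_64 = ids_for_probability(1e-6, 64)
# The same question for the 122 random bits of a version 4 UUID.
n_122 = ids_for_probability(1e-6, 122)
```

With 64 bits the threshold comes out at roughly six million IDs, while with 122 random bits it is on the order of 10^15.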

Vanderson5y ago

I had not considered that. In my use case it's not an issue, but I can see that if the ID value is publicly viewable (ie, in a URL) and can reveal sensitive info via the timestamp... thanks.

chrisandchris5y ago

So you basically mean UUID v6, the missing version of UUID which would be sortable too [1] [2]

[1] https://tools.ietf.org/html/draft-peabody-dispatch-new-uuid-...

[2] http://gh.peabody.io/uuidv6/

skrause5y ago

It's funny that this took so long since multiple people in this thread (including me) mentioned that they implemented their own UUID generation function which encodes the timestamp into the first bits just to create sortable UUIDs.

Vanderson5y ago

It appears that UUID v6 is still a proposal, as of August 2020.

Also, this sounds like a non-starter for UUID v6:

"Like version 1 UUIDs, version 6 UUIDs use a MAC address from a local hardware network interface."

https://uuid.ramsey.dev/en/latest/nonstandard/version6.html

henryfjordan5y ago

The sort-ability of ULIDs is of questionable usefulness. It looks like the generator relies on the clock of the machine generating the ID. If you generate ULIDs across machines, there is no guarantee that there isn't a time-drift. So to rely on that, you'd have to have a single oracle generating the ids. Why not just use an auto-incrementing ID at that point?

scrollaway5y ago

> Why not just use an auto-incrementing ID at that point?

Because sometimes (and by sometimes I mean "surprisingly often"), you don't care about exact sortability but simply want roughly-correct ordering.

Examples for most social media sites:

"Give me the latest 100 tweets/videos/posts for this user" => Millisecond differences won't matter, because users don't tweet/post videos that often.

"Give me the latest 1000 tweets/videos/posts for this search" => Even a significant drift won't matter, because you don't care if things aren't exactly in the correct order, you just care about showing recent stuff.

And at that scale (even long before that scale), having a single oracle for auto-incrementing IDs is a hassle. So this is a nice solution any time you need globally unique IDs, need to support sharding, and your default sort is time-based (or whatever-based if there's another piece of info you want to put in that timestamp portion of the ID, as long as "drift" is a non-issue).

BTW, not just theoretical: I believe this exact reasoning is why twitter came up with time-sortable snowflake IDs.

henryfjordan5y ago

That's fair: if you care about close-enough time-ordering but don't care about strict global causality, then ULIDs would be useful. I suppose if you have a partitionable workload (say user posts go into Kafka keyed on userId) then you can use ULIDs to have guaranteed order within a partition, which is probably enough for many use cases.

Lazare5y ago

One reason is database indexes. If you shove truly random data into a standard B-tree (or whatever), performance tends to suck. If the start of the key lets you roughly sort by time, performance improves.

It's not super uncommon for people to use a normal UUID (usually v1, NOT v4; you need a timestamp), then restructure the fields so the timestamp is in front on save, then flip it back on load; this gives you a "proper" UUID, but (in theory) gives you better performance. See, eg, here: https://www.percona.com/blog/2014/12/19/store-uuid-optimized...
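
A sketch of that restructuring for version 1 UUIDs, assuming the standard field layout (time_low, time_mid, time_hi_and_version, clock sequence, node). The helper names are hypothetical and this is not the exact code from the linked post.

```python
import uuid


def to_ordered_bytes(u: uuid.UUID) -> bytes:
    """On save: put the timestamp fields most-significant first.

    Standard layout:  time_low(4) time_mid(2) time_hi_and_version(2) rest(8)
    Stored layout:    time_hi_and_version(2) time_mid(2) time_low(4) rest(8)
    """
    b = u.bytes
    return b[6:8] + b[4:6] + b[0:4] + b[8:16]


def from_ordered_bytes(data: bytes) -> uuid.UUID:
    """On load: flip the fields back so callers see a normal UUID."""
    return uuid.UUID(bytes=data[4:8] + data[2:4] + data[0:2] + data[8:16])


# Two hand-built v1-style UUIDs; 'later' has the larger timestamp.
earlier = uuid.UUID(fields=(0xFFFFFFFF, 0x0001, 0x1001, 0x80, 0x00, 0x123456789ABC))
later = uuid.UUID(fields=(0x00000000, 0x0002, 0x1001, 0x80, 0x00, 0x123456789ABC))
```

In the standard layout `earlier.bytes` sorts after `later.bytes` because time_low comes first; in the stored layout the order follows the timestamp.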

Now, ULID makes an odd attempt at making keys strictly sortable by time, which 1) doesn't work and 2) is pointless. :-) In that case, you really would be better off with an auto-incrementing ID. But while their implementation is questionable, the idea isn't absurd.

dheera5y ago

> sortable (time component)

You can easily prepend a Unix millisecond timestamp to a UUID if you want time-sortability, e.g.

174e1da9377-d739f4ed-6aa1-4c99-a0bf-3b0b68e9ce77

where 174e1da9377 is a Unix timestamp in milliseconds, written in hex.
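
A minimal sketch of that scheme. The function name `timestamped_id` and the fixed width of 11 hex digits are assumptions made here so that string comparison stays consistent with time order.

```python
import time
import uuid


def timestamped_id(now_ms=None) -> str:
    """Return '<hex ms timestamp>-<uuid4>' (illustrative helper)."""
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000
    # Zero-pad to a fixed width so lexicographic order equals numeric order.
    return f"{now_ms:011x}-{uuid.uuid4()}"
```

IDs created in the same millisecond share a prefix and fall back to the random UUID for their relative order.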

> short (26 chars)

Short == increased possibility of a collision. If you're okay with this possibility, you can just use UUIDv4 and truncate it.

> and nearly human readable

Convert your UUID into base26 or whatever you wish for human readability.

> and good enough entropy/randomness

Subjective, if you are okay with "good enough entropy" you could arguably just use a random string of the number of bits you'd like.

marcus_holmes5y ago

This makes a ton more sense than messing with the UUID itself.

Also, for most applications you can just add a timestamp record and sort by that.

ChrisMarshallNY5y ago

> Does anyone have any criticisms of ULIDs?

Only that they aren’t more widely adopted or supported.

I use UUIDs in things like OBSERVER support[0]. They just need to be unique within the context of the runtime, so there’s no need for anything more.

However, for things like Bluetooth Attributes, I think ULIDs would be a good thing. That ship has probably sailed, though.

[0] https://github.com/RiftValleySoftware/RVS_GeneralObserver#ob...

elcomet5y ago

Well, you might not want your ids to be sortable by time (you don't want it to be possible to see which thing was created before which). So the unsortable nature of UUIDs can be a feature.

asadawadia5y ago

+1.

Combine that with KV stores like badger or rocks that store things in lexicographic order - ULIDs really come in handy to have a random ID but still be able to do a scan in sorted order.

lazulicurio5y ago

I "independently invented" ULIDs for a past work project.

Not a criticism, but something to be aware of, is that SQL Server doesn't sort the uniqueidentifier data type as one would expect. Instead of being ordered by bytes 0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15, uniqueidentifier values are ordered by bytes 10-11-12-13-14-15-8-9-7-6-5-4-3-2-1-0. So you want the timestamp in the last 48 bits instead of the first 48 bits.
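
To illustrate the effect, here is a sketch that emulates the comparison order quoted above as a sort key. The byte order is taken at face value from the comment, with byte 0 placed last so all 16 bytes are covered (an assumption); it has not been checked against SQL Server itself, and `sql_server_sort_key` is a made-up name.

```python
import uuid

# Comparison order of byte indexes, most significant first, per the comment.
# Assumption: byte 0 comes last. Not verified against SQL Server.
SQL_SERVER_BYTE_ORDER = [10, 11, 12, 13, 14, 15, 8, 9, 7, 6, 5, 4, 3, 2, 1, 0]


def sql_server_sort_key(u: uuid.UUID) -> bytes:
    """Rearrange the 16 bytes so plain byte comparison follows the quoted order."""
    b = u.bytes
    return bytes(b[i] for i in SQL_SERVER_BYTE_ORDER)


# Timestamp stored in the last 48 bits; the leading bytes differ the other way.
older = uuid.UUID(bytes=b"\xff" * 10 + (1).to_bytes(6, "big"))
newer = uuid.UUID(bytes=b"\x00" * 10 + (2).to_bytes(6, "big"))
```

Under this key `older` sorts before `newer` even though its leading bytes are larger, which is why the timestamp belongs in the last six bytes under that ordering.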

Vanderson5y ago

Well, this is beyond my ability to grok.

Does this apply to ULIDs?

lazulicurio5y ago

It will, but looking at the SQL Server implementation[1] it looks like they're already aware of that.

[1] https://github.com/rmalayter/ulid-mssql

marcus_holmes5y ago

I never get this. Why are you sorting by ID?

munawwar5y ago

Don't know about the OP, but there are some rarer cases, like DynamoDB, where more indexes means more $; you can avoid creating a timestamp column and a new index by having a sortable id column.

But even without the need to sort by id, one advantage is that sequential ids make it easier for databases to fetch data, as the stored data ends up being more sequential on disk as well.
