Lock down the prefix string now before it’s too late and document it. I see in Go that it’s lowercase ascii, which seems fine except for compound types (like “article-comment”). May be worth looking at allowing a single separator given that many complex projects (and ORMs) can’t avoid them.
The Go implementation has no tests. This is very unit-testable. Add tests goddammit!
For Go, I’d align with Google's UUID implementation, with proper parse functions and an internal byte array instead of strings. Strings are for rendering (and in your case, the prefix). Right now, it looks like the parsing is too permissive, and goes into generation mode if the suffix is empty. And the SplitN+index thing will panic if there are no underscores, no? Anyway, tests will tell.
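For what it's worth, here's a minimal sketch of what a stricter parser could look like (the names are hypothetical, not the typeid-go API; a real version would decode the suffix into a [16]byte the way google/uuid does):

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// TypeID is a hypothetical stricter representation: the prefix stays a
// string, but a real implementation would keep the suffix as decoded
// bytes, mirroring how google/uuid stores a [16]byte internally.
type TypeID struct {
	Prefix string
	Suffix string // base32 suffix; left as a string for brevity
}

// ParseTypeID rejects rather than "helpfully" generating anything:
// a missing underscore, an empty prefix, or an empty suffix are all errors.
func ParseTypeID(s string) (TypeID, error) {
	i := strings.LastIndex(s, "_")
	if i < 0 {
		return TypeID{}, errors.New("typeid: missing underscore separator")
	}
	prefix, suffix := s[:i], s[i+1:]
	if prefix == "" {
		return TypeID{}, errors.New("typeid: empty prefix")
	}
	if suffix == "" {
		return TypeID{}, errors.New("typeid: empty suffix")
	}
	for _, r := range prefix {
		if r < 'a' || r > 'z' {
			return TypeID{}, fmt.Errorf("typeid: invalid prefix character %q", r)
		}
	}
	return TypeID{Prefix: prefix, Suffix: suffix}, nil
}

func main() {
	id, err := ParseTypeID("user_01h455vb4pex5vsknk084sn02q")
	fmt.Println(id.Prefix, err)
	_, err = ParseTypeID("no-underscore")
	fmt.Println(err)
}
```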
As for the actual design decisions, I tried to poke holes but I fold! I think this strikes the sweet spot between the different tradeoffs. Well done!
We have tests for the base32 encoding which is the most complicated part of the implementation (https://github.com/jetpack-io/typeid-go/blob/main/base32/bas...) but your point stands. We'll add a more rigorous test suite (particularly as the number of implementations across different languages grows, and we want to make sure all the implementations are compatible with each other).
Re: prefix, is the concern that I haven't defined the allowed character set as part of the spec?
There is just a single test, which only tests the decoding of a single known value. No encoding test.
Go has infrastructure for benchmarking and fuzzing. Use it!
Also, you took code from https://github.com/oklog/ulid/blob/main/ulid.go which has "Copyright 2016 The Oklog Authors" but this is not mentioned in your base32.go.
I didn’t look into it much but it seems like a great encoding even outside of this project. Predictable length, reasonable density, “double clickable” etc. I’ve been annoyed with both hex and base64 for a while so it’s pretty cool just by itself.
> Re: prefix, is the concern that I haven't defined the allowed character set as part of the spec?
Yeah, the worry is almost entirely “subtle deviations across stacks”, which is usually due to ambiguous specs. It’s so annoying when there’s minor differences, compatibility options etc (like base64 which has another “URL-friendly” encoding - ugh).
It would be great if you added suggestions for compound types (like “article-comment”) to the README, as OP mentioned as well.
Yep. The readme asks people to provide other implementations. Having a test suite would be good for third-party code.
1. We've now implemented pretty thorough testing: https://github.com/jetpack-io/typeid-go/blob/main/typeid_tes...
2. I clarified the prefix in the spec
Thanks for the feedback!
Hey, you’re pretty smart. How about you add them?
If you mean that criticism is only allowed if you are willing to commit labor, I disagree with that. I always welcome critique myself - it may be something that I’ve missed. The maintainers always have the last word. As long as there are no hidden expectations, it’s all good.
1. I don't believe people actually hand type-in these values, so I'm not really concerned about the 'l' vs '1' issue. I do base 32 without `eiou` (vowels) to reduce the likelihood of words (profanity) sneaking in.
2. I add two base-32 characters as a checksum (salted, of course). This prevents having to go look at the datastore when the value is bogus, either by accident or malice. I'm unsure why other implementations don't do this.
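If I understand the scheme right, a sketch could look like the following (the HMAC construction, the salt handling, and all the names here are my assumptions, not the commenter's actual code):

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// Base-32 alphabet with the vowels e, i, o, u dropped, as described above.
const alphabet = "0123456789abcdfghjklmnpqrstvwxyz"

// secret is the "salt"; assumption: kept server-side and private.
var secret = []byte("server-side-salt")

// addCheck appends two base-32 characters (10 bits, taken from the top of
// an HMAC of the id) so bogus values can be rejected before ever touching
// the datastore.
func addCheck(id string) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(id))
	sum := mac.Sum(nil)
	// take 10 bits of the MAC -> two base-32 characters
	v := int(sum[0])<<2 | int(sum[1])>>6
	return id + string(alphabet[v>>5]) + string(alphabet[v&31])
}

// verify recomputes the check characters and compares.
func verify(s string) bool {
	if len(s) < 2 {
		return false
	}
	return addCheck(s[:len(s)-2]) == s
}

func main() {
	id := addCheck("k7f2m9x4")
	fmt.Println(id, verify(id))
}
```

Note that a keyed 10-bit check like this catches accidents reliably, but against deliberate tampering it only reduces the attacker's hit rate to 1 in 1024 per guess.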
We had “analrita” as an autogenerated password that resulted in a complaint many years ago. Might consider adding ‘a’ as an excluded letter.
So adding an excluded letter is not easy.
> either by accident or malice
1. if you don’t believe people hand type these then how else will they accidentally enter an invalid? I suppose copy/paste errors, or if a page renders it as uppercase, though you should just normalize the case if it’s base 32.
2. How does a 2 byte (non-cryptographically secure) checksum help in the case of malice?
Crockford is being cheeky. To make a nice base32 alphabet out of non-confusable alphanumeric characters you only need to exclude O, I, and L. This leaves you with 33 characters still, so you need to remove one more, and it doesn't matter which one you remove, so you might as well pick an arbitrary reason for the last character that gets removed (and it's not the worst reason, if your goal is to use these as user-readable IDs, although obviously it's not even remotely bulletproof).
Guess that's a credit to the general civility of the community.
EDIT: It appears that other people in this thread are freely using profanity, so either your comment was targeted by automation due to the unusual density of banned words, or it's a joke that went over my head :)
- base 58 - Satoshi's/Bitcoin's https://en.wikipedia.org/wiki/Binary-to-text_encoding#Base58
- "base62" - Keybase's saltpack https://github.com/keybase/saltpack
- The famous "Adobe 85" - https://en.wikipedia.org/wiki/Ascii85
- basE91 - https://base91.sourceforge.net
At work we defined several new "bases" for QR codes. IMHO, it is an under-applied area of computer science.
In the end I think we had a couple of characters to spare, and so, sitting by ourselves because everyone else had gone home for the day, we ranked swear words by how offensive they were to prioritize removal of a few extra letters. Then I convinced him that slurs were a bigger problem so we focused on that, which got rid of the letter n, instead of u
tggr is just cute, n**r is an uncomfortable conversation with multiple HR teams (we were B2B)
I'm a bit fuzzy now on what our ultimate character set was, because typically you're talking [a-z][0-9], and there are a lot of symbols you can't use in URLs and some that are difficult to dictate. My recollection is that we eliminated 0, l, and 1, but I think we relied on transcription happening either from all caps or all lowercase. 0o are not a problem. Nor are 1L.
Y is pretty versatile for pissing people off.
eg: google "allinurl:fag site:youtube.com"
Because he’s an American?
Note that people generally do not type in object identifiers, but they do frequently cut-and-paste them between applications and chat/forum interfaces, forward them by email, search for them in log files. Verbal transmission is rare to non-existent. Under these conditions, pronunciation proves irrelevant, and case-insensitivity becomes an impediment, but consistency and paste/break resilience become necessary.
Base 58 offers a bijective encoding that fits these concerns much more effectively and is more compact to boot. Similarly inspired by Stripe, I've been using type-prefixed base58-encoded UUIDs for object identifiers for some years. user_1BzGURpnHGn6oNru84B3Ri etc.
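A sketch of fixed-length base58 for a 128-bit ID (Bitcoin alphabet; padding to exactly 22 characters is my choice, to keep lengths predictable):

```go
package main

import (
	"fmt"
	"math/big"
)

// Bitcoin base58 alphabet: no 0, O, I, or l.
const b58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

// encodeBase58 renders a 128-bit ID as exactly 22 base58 digits
// (58^22 > 2^128), padding with '1' (the zero digit) on the left.
func encodeBase58(id [16]byte) string {
	n := new(big.Int).SetBytes(id[:])
	base := big.NewInt(58)
	mod := new(big.Int)
	out := make([]byte, 22)
	for i := 21; i >= 0; i-- {
		n.DivMod(n, base, mod) // n = n / 58, mod = n % 58
		out[i] = b58[mod.Int64()]
	}
	return string(out)
}

func main() {
	var id [16]byte
	id[15] = 57 // tiny example value
	fmt.Println("user_" + encodeBase58(id))
}
```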
Edit to add: to be fair to Douglas Crockford, his encoding of base 32 was designed two decades ago, when the usage landscape looked quite different.
Ultimately I ended up leaning towards a base32 encoding, because I didn't want to pre-suppose case sensitivity. For example, you might want to use the id as a filename, and you might be in an environment where you're stuck with a case insensitive filesystem.
Note that TypeID is using the Crockford alphabet and always in lowercase – *not* the full rules of Crockford's encoding. There's no hyphens allowed in TypeIDs, nor multiple encodings of the same ID with different variations of the ambiguous characters.
1. a screenshot or a screen share that contains an identifier
2. another device where I can’t easily take an identifier
One feature I do like from crockford32, that base58 lacks, and which also assists transcription from noisy sources, is the check symbol. So much that it is quite unfortunate that this check symbol is optional. In 2023 it's hard to fight the urge to specify a mandatory emoji to encode a check value (caveat engineer: this is not actually a good idea :))
My first choice would be to just use type-prefixed KSUIDs, which gives you 160-bit K-sortable IDs with base62 encoding, which works great unless you need 128-bit IDs for compatibility reasons.
My favorite base-32 encoding is z-base-32, which I find just gentler on the eyes: https://philzimmermann.com/docs/human-oriented-base-32-encod...
The biggest problems with base58 are 1) it works for integers, less so for arbitrary binary data like crypto keys 2) case-sensitivity ISnOtNIcEtOLoOKaT (in my opinion).
z-base32 has some nice ideas, although I don’t really give a damn how these things look except where that has functional/ergonomic consequences, since none of them have real aesthetic value. The beauty of numbers is in their structural properties, not their representations. If we really cared about how it feels I’d suggest using an S/KEY-style word mapping instead to get some poetry out of it.
https://joist-orm.io/docs/advanced/tagged-ids
We'd used `:` as our delimiter, but kinda regretting not using `_` because of the "double-click to copy/paste" aspect...
In theory it'd be really easy to get Joist to take "uuid columns in the db" and turn them into "typeids in the domain model", but probably not something that could be configured/done via userland atm...that'd be a good idea though.
The `t<X>` makes sense; we currently guess a tag name of "FooBarZaz" --> "fbz", but allow the user to override it, so you could hand-assign "t1", "t2", etc. as you added entities to the system.
Abbreviating/base36-ing even the auto-incremented numeric primary key to optimize lengths is neat; we haven't gotten to the size of ids where that is a concern, but it sounds like a luxurious problem to have! :-)
ID generation is usually private to a company's scope, and the IDs rarely need to be "universally unique".
Details matter.
And in my experience most people somehow think a UUID must be stored in its human-friendly hex representation, dashes included. That wastes so much space in the database, on the network, and in memory.
The reality is that it is just like any other engineering situation. Sit down, write down your requirements, and see what, if anything, solves it.
Reading about the advantages of various formats is very helpful in helping you skip learning about certain things the hard way and use somebody else's experience of learning them the hard way instead. From that point of view I recommend at least glancing through them all. Sortability and time-based locality is one that you may not naturally think about, and if you need it, you will appreciate not learning that the hard way four years into a project after you threw that data away and then realizing you needed it. And some UUID formats actually managed to introduce small security issues into themselves (thinking MAC address leak from UUID v1 here), nice to avoid those too.
If you have a use case where there's an existing solution then, hey, great, go ahead and use it. Maybe if anyone ever needs that but in another language they can pull a library there too.
But if not, don't sweat it. The biggest use of UUIDs I personally have, I specified as "just send me a unique string, use a UUID library of your choice if it makes you feel better". I think I've got a unique format per source of data in this system and it's fine.

I don't have volume problems; it's tens of thousands of things per day. I don't have any need to sort on the UUID, and they're not really the "identifier" - they're just a unique token generated for a particular message by the originator of the message so we can detect duplicate arrivals downstream in a heterogeneous system where I can't just defer that task to the queue itself, since we have multiple. I don't even need them to be globally unique, I just need them unique within a rather small shard, and in principle I wouldn't even mind if they got repeated after a certain amount of time (though I left the system enforcing uniqueness across all time anyhow for simplicity).

In this particular case, I do indeed generate my own UUIDs for the stuff I'm originating by just grabbing some stuff from /dev/urandom and encoding it base64, with a size selected such that base64 doesn't end the encoding with ==. Even that's just for aesthetics' sake rather than any actual problem it would cause.
You can't guarantee that this will be globally unique.
Per Wikipedia, the probability of finding a duplicate within 103 trillion version-4 UUIDs is one in a billion.
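That figure checks out with the birthday approximation p ≈ n²/(2·2^122), since a v4 UUID has 122 random bits:

```go
package main

import (
	"fmt"
	"math"
)

// collisionProb applies the birthday approximation p ≈ n^2 / (2 * 2^122)
// for n random 122-bit values (the random bits of a UUIDv4).
func collisionProb(n float64) float64 {
	return n * n / (2 * math.Pow(2, 122))
}

func main() {
	// 103 trillion UUIDs -> roughly a one-in-a-billion chance of a duplicate
	fmt.Printf("%.2g\n", collisionProb(103e12))
}
```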
so-youre-saying-theres-a-chance.gif
ksuid is the best general purpose id generator with sort-able timestamps I've found and has libraries in most languages. UUID v1-7 are wasteful.
The spec has a really weird choice in it - it attempts to guarantee monotonicity. There's a few problems with that, namely that it's almost entirely pointless, it's impossible to actually guarantee it, and the spec mandates it be done in a bad way. Breaking those down:
* There's no particular reason why you'd expect UIDs to give you monotonic ordering, and no particular use case for it either. Being roughly k-sortable is very useful (eg, to allow efficient DB indexes), but strict monotonicity? There's just very few cases where that's needed, and when it is, you'd probably want a dedicated system to coordinate that. Even worse, if you do need it, ULIDs don't actually guarantee it. Which brings us to:
* UIDs are generally most useful in distributed systems, and the more you scale the more that will become required. The second you have a second system generating UIDs, your monotonicity guarantees go out the window, and without work that the various ULID libraries have not done, that's even true the moment you have a second process generating them. And the ULID spec doesn't even try to work around this (nor do all ULID libraries try to guarantee monotonicity even when used in a single-threaded manner), which in turn means you have no idea if any given ULID was generated in a way where monotonicity is being attempted.
* The actual ULID spec says the UID is broken up into a timestamp and a random component, and if you ever generate a second ULID in the same millisecond, you must increment the random component, and if you can't due to overflow, generation must fail. Which is wild; it means you can only generate a random number of ULIDs per millisecond, and there's a chance you can only generate one. Even worse, if you can ever manage to cause a system to generate a ULID for you in the same millisecond as it generates one for your target, you can trivially guess the other ULID! Many use cases for UIDs assume they're effectively unguessable, but this is not true for compliant ULIDs.
The ULID spec is, at best, a bit underbaked. (And as others have noted, I find that Crockford's base32 encoding is a suboptimal choice.)
1: https://github.com/svix/rust-ksuid 2: https://github.com/svix/python-ksuid
People aren’t meant to use uuids as tokens, and they aren’t supposed to use PKs from a DB for this either - but they do. Because UUID v4 is basically crypto random, I think we’ve been getting away with a bunch of security weaknesses that would otherwise be exploited.
With UUID v7 we are back to 32 bits of actually random data. It’s going to require some good educating to teach devs that UUIDs are guessable.
[edit] Looks like I am off base with the guess-ability of the V7 UUID, as the draft recommends CSPRNG for the random bits, and the amount of entropy is at least 74 bits and it is specifically designed to be “unguessable”. It does say “UUID v4” for anything security related, but perhaps that is simply in regard to the time stamp?
By naming them UUIDv4 and UUIDv7, is it going to be this never ending confusion for people to have to remember which one is good for databases and which one good for one time tokens?
Not sure what the backwards compatible solution here is either.
In elixir the function is UUID.uuid4() to generate a v4 UUID.
So we could theoretically scan code for its use I suppose. But all this increases chances of errors.
Yes, because this is what's been happening already with the past versions. It's not just sequential and random, there are also hash-based UUIDs. They shouldn't have sequential (heh) version numbers.
Compare that with the otherwise nice ISO 8601 datetime format (e.g. 2023-06-28T21:47:59+00:00): it requires conversion for file systems that don't allow colons and plus signs.
The UUIDv7 properties are interesting, but it's worth noting that at least one DBMS really doesn't like the K-sortable property: https://cloud.google.com/spanner/docs/schema-design#uuid_pri...
Super useful and I honestly don't wanna go back to a project that doesn't use this pattern.
https://github.com/swyxio/brain/blob/master/R%20-%20Dev%20No...
I have one idea which is perhaps nerdy enough to make the list, but I've never fully fleshed it out: one can encode the nonnegative integers {0, 1, 2, ...} into the finite bitstrings {0, 1}* in a way which preserves ordering.
So if we use hexits for the encoding the idea would be that 0=0, 1=1, ... E=14, then
F00 = 15
F01 = 16
...
F0F = 30
F100 = 31
F101 = 32
...
F1FF = 286
F2000 = 287
so the format is F, which is the overflow sigil, followed by a recursive representation of the length of the coming string, followed by a string of hexits that long.

What if you need 16 hexits? That's where the recursion comes in,
F F00 0123456789ABCDEF
\ \ \
\ \ \----- 16 hexits
\ \
\ \-- the number 15, "there are 15+1 digits to follow"
\ (consisting of overflow, 0+1 digits to follow, and hex 0)
\
\--- overflow sigil
Kind of goofy, but it would allow a bunch of things like "timestamp * 1024 + 10-bit machine ID" etc. without worrying about the size of the numbers involved.

Note that this RFC also supports lexicographic sorting for negative numbers using the 10k complement notation.
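A sketch of the scheme as I read it (my own implementation of the idea above, so treat the edge cases with suspicion): values 0..14 are a single hexit; anything larger becomes F, then a recursive encoding of (digit count - 1), then the offset-adjusted digits.

```go
package main

import (
	"fmt"
	"sort"
)

const hexits = "0123456789ABCDEF"

// encode maps a nonnegative integer to a hex string whose lexicographic
// order matches numeric order: 0..14 are one hexit; n >= 15 becomes
// "F" + encode(k-1) + k payload hexits, where payload = n - offset and
// offset is where k-digit payloads begin (15, 31, 287, ...).
func encode(n uint64) string {
	if n < 15 {
		return string(hexits[n])
	}
	var k uint64 = 1      // digit count of the payload
	var offset uint64 = 15 // smallest value with k payload digits
	var span uint64 = 16   // number of k-digit payloads, 16^k
	for n >= offset+span {
		offset += span
		span *= 16
		k++
	}
	payload := n - offset
	digits := make([]byte, k)
	for i := int(k) - 1; i >= 0; i-- {
		digits[i] = hexits[payload&0xF]
		payload >>= 4
	}
	return "F" + encode(k-1) + string(digits)
}

func main() {
	for _, n := range []uint64{0, 14, 15, 30, 31, 286, 287} {
		fmt.Println(n, encode(n))
	}
	// ordering check: sorting the strings must match numeric order
	var s []string
	for n := uint64(0); n < 5000; n++ {
		s = append(s, encode(n))
	}
	fmt.Println(sort.StringsAreSorted(s))
}
```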
I debugged a system that used the equivalent of time-ordered UUIDs where even though the database was horizontally scaled, the performance was limited by the capacity of one server. It was exactly because the uuids being generated started with similar prefixes, so at any given time, all the current data was going to whichever database node was responsible for that range of the keyspace.
This is better for Postgres, while making it slightly more difficult to interop between the db and the language (the db driver needs to handle custom types, and you need to inject a custom type converter).
And while there are hacks you can do to make storing uuid-alikes as strings less terrible for db engines, if you want the best performance and smallest space consumption (compressed or not) make sure to use native ID types or convert to BINARY/numeric types.
Not implementing this isn’t the end of the world, but it makes refetching individual bits of data in your cache far easier. It’s really nice being able to implement it basically for free.
Example:
|-A-|-|------------B--------------|
NMSPC-9TWN1-HR7SV-MTX00-0H8VP-YCCJZ
A = Namespace, padded to 5 chars. Max 5 chars. Uppercase.
B = Blake3 hashed microtime with a random key.
I like how it folds in a time component, but also that it doesn't reveal the time it was generated.

Here's the snippet: https://gist.github.com/jszym/d3c7907b7b6e916f68205c99e5e489...
So it's got all of the above perks, but it also an actual UUID and fits in a Postgres UUID column.
It's very cool to be able to resolve a UUID to a particular database table and record with almost zero performance overhead (cached table lookup + indexed record select).
The goals are smallness, uniqueness, monotonicity, resistance to enumeration attacks, etc. Not randomness for randomness sake.
My UUIDv7+ can be consumed as a standard UUIDv7. It is not intended to be v8. A program can treat the last 16 bits as random noise if it wants.
For example, the PostgreSQL implementation of TypeID would let you use a "domain type" to define a typeid subtype, thus ensuring that the database itself always checks the validity of the type prefix. An example is here: https://github.com/jetpack-io/typeid-sql/blob/main/example/e...
In Go, we're considering making it easy to define a new Go type that enforces a particular type prefix. If you can do that, then the Go type system would enforce that you are passing the correct type of id.
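Not the actual typeid-go design, but one hypothetical way to get that enforcement with Go generics (all names here are made up):

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Prefixed is implemented by marker types that carry a prefix.
type Prefixed interface{ Prefix() string }

// ID is parameterized by a prefix marker, so ID[UserPrefix] and
// ID[PostPrefix] are distinct, incompatible types to the compiler.
type ID[P Prefixed] struct{ suffix string }

func (id ID[P]) String() string {
	var p P
	return p.Prefix() + "_" + id.suffix
}

// Parse only accepts strings carrying the expected prefix.
func Parse[P Prefixed](s string) (ID[P], error) {
	var p P
	want := p.Prefix() + "_"
	if !strings.HasPrefix(s, want) {
		return ID[P]{}, errors.New("wrong prefix for " + p.Prefix())
	}
	return ID[P]{suffix: strings.TrimPrefix(s, want)}, nil
}

type UserPrefix struct{}

func (UserPrefix) Prefix() string { return "user" }

type PostPrefix struct{}

func (PostPrefix) Prefix() string { return "post" }

func main() {
	u, err := Parse[UserPrefix]("user_01h455vb4pex5vsknk084sn02q")
	fmt.Println(u, err)
	_, err = Parse[UserPrefix]("post_01h455vb4pex5vsknk084sn02q")
	fmt.Println(err) // wrong prefix is rejected at parse time
}
```

With this shape, passing a post id where a user id is expected is a compile error, not a runtime surprise.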
Example:
I have userId=4 and userId=2. Suppose a user can have multiple bank accounts: userId=4 has accountId=5 and accountId=6, with a default account of accountId=5, and userId=2 has one account, accountId=7. I want to send userId=4 some money, so I use the function `sendUserMoneyFromAccount(to: int, from: int)`. This is a bad interface, but these things exist in the wild a lot. Because I want to send userId=4 the money to their default account, I could accidentally call it as `sendUserMoneyFromAccount(4, 7)` and that would work - but if under the hood it wants two accountIds, I've just sent money to accountId=4 rather than to userId=4's default account, accountId=5.
With prefixed ids that indicate type, a function that assumes type differently from the one supplied will not accidentally succeed.
In addition, humans who copy ids will be less likely to mistake them. This is just an ergonomic/human centric typing.
One question: I’ve seen people use both integer IDs (as the primary key) and prefixed GUIDs (for APIs) on the same table. This seems confusing and wasteful. Is there a valid reason to do that, and is it common? Performance when doing joins or something?
EDIT: it sounds like that’s the reason for them to be “K-Sortable”:
> TypeIDs are K-sortable and can be used as the primary key in a database while ensuring good locality. Compare to entirely random global ids, like UUIDv4, that generally suffer from poor database locality.
{timestamp:64, worker_id:48, seq: 16}
where the seq is incremented if the unique id is requested within the same millisecond.

Personally, I rarely find I need ids to be sortable, so I just go with pure randomness.
I also like to split out a part of the random section into a tag that is easier for me to visually scan and match up ids in logs etc.
I call my ID format afids [0]
We might add a warning in the future if you decode/encode something that is not v7, but if it suits your use-case to encode UUIDv4 in this way, go for it. Just keep in mind that you'll lose the locality property.
Although, if stored in a DB, I would probably split the tag and the id in 2 separate columns, because DBMS often have a dedicated efficient UUID field type.
A little client code wrapper can make that transparent.
This property is useful for RDBMS writes, when the UUID is used as a primary key and this locality ensures that fewer slotted pages need to be modified to write the same amount of data.
Is there documentation covering the scenarios on how the order can become out of sync and what the odds are? There's a big difference between "almost" and "always" if we're talking about using this as a database PK.
Let's say I have structs:
type User struct { ID TypeID }
type Post struct { ID TypeID }
How can I ensure the correct type is used in each of the structs?
interface ITypeIdPrefix
{
    static abstract string Prefix { get; }
}

abstract class TypeId<T>
    where T : TypeId<T>, ITypeIdPrefix, new()
{
    public Guid Id { get; private init; }

    public override string ToString() => $"{T.Prefix}_{Id.ToBase32String()}";

    // Override GetHashCode(), Equals(), etc.

    public static bool TryParse(string s, out T? result)
    {
        if (!s.StartsWith(T.Prefix) || !TrySplitStringAndParseBase32ToGuid(s, out var id))
        {
            result = default;
            return false;
        }
        result = new T { Id = id };
        return true;
    }
}

class UserId : TypeId<UserId>, ITypeIdPrefix
{
    public static string Prefix => "user";
}

class PostId : TypeId<PostId>, ITypeIdPrefix
{
    public static string Prefix => "post";
}

What does this mean in more words?
In practice, you often want to select database entries which were inserted near each other in time. Ex: you would like to select the most recent entries or entries within a time frame. Even when selecting entries by other information, entries inserted closer to each other in time are generally more likely to be related.
Fetching entries near each other in memory is faster, so it would be nice to insert entries sequentially in time; we want entries inserted near each other in time to be located near each other in memory.
This is what counter-based indexing does: the database has a counter which increments on each insertion, and the current value becomes the inserted entry's id. But the problem with counters is when the database is distributed and insertions are happening in parallel, and you definitely don't want to sync the counter because that's way too slow.
UUIDv7 combines a sort of counter (Unix time) with a randomly-generated number. The counter bytes are first, so they determine the sort order; but in case 2 entries get inserted at the same time, the randomly-generated number keeps them distinct and totally ordered.
If you don't need either of those, then UUIDv7 is the right choice.