My notes on Gitlab's Postgres schema design (2022) (opens in new tab)

(shekhargulati.com)

488 pointsdaigoba662y ago161 comments

161 comments

> It is generally a good practice to not expose your primary keys to the external world. This is especially important when you use sequential auto-incrementing identifiers with type integer or bigint since they are guessable.

What value would there be in preventing guessing? How would that even be possible if requests have to be authenticated in the first place?

I see this "best practice" advocated often, but to me it reeks of security theater. If an attacker is able to do anything useful with a guessed ID without being authenticated and authorized to do so, then something else has gone horribly, horribly, horribly wrong and that should be the focus of one's energy instead of adding needless complexity to the schema.

The only case I know of where this might be valuable is from a business intelligence standpoint, i.e. you don't want competitors to know how many customers you have. My sympathy for such concerns is quite honestly pretty low, and I highly doubt GitLab cares much about that.

In GitLab's case, I'm reasonably sure the decision to use id + iid is less driven by "we don't want people guessing internal IDs" and more driven by query performance needs.

tetha2y ago

> I see this "best practice" advocated often, but to me it reeks of security theater. If an attacker is able to do anything useful with a guessed ID without being authenticated and authorized to do so, then something else has gone horribly, horribly, horribly wrong and that should be the focus of one's energy instead of adding needless complexity to the schema.

Yes, but the ability to guess IDs can make this security issue horrible, or much much worse.

If you had such a vulnerability and you are exposing the users to UUIDs, now people have to guess UUIDs. Even a determined attacker will have a hard time doing that or they would need secondary sources to get the IDs. You have a data breach, but you most likely have time to address it and then you can assess the amount of data lost.

If you can just <seq 0 10000 | xargs -I ID curl service/ticket/ID> the security issue is instantly elevated onto a whole new level. Suddenly all data is leaked without further effort and we're looking at mandatory report to data protection agencies with a massive loss of data.

To me, this is one of these defense in depth things that should be useless. And it has no effect in many, many cases.

But there is truely horrid software out there that has been popped in exactly the described way.

dijit2y ago

Case in point, a recent security issue Gitlab experienced (CVE-2023-7028; arbitrary password reset by knowing one of the accounts associated mail addresses) was made worse by a feature of gitlab that few people know about; that the "userID" is associated with a meta/internal mail address.

This meant that people could send password resets for any user if they knew their userID. The mail format was like user-1@no-reply.gitlab.com or something.

Since it's a safe bet that "user ID 1" is an admin user, someone weaponised this.

plagiarist2y ago

I've already resolved to never use Gitlab entirely on the basis of that CVE but that makes it worse.

Password resets should just never go to an email that hasn't been deliberately attached to an account by the account's owner, full stop. There should not be a code path where it is possible to send any such thing to arbitrary emails. And redirect emails should never be treated as account emails in any way.

yellowapple2y ago

Even without that auto-incrementing ID, there are plenty of other options for guessing valid email addresses to use with that exploit. For example, if you're able to figure out the format an organization uses for their email addresses (e.g. first.last@company.com), and you're able to figure out who works at that org (via e.g. LinkedIn), then there's a very good chance you can reset passwords for, say, the company's CTO or other likely-highly-privileged users.

That is: this kind of proves my point. Removing autoincrementing IDs from the equation is of minimal benefit when things have already gone horribly horribly wrong like this. It's a little bit more work on the attacker's part, but not by anywhere near enough for such a "mitigation" to be of much practical benefit.

s4i2y ago

It’s mentioned in the article. It’s more to do with business intelligence than security. A simple auto-incrementing ID will reveal how many total records you have in a table and/or their growth rate.

> If you expose the issues table primary key id then when you create an issue in your project it will not start with 1 and you can easily guess how many issues exist in the GitLab.

no-dr-onboard2y ago

Business intelligence isn’t really applicable on the database level with guids..way too many abstraction layers down.

coldtea2y ago

>I see this "best practice" advocated often, but to me it reeks of security theater.

The idea of "security theater" is overplayed. Security can be (and should be) multilayered, it doesn't have to be all or nothing. So that, when they break a layer (say the authentication), they shouldn't automatically gain easy access to the others

>If an attacker is able to do anything useful with a guessed ID without being authenticated and authorized to do so, then something else has gone horribly, horribly, horribly wrong and that should be the focus of one's energy instead of adding needless complexity to the schema.

Sure. But by that time, it's will be game over if you don't also have the other layers in place.

The thing is that you can't anticipate any contigency. Bugs tend to not preannounce themselves, especially tricky nuanced bugs.

But when they do appear, and a user can "do [something] useful with an ID without being authenticated and authorized to do so" you'd be thanking all available Gods that you at least made the IDs not guassable - which would also give them also access to every user account on the system.

yellowapple2y ago

> Security can be (and should be) multilayered, it doesn't have to be all or nothing.

In this case the added layer is one of wet tissue paper, at best. Defense-in-depth is only effective when the different layers are actually somewhat secure in their own right.

It's like trying to argue that running encrypted data through ROT13 is worthwhile because "well it's another layer, right?".

> you'd be thanking all available Gods that you at least made the IDs not guassable - which would also give them also access to every user account on the system.

I wouldn't be thanking any gods, because no matter what those IDs look like, the only responsible thing in such a situation is to assume that an attacker does have access to every user account on the system. Moving from sequential IDs to something "hard" like UUIDs only delays the inevitable - and the extraordinarily narrow window in which that delay is actually relevant ain't worth considering in the grand scheme of things. Moving from sequential IDs to something like usernames ain't even really an improvement at all, but more of a tradeoff; yeah, you make life slightly harder for someone trying to target all users, but you also make life much easier for someone trying to target a specific user (since now the attacker can guess the username directly - say, based on other known accounts - instead of having to iterate through opaque IDs in the hopes of exposing said username).

coldtea2y ago

>I wouldn't be thanking any gods, because no matter what those IDs look like, the only responsible thing in such a situation is to assume that an attacker does have access to every user account on the system. Moving from sequential IDs to something "hard" like UUIDs only delays the inevitable*

Well, there's nothing "inevitable". It's a computer system, not the fullfilment of some prophecy.

You can have an attack vector giving you access to a layer, without guaranteed magic access to other layers.

But even if it "just delays the inevitable", that's a very good thing, as it can be time used to patch the issue.

Not to mention, any kind of cryptography just "delays the inevitable" too. With enough time it can be broken with brute force - might not even take millions of years, as we could get better at quantum computing in the next 50 or 100.

1 more reply

metafunctor2y ago

Bugs happen also in access control. Unguessable IDs make it much harder to exploit some of those bugs. Of course the focus should be on ensuring correct access control in the first place, but unguessable IDs can make the difference between a horrible disaster and a close call.

It's also possible to use auto-incrementing database IDs and encrypt them, if using UUIDs doesn't work for you. With appropriate software layers in place, encrypted IDs work more or less automatically.

lordgrenville2y ago

> The only case where this might be valuable is business intelligence

Nitpick: I would not call this "business intelligence" (which usually refers to internal use of the company's own data) but "competitive intelligence". https://en.wikipedia.org/wiki/Competitive_intelligence

EE84M3i2y ago

See also "German Tank Problem" https://en.m.wikipedia.org/wiki/German_tank_problem

remus2y ago

In general it's a defense-in-depth thing. You definitely shouldn't be relying on it, but as an attacker it just makes your life a bit harder if it's not straightforward to work out object IDs.

For example, imagine you're poking around a system that uses incrementing ints as public identifiers. Immediately, you can make a good guess that there's probably going to be some high privileged users with user_id=1..100 so you can start probing around those accounts. If you used UUIDs or similar then you're not leaking that info.

In gitlabs case this is much less relevant, and it's more fo a cosmetic thing.

serial_dev2y ago

> In gitlabs case this is much less relevant (...)

Why, though? GitLab is often self hosted, so being able to iterate through objects, like users, can be useful for an attacker.

yellowapple2y ago

In my experience self-hosted GitLabs are rarely publicly-accessible in the first place; they're usually behind some sort of VPN.

As for an attacker being able to iterate through users, if that information is supposed to be private, and yet an attacker is getting anything other than a 404, then that's a problem in and of itself and my energy would be better spent fixing that.

1 more reply

remus2y ago

You're right, fair point.

worksonmine2y ago

> What value would there be in preventing guessing?

It prevents enumeration, which may or may not be a problem depending on the data. If you want to build a database of user profiles it's much easier with incremental IDs than UUID.

It is at least a data leak but can be a security issue. Imagine a server doing wrong password correctly returning "invalid username OR password" to prevent enumeration. If you can still crawl all IDs and figure out if someone has an account that way it helps filter out what username and password combinations to try from previous leaks.

Hackers are creative and security is never about any single protection.

yellowapple2y ago

> If you can still crawl all IDs and figure out if someone has an account that way it helps filter out what username and password combinations to try from previous leaks.

Right, but like I suggested above, if you're able to get any response other than a 404 for an ID other than one you're authorized to access, then that in and of itself is a severe issue. So is being able to log in with that ID instead of an actual username.

Hackers are indeed creative, but they ain't wizards. There are countless other things that would need to go horribly horribly wrong for an autoincrementing ID to be useful in an attack, and the lack of autoincrementing IDs doesn't really do much in practice to hinder an attacker once those things have gone horribly, horribly wrong.

I can think of maybe one exception to this, and that's with e-commerce sites providing guest users with URLs to their order/shipping information after checkout. Even this is straightforward to mitigate (e.g. by generating a random token for each order and requiring it as a URL parameter), and is entirely inapplicable to something like GitLab.

worksonmine2y ago

> Right, but like I suggested above, if you're able to get any response other than a 404 for an ID other than one you're authorized to access, then that in and of itself is a severe issue. So is being able to log in with that ID instead of an actual username.

You're missing the point and you're not thinking like a hacker yet. It's not about the ID itself or even private profiles, but the fact that you can build a database of all users with a simple loop. For example your profile here is '/user?id=yellowapple' not '/user?id=1337'.

If it was the latter I could build a list of usernames by testing all IDs. Then I would cross-reference those usernames to previous leaks to know what passwords to test. And hacking an account is not the only use of such an exploit, just extracting all items from a competitors database is enough in some cases. It all depends on the type data and what business value it has. Sometimes an incrementing ID is perfectly fine, but it's more difficult to shard across services so I usually default to UUID anyway except when I really want an incrementing ID.

Most of the time things don't have to go "horribly horribly wrong" to get exploited. It's more common to be many simple unimportant holes cleverly combined.

The username can still always be checked for existence on the sign-up step, and there aren't many ways of protecting from that. But it's easier to rate-limit sign-ups (as one should anyway) than viewing public profiles.

Do you leave your windows open when you leave from home just because the burglar can kick the front door in instead? It's the same principle.

1 more reply

JimBlackwood2y ago

I follow this best practice, there’s a few reasons why I do this. It doesn’t have to do with using a guessed primary ID for some sort of privilege escalation, though. It has more to do with not leaking any company information.

When I worked for an e-commerce company, one of our biggest competitors used an auto-incrementing integer as primary key on their “orders” table. Yeah… You can figure out how this was used. Not very smart by them, extremely useful for my employer. Neither of these will allow security holes or leak customer info/payment info, but you’d still rather not leak this.

tomnipotent2y ago

> extremely useful for my employer.

I've been in these shoes before, and finding this information doesn't help you as an executive or leader make any better decisions than you could have before you had the data. No important decision is going to be swayed by something like this, and any decision that is probably wasn't important.

Knowing how many orders is placed isn't so useful without average order value or items per cart, and the same is true for many other kinds of data gleamed from this method.

JimBlackwood2y ago

That’s not correct. Not every market is the same in it’s dynamics.

Yes, most of the time that information was purely insightful and was simply monitored. However, at some moments it definitely drove important decisions.

1 more reply

kehers2y ago

One good argument I found [^1] about not exposing primary keys is that primary keys may change (during system/db change) and you want to ensure users have a consistent way of accessing data.

[^1]: https://softwareengineering.stackexchange.com/questions/2183...

dalore2y ago

It's also exposes your growth metrics. When using sequential id's one can tell how many users you have, how many users a month you are getting and all sorts of useful stuff that you probably don't want to expose.

It's how the British worked out how many tanks the German army had.

SkyMarshal2y ago

> This is especially important when you use sequential auto-incrementing identifiers with type integer or bigint since they are guessable.

I thought we had long since moved past that to GUIDs or UUIDs for primary keys. Then if you still need some kind of sequential numbering that has meaning in relation to the other fields, make a separate column for that.

sgarland2y ago

Except now people are coming back around, because they’re realizing (as the article mentions) that UUID PKs come with enormous performance costs. In fairness, any non-k-sortable ID will suffer the same fate, but UUIDs are the most common of the bunch.

strzibny2y ago

Whether you think it's a real problem or not, if you want to solve it somehow, I compiled the current options one have in Rails apps:

https://nts.strzibny.name/alternative-bigint-id-identifiers-...

cyberfart2y ago

There are also reasons outside infosec concerns. For example where such PKs would be directly related to your revenue, such as orders in an e-commerce platform. You wouldn't want competitors to have an estimate of your daily volume, that kind of thing.

mnahkies2y ago

It can be really handy for scraping/archiving websites if they're kind enough to use a guessable id

1 more reply

heax2y ago

It really depends but useful knowledege can be derived from this. If user accounts use sequential ids the id 1 is most likely the admin account that is created as first user.

kelnos2y ago

> For example, Github had 128 million public repositories in 2020. Even with 20 issues per repository it will cross the serial range. Also changing the type of the table is expensive.

I expect the majority of those public repositories are forks of other repositories, and those forks only exist so someone could create pull requests against the main repository. As such, they won't ever have any issues, unless someone makes a mistake.

Beyond that, there are probably a lot of small, toy projects that have no issues at all, or at most a few. Quickly-abandoned projects will suffer the same fate.

I suspect that even though there are certainly some projects with hundreds and thousands of issues, the average across all 128M of those repos is likely pretty small, probably keeping things well under the 2B limit.

Having said that, I agree that using a 4-byte type (well, 31-bit, really) for that table is a ticking time bomb for some orgs, github.com included.

zX41ZdbW2y ago

It is still under the limit today with 362,107,148 repositories and 818,516,506 unique issues and pull requests:

https://play.clickhouse.com/play?user=play#U0VMRUNUIHVuaXEoc...

ssalka2y ago

I'm guessing this won't be including issues & PRs from private repos, which could be substantial

zx80802y ago

Elapsed: 12.618 sec, read 7.13 billion rows, 42.77 GB

This is too long, seems the ORDER BY is not set up correctly for the table.

zx80802y ago

Also,

> `repo_name` LowCardinality(String),

This is not a low cardinality:

7133122498 = 7.1B

Don't use low cardinality for such columns!

1 more reply

zX41ZdbW2y ago

This is an ad-hoc query. It does a full scan, processing slightly less than a billion rows per second on a single machine, and finishes in a reasonable time with over 7 billion events on GitHub from 2015. While it does not make sense to optimize this table for my particular query, the fact that it works well for arbitrary queries is worth noting.

nly2y ago

That query took a long time

istvanu2y ago

I'm convinced that GitHub's decision to move away from Rails was partly influenced by a significant flaw in ActiveRecord: its lack of support for composite primary keys. The need for something as basic as PRIMARY KEY(repo_id, issue_id) becomes unnecessarily complex within ActiveRecord, forcing developers to use workarounds that involve a unique key alongside a singular primary key column to meet ActiveRecord's requirements—a less than ideal solution.

Moreover, the use of UUIDs as primary keys, while seemingly a workaround, introduces its own set of problems. Despite adopting UUIDs, the necessity for a unique constraint on the (repo_id, issue_id) pair persists to ensure data integrity, but this significantly increases the database size, leading to substantial overhead. This is a major trade-off with potential repercussions on your application's performance and scalability.

This brings us to a broader architectural concern with Ruby on Rails. Despite its appeal for rapid development cycles, Rails' application-level enforcement of the Model-View-Controller (MVC) pattern, where there is a singular model layer, a singular controller layer, and a singular view layer, is fundamentally flawed. This monolithic approach to MVC will inevitably lead to scalability and maintainability issues as the application grows. The MVC pattern would be more effectively applied within modular or component-based architectures, allowing for better separation of concerns and flexibility. The inherent limitations of Rails, especially in terms of its rigid MVC architecture and database management constraints, are significant barriers for any project beyond the simplest MVPs, and these are critical factors to consider before choosing Rails for more complex applications.

silasb2y ago

Do you have any sources on GitHub moving away from Rails? This is the first that I've heard and my googlefu has returned zero results around this. Just last year they had a blog post around Building GitHub with Ruby and Rails[0] so your remark caught my off guard.

[0]: https://github.blog/2023-04-06-building-github-with-ruby-and...

modderation2y ago

Perhaps too late, but Rails 7.1[1] introduced composite primary key support, and there's been a third-party gem[2] offering the functionality for earlier versions of ActiveRecord.

[1] https://guides.rubyonrails.org/7_1_release_notes.html#compos...

[2] https://github.com/composite-primary-keys/composite_primary_...

rglynn2y ago

Whilst I would agree that a monolith can run into scalability issues, I am not sure your characterisation of Rails as such is proportionate.

To say that Rails' architecture is a "sigificant barrier for any project beyond the simplest MVPs" is rather hyperbolic, and the list of companies running monolithic Rails apps is a testament to that.

On this very topic, I would recommend reading GitLab's own post from 2022 on why they are sticking with a Rails monolith[1].

[1] - https://about.gitlab.com/blog/2022/07/06/why-were-sticking-w...

zachahn2y ago

I can't really comment on GitHub, but Rails supports composite primary keys as of Rails 7.1, the latest released version [1].

About modularity, there are projects like Mongoid which can completely replace ActiveRecord. And there are plugins for the view layer, like "jbuilder" and "haml", and we can bypass the view layer completely by generating/sending data inside controller actions. But fair, I don't know if we can completely replace the view and controller layers.

I know I'm missing your larger point about architecture! I don't have so much to say, but I agree I've definitely worked on some hard-to-maintain systems. I wonder if that's an inevitability of Rails or an inevitability of software systems—though I'm sure there are exceptional codebases out there somewhere!

[1] https://guides.rubyonrails.org/7_1_release_notes.html#compos...

mvdtnz2y ago

Do we know for sure if gitlab cloud uses a multi-tenanted database, or a db per user/customer/org? In my experience products that offer both a self hosted and cloud product tend to prefer a database per customer, as this greatly simplifies the shared parts of the codebase, which can use the same queries regardless of the hosting type.

If they use a db per customer then no one will ever approach those usage limits and if they do they would be better suited to a self hosted solution.

tcnj2y ago

Unless something has substantially changed since I last checked, gitlab.com is essentially self-hosted gitlab ultimate with a few feature flags to enable some marginally different behaviour. That is, it uses one multitennant DB for the whole platform.

karolist2y ago

Not according to [1] where the author said

> This effectively results in two code paths in many parts of your platform: one for the SaaS version, and one for the self-hosted version. Even if the code is physically the same (i.e. you provide some sort of easy to use wrapper for self-hosted installations), you still need to think about the differences.

1. https://yorickpeterse.com/articles/what-it-was-like-working-...

Maxion2y ago

I've toyed with various SaaS designs and multi tenanted databses always come to th forefront of my mind. It seems to simplify the architecture a lot.

rapfaria2y ago

> Having said that, I agree that using a 4-byte type (well, 31-bit, really) for that table is a ticking time bomb for some orgs

A bomb defused in a migration that takes eleven seconds

aeyes2y ago

The migration has to rewrite the whole table, bigint needs 8 bytes so you have to make room for that.

I have done several such primary key migrations on tables with 500M+ records, they took anywhere from 30 to 120 minutes depending on the amount of columns and indexes. If you have foreign keys it can be even longer.

Edit: But there is another option which is logical replication. Change the type on your logical replica, then switch over. This way the downtime can be reduced to minutes.

semiquaver2y ago

In practice the only option that I’ve seen work for very large teams and very large relational databases is online schema change tools like https://github.com/shayonj/pg-osc and https://github.com/github/gh-ost (the latter developed for GitHub’s monolith). It’s just too difficult to model what migrations will cause problems under load. Using a binlog/shadowtable approach for all migrations mostly obviates the problem.

2 more replies

winrid2y ago

This largely depends on the disk. I wouldn't expect that to take 30mins on a modern NVME drive, but of course it depends on table size.

2 more replies

tengbretson2y ago

In JavaScript land, postgres bigints deserialize as strings. Is your application resilient to this? Are your downstream customers ready to handle that sort of schema change?

Running the db migration is the easy part.

Rapzid2y ago

Depends on the lib. Max safe int size is like 9 quadrillion. You can safely deserialize serial bigints to this without ever worrying about hitting that limit in many domains.

1 more reply

jameshart2y ago

11 seconds won't fix all your foreign keys. And all the code written against it that assumes an int type will accommodate the value.

iurisilvio2y ago

Migrating primary keys from int to bigint is feasible. Requires some preparation and custom code, but zero downtime.

I'm managing a big migration following mostly this recipe, with a few tweaks: http://zemanta.github.io/2021/08/25/column-migration-from-in...

FKs, indexes and constraints in general make the process more difficult, but possible. The data migration took some hours in my case, but no need to be fast.

AFAIK GitLab has tooling to run tasks after upgrade to make it work anywhere in a version upgrade.

golergka2y ago

Being two orders of magnitude away from running out of ids is too close for comfort anyway.

justinclift2y ago

> github.com included

Typo?

zetalyrae2y ago

The point about the storage size of UUID columns is unconvincing. 128 bits vs. 64 bits doesn't matter much when the table has five other columns.

A much more salient concern for me is performance. UUIDv4 is widely supported but is completely random, which is not ideal for index performance. UUIDv7[0] is closer to Snowflake[1] and has some temporal locality but is less widely implemented.

There's an orthogonal approach which is using bigserial and encrypting the keys: https://github.com/abevoelker/gfc64

But this means 1) you can't rotate the secret and 2) if it's ever leaked everyone can now Fermi-estimate your table sizes.

Having separate public and internal IDs seems both tedious and sacrifices performance (if the public-facing ID is a UUIDv4).

I think UUIDv7 is the solution that checks the most boxes.

[0]: https://uuid7.com/

[1]: https://en.wikipedia.org/wiki/Snowflake_ID

Merad2y ago

> The point about the storage size of UUID columns is unconvincing. 128 bits vs. 64 bits doesn't matter much when the table has five other columns.

But it's not just the size of that one column, it's also the size of all the places that id is used as a FK and the indexes that may be needed on those FK columns. Think about something like a user id that might be referenced by dozens or even hundreds of FKs throughout your database.

solidsnack90002y ago

...and this has not just a size impact but also a substantial performance impact.

paulddraper2y ago

> 128 bits vs. 64 bits doesn't matter much when the table has five other columns.

!!!!

But those 5 other columns are not indexed.

---

There are three levels of database performance:

1. Indices and data fit in memory.

2. Indices fits in memory, data does not.

3. Neither indices not data fit in memory.

If you can do #1 great, but if you don't have that, fight like a madman for #2.

---

Doubling your index sizes is just makes it harder.

gfody2y ago

think of the primary keys in a database like typedef void* ie it's your fundamental pointer and the size of it will impact every aspect of performance throughout - memory/disk footprint and corresponding throughput bottlenecks, cpu time comparing keys which is what every operation reduces to in the deepest inner-most loops of joins and lookups etc.

when x86-64 cpus were new the performance impact from switching to 64-bit pointers was so bad we had to create x32/ilp32 and the reason .NET still has "prefer 32-bit" as a default even today.

using 128-bit uuids as PKs in a database is an awful mistake

neonsunset2y ago

Prefer 32-bit does nothing for modern .NET targets. This is actually the first time I've heard the term being used in many years, even back in .NET Framework 4.6.x days it wasn't much of a concern - the code would be executed with 64-bit runtime as a default on appropriate hosts.

gfody2y ago

the 32bitpref corflag isn't part of a .net core target since those are always il, it is more properly a runtime concern as it should be

it's still the default in .net as of 4.8.1 (has been since it was introduced in 4.5 roughly coinciding w/java's pointer compression feature which is also still the default today)

1 more reply

s4i2y ago

The v7 isn’t a silver bullet. In many cases you don’t want to leak the creation time of a resource. E.g. you want to upload a video a month before making it public to your audience without them knowing.

solidsnack90002y ago

> There's an orthogonal approach which is using bigserial and encrypting the keys...

Another variant of this approach: https://pgxn.org/dist/permuteseq/

It is also feasible to encrypt the value on display (when placing it in URLs, emails, &c):

https://wiki.postgresql.org/wiki/Pseudo_encrypt

This maintains many of the benefits of sequential indexes and does allow you to change the key. However, if the key is changed, it would break any bookmarks, invalidate anything sent in older emails -- it would have the same effect as renaming everything.

nine_k2y ago

It very much does when you have a ton of FKs (enforced or not) using such a column, and thus indexed and used in many joins. Making it twice as hard for the hot part of an index to fit to RAM is never good for performance, nor for the cloud bill.

If you have a column that is used in many joins, there are performance reasons to make it as compact as possible (but not smaller).

newaccount7g2y ago

If I’ve learned anything in my 7 years of software development it’s that this kind of expertise is just “blah blah blah” that will get you fired. Just make the system work. This amount of trying to anticipate problems will just screw you up. I seriously can’t imagine a situation where knowing this would actually improve the performance noticeably.

canadiantim2y ago

Would it ever make sense to have a uuidv7 as primary key but then anther slug field for a public-id, e.g. one that is shorter and better in a url or even allowing user to customize it?

Horffupolde2y ago

Yes sure but now you have to handle two ids and guaranteeing uniqueness across machines or clusters becomes hard.

vrosas2y ago

That and a uuid is going to be unique across all tables and objects, whereas a slug will only be unique within a certain subset e.g. users within an organization. I’ve seen a production issue IRL where someone (definitely not me) wrote a query fetching objects by slug and forgot to include the ‘AND parent_slug = xxx’

traceroute662y ago

Slight nit-pick, but I would pick up the author on the text vs varchar section.

The author effectively wastes many words trying to prove a non-existent performance difference and then concludes "there is not much performance difference between the two types".

This horse bolted a long time ago. Its not "not much", its "none".

The Postgres Wiki[1] explicitly tells you to use text unless you have a very good reason not to. And indeed the docs themselves[2] tell us that "For many purposes, character varying acts as though it were a domain over text" and further down in the docs in the green Tip box, "There is no performance difference among these three types".

Therefore Gitlab's use of (mostly) text would indicate that they have RTFM and that they have designed their schema for their choice of database (Postgres) instead of attempting to implement some stupid "portable" schema.

[1] https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use... [2] https://www.postgresql.org/docs/current/datatype-character.h...

alex_smart2y ago

>The author effectively wastes many words trying to prove a non-existent performance difference and then concludes "there is not much performance difference between the two types".

They then also show that there is in fact a significant performance difference when you need to migrate your schema to accodomate a change in length of strings being stored. Altering a table to a change a column from varchar(300) to varchar(200) needs to rewrite every single row, where as updating the constraint on a text column is essentially free, just a full table scan to ensure that the existing values satisfy your new constraints.

FTA:

>So, as you can see, the text type with CHECK constraint allows you to evolve the schema easily compared to character varying or varchar(n) when you have length checks.

traceroute662y ago

> They then also show that there is in fact a significant performance difference when you need to migrate your schema to accodomate a change in length of strings being stored.

Which is a pointless demonstration if you RTFM and design your schema correctly, using text, just like the manual and the wiki tells you to.

> the text type with CHECK constraint allows you to evolve the schema easily compared to character varying or varchar(n) when you have length checks.

Which is exactly what the manual tells you ....

"For many purposes, character varying acts as though it were a domain over text"

alex_smart2y ago

And what is wrong with someone figuring out for themselves and explaining to others why some suggestion makes sense logically, rather than just quoting the manual?

1 more reply

exabrial2y ago

Foreign keys are expensive is an oft repeated rarely benched claim. There are tons of ways to do it incorrectly. But in your stack you are always enforcing integrity _somewhere_ anyway. Leveraging the database instead of reimplementing it requires knowledge and experimentation, and it more often than not it will save your bacon.

bluerooibos2y ago

Has anyone written about or noticed the performance differences between Gitlab and GitHub?

They're both Rails-based applications but I find page load times on Gitlab in general to be horrific compared to GitHub.

golergka2y ago

I used Gitlab a few years ago, but then it had severe client-side performance problems on large pull requests. Github isn't ideal with them too, but it manages to be decent.

wdb2y ago

Yeah, good reason to spit up pull requests ;) I do think it improved a lot over the last two years, though

oohffyvfg2y ago

> compared to GitHub.

this is like comparing chrome and other browsers, even chromium based.

chrome and github will employ all tricks in the book, even if they screw you. for example, how many hours of despair I've wasted when manually dissecting a git history on employer github by opening merge diffs, hitting ctrl F, seeing no results and moving to the next... only to find on the 100th diff that deep down the diff lost they hid the most important file because it was more convenient for them (so one team lead could hit some page load metric and get a promotion)

heyoni2y ago

I mean GitHub in general has been pretty reliable minus the two outages they had last year and is usually pretty performant or I wouldn’t use their keyboard shortcuts.

There are some complaints here from a former dev about gitlab that might provide insight into its culture and lack of regard for performance: https://news.ycombinator.com/item?id=39303323

Ps: I do not use gitlab enough to notice performance issues but thought you might appreciate the article

imiric2y ago

> I mean GitHub in general has been pretty reliable minus the two outages they had last year

Huh? GitHub has had major outages practically every other week for a few years now. There are pages of HN threads[1].

There's a reason why githubstatus.com doesn't show historical metrics and uptime percentages: it would make them look incompetent. Many outages aren't even officially reported there.

I do agree that when it's up, performance is typically better than Gitlab's. But describing GH as reliable is delusional.

[1]: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...

heyoni2y ago

Delusional? Anecdotal maybe…I was describing my experience so thanks for elaborating.

I only use it as a code repository. Was it specific services within GitHub that failed a lot?

2 more replies

anoopelias2y ago

More comments on this submission: https://news.ycombinator.com/item?id=39333220

vinnymac2y ago

I always wondered what the purpose of that extra “I” was in the CI variables `CI_PIPELINE_IID` and `CI_MERGE_REQUEST_IID` were for. Always assumed it was a database related choice, but this article confirms it.

gfody2y ago

> 1 quintillion is equal to 1000000000 billions

it is pretty wild that we generally choose between int32 and int64. we really ought to have a 5 byte integer type which would support cardinalities of ~1T

klysm2y ago

Yeah it doesn't make sense to pick something that's not a power of 2 unless you are packing it.

postalrat2y ago

I guess that depends on how the index works.

gfody2y ago

we usually work with some page size anyways, 64 5-byte ints fit nicely into 5 512-bit registers, and ~1T is a pretty flexible cardinality limit after ~2B

azlev2y ago

It's reasonable to not have auto increment id's, but it's not clear to me if there is benefits to have 2 IDs, one internal and one external. This increases the number of columns / indexes, makes you always do a lookup first, and I can't see a security scenario where I would change the internal key without changing the external key. Am I missing something?

Aeolun2y ago

You always have the information at hand anyway when doing anything per project. It’s also more user friendly to have every project’s issues start with 1 instead of starting with two trillion, seven hundred billion, three hundred and five million, sevenhundred and seventeen thousand three hundred twentyfive.

josephg2y ago

> As I discussed in an earlier post[3] when you use Postgres native UUID v4 type instead of bigserial table size grows by 25% and insert rate drops to 25% of bigserial. This is a big difference.

Does anyone know why UUIDv4 is so much worse than bigserial? UUIDs are just 128 bit numbers. Are they super expensive to generate or something? Whats going on here?

AprilArcus2y ago

UUIDv4s are fully random, and btree indices expect "right-leaning" values with a sensible ordering. This makes indexing operations on UUIDv4 columns slow, and was the motivation for the development of UUIDv6 and UUIDv7.

couchand2y ago

I'm curious to learn more about this heuristic and how the database leverages it for indexing. What does right-leaning mean formally and what does analysis of the data structure look like in that context? Do variants like B+ or B* have the same charactersistics?

perrygeo2y ago

The 25% increase in size is true but it's 8 bytes, a small and predictable linear increase per row. Compared to the rest of the data in the row, it's not much to worry about.

The bigger issue is insert rate. Your insert rate is limited by the amount of available RAM in the case of UUIDs. That's not the case for auto-incrementing integers! Integers are correlated with time while UUID4s are random - so they have fundamentally different performance characteristics at scale.

The author cites 25% but I'd caution every reader to take this with a giant grain of salt. At the beginning, for small tables < a few million rows, the insert penalty is almost negligible. If you did benchmarks here, you might conclude there's no practical difference.

As your table grows, specifically as the size of the btree index starts reaching the limits of available memory, postgres can no longer handle the UUID btree entirely in memory and has to resort to swapping pages to disk. An auto-integer type won't have this problem since rows close in time will use the same index page thus doesn't need to hit disk at all under the same load.

Once you reach this scale, The difference in speed is orders of magnitude. It's NOT a steady 25% performance penalty, it's a 25x performance cliff. And the only solution (aside from a schema migration) is to buy more RAM.

stephen1232y ago

I think its because of btrees. Btrees and the pages work better if only the last page is getting lots of writes. Iuids cause lots of un ordered writes leading to page bloat.

barrkel2y ago

Random distribution in the sort order mean the cache locality of a btree is poor - instead of inserts going to the last page, they go all over the place. Locality of batch inserts is also then bad at retrieval time, where related records are looked up randomly later.

So you pay taxes at both insert time and later during selection.

eezing2y ago

We shouldn’t assume that this schema was designed all at once, but rather is the product of evolution. For example, maybe the external_id was added after the initial release in order to support the creation of unique ids in the application layer.

martinald2y ago

Is it just me that thinks in general schema design and development is stuck in the stone ages?

I mainly know dotnet stuff, which does have migrations in EF (I note the point about gitlab not using this kind of thing because of database compatibility). It can point out common data loss while doing them.

However, it still is always quite scary doing migrations, especially bigger ones refactoring something. Throw into this jsonb columns and I feel it is really easy to screw things up and suffer bad data loss.

For example, renaming a column (at least in EF) will result in a column drop and column create on the autogenerated migrations. Why can't I give the compiler/migration tool more context on this easily?

Also the point about external IDs and internal IDs - why can't the database/ORM do this more automatically?

I feel there really hasn't been much progress on this since migration tooling came around 10+ years ago. I know ORMs are leaky abstractions, but I feel everyone reinvents this stuff themselves and every project does these common things a different way.

Are there any tools people use for this?

sjwhevvvvvsj2y ago

One thing I like about hand designing schema is it makes you sit down and make very clear choices about what your data is, how it interrelates, and how you’ll use it. You understand your own goals more clearly.

doctor_eval2y ago

So many people I encounter seem to think it’s the code that’s important when building the back end of an application. You see this when people discussing database schemas start comparing, say, rails to hibernate. But ORMs emphasise code instead of data, which in my experience is a big mistake.

In my experience, getting the data structures right is 99% of the battle. If you get that right, the code that follows is simple and obvious.

For database applications, this means getting the schema right. To this end, I always start with the underlying table structures, and only start coding once I understand how the various tables are going to interact.

Sadly, too many people think of the database as the annoying hoops we jump through in order to store the results of our code. In my world, the code I write is the minimum required to safely manipulate the database; it’s the data that counts.

Some people seem to think I’m weird for starting with the database (and for using plpgsql), but I think it’s actually a superpower.

jupp0r2y ago

This is true for in memory data as well. Object oriented programming is great for some problems, but it's also limiting the way we think about data by putting it close to the code operating on it. ORMs do the same to databases by pretending that rows are objects when that's only one way of modeling your problem.

jimbokun2y ago

It’s part of the dirty little secret of why document databases and other NoSQL systems became popular.

Required even less up front thinking about how to model your data. Throw some blobs of JSON into Mongo or whatever, and worry about the rest later.

thom2y ago

It wasn’t always code first - you mention Hibernate but 15-20 years ago it was entirely feasible to inherit a database schema or design one up front, and then create performant metadata mappings to a usable object model. That sort of practise was tainted by the Bad Enterprise brushes of verbosity and XML in general, and so everyone moved to some flavour of active record. This allowed programmers to write less code and fit it into neater boxes, at a time when there was enormous demand for basic CRUD web apps, but a lot of power and expressiveness was lost.

Somewhat ironically, many modern enterprises have peeled all the way back to SQL for a huge amount of logic anyway, so I don’t think we’re done caring about database schemas quite yet.

1 more reply

sjwhevvvvvsj2y ago

Yup - and you can’t code your way to real scale either. At real scale the game is all about data structures. Code just gets them from A to B.

Or as they say at Google, the job of SWE is “moving protos”.

nly2y ago

Exactly that. Sitting down and thinking about your data structures and APIs before you start writing code seems to be a fading skill.

EvanAnderson2y ago

It absolutely shows in the final product, too.

I wish more companies evaluated the suitability of software based on reviewing the back-end data storage schema. A lot of sins can be hidden in the application layer but many become glaringly obvious when you look at how the data is represented and stored.

timacles2y ago

Theres no right abstraction for it because everyones data is different. From my experience what most developers dont realize is that data is more complex than code. Code is merely the stuff that sits on top of the data, shuffling it around... but designing and handling the data in an efficient way is the real engineering problem.

Any abstraction you could come up with wouldnt fit 90% of the other cases

Atotalnoob2y ago

EF core doesn’t to drop/create for columns in db providers that support renaming columns. It only does it for ones that don’t like MySQL or SQLite

winrid2y ago

Not a silver bullet for every project but the Django ORM largely solves this with its migrations. You define your table classes and it just generates the migrations.

Throw in a type checker and you're in pretty good shape.

Rust also has sqlx which will type check your code against the DB.

dxdm2y ago

I'm assuming this is why you say it's not a silver bullet, but to make it more explicit: the Django ORM will happily generate migrations that will lock crucial tables for long amounts of time and bring down your production application in the process.

You still need to know what SQL the migration will run (take a look at `manage.py sqlmigrate`) and most importantly how your database will apply it.

basil-rash2y ago

Dealing with a bunch of automigrate headaches in the Prisma ORM convinced me to just drop the layer entirely and write plain old SQL everywhere. It’s forced me to learn a bunch of new stuff, but the app runs faster now that I can optimize every query and migrations are much simpler with a single idempotent SQL setup script I can run to provision whatever deployment of the DB I need. I’m sure some problem spaces might benefit from all the additional complexity and abstraction, but the average app certainly can make do without for a long time.

1 more reply

mvdtnz2y ago

An ORM is NEVER the solution, ever ever ever. Repeat after me: ORMs are not the solution to this problem. They work in your little toy apps with 4 customers but they are nothing but pain on real enterprise grade software.

winrid2y ago

also, every company I've worked at used ORMs in some capacity. Sometimes, not always.

Also, I don't really work on any apps with only four customers. either they have almost a million or zero :P Try again. :)

winrid2y ago

I don't think insults are really necessary here.

Also ORMs can be very useful, just don't do dumb stuff, like with any technology.

I use them when appropriate.

cschmatzler2y ago

If you use MySQL, Planetscale’s branching is really amazing. Not using them, but wish I could for that. Gives you a complete diff of what you’re doing, and can also pre-plan migrations and only apply them when you need with their gating.

Merad2y ago

> Also the point about external IDs and internal IDs - why can't the database/ORM do this more automatically?

It has pretty big implications for how your application code interacts with the database. Queries that involve id's will need to perform joins in order to check the external id. Inserts or updates that need to set a foreign key need to perform an extra lookup to map the external id to the correct FK value (whether it's literally a separate query or a CTE/subquery). Those are things that are way outside the realm of what EF can handle automatically, at least as it exists today.

emodendroket2y ago

I think that stuff works about as well as it possibly could. If you think that's painful think about something like DynamoDB where if you didn't really think through the access patterns up front you're in for a world of pain.

rob1372y ago

I found this post very useful. I'm wondering where I could find others like it?

aflukasz2y ago

I recommend Postgres FM podcast, e.g. available as video on Postgres TV yt channel. Good content on its own, and many resources of this kind are linked in the episode notes. I believe one of the authors even helped Gitlab specifically with Postgres performance issues not that long ago.

rglynn2y ago

I like this, although a bit narrower - https://blog.mastermind.dev/indexes-in-postgresql

cosmicradiance2y ago

Here's another - https://zerodha.tech/blog/working-with-postgresql/

sidcool2y ago

Great read! And even better comments here.

firemelt2y ago

so anyone use schema.rb in production? even dhh once campfire use .sql instead schema.rb

j / k navigate · click thread line to collapse

161 comments

yellowapple2y ago

What value would there be in preventing guessing? How would that even be possible if requests have to be authenticated in the first place?

In GitLab's case, I'm reasonably sure the decision to use id + iid is less driven by "we don't want people guessing internal IDs" and more driven by query performance needs.

tetha2y ago

Yes, but the ability to guess IDs can make this security issue horrible, or much much worse.

To me, this is one of these defense in depth things that should be useless. And it has no effect in many, many cases.

But there is truely horrid software out there that has been popped in exactly the described way.

dijit2y ago

This meant that people could send password resets for any user if they knew their userID. The mail format was like user-1@no-reply.gitlab.com or something.

Since it's a safe bet that "user ID 1" is an admin user, someone weaponised this.

plagiarist2y ago

I've already resolved to never use Gitlab entirely on the basis of that CVE but that makes it worse.

yellowapple2y ago

s4i2y ago

> If you expose the issues table primary key id then when you create an issue in your project it will not start with 1 and you can easily guess how many issues exist in the GitLab.

no-dr-onboard2y ago

Business intelligence isn’t really applicable on the database level with guids..way too many abstraction layers down.

coldtea2y ago

>I see this "best practice" advocated often, but to me it reeks of security theater.

Sure. But by that time, it's will be game over if you don't also have the other layers in place.

The thing is that you can't anticipate any contigency. Bugs tend to not preannounce themselves, especially tricky nuanced bugs.

yellowapple2y ago

> Security can be (and should be) multilayered, it doesn't have to be all or nothing.

In this case the added layer is one of wet tissue paper, at best. Defense-in-depth is only effective when the different layers are actually somewhat secure in their own right.

It's like trying to argue that running encrypted data through ROT13 is worthwhile because "well it's another layer, right?".

> you'd be thanking all available Gods that you at least made the IDs not guassable - which would also give them also access to every user account on the system.

coldtea2y ago

Well, there's nothing "inevitable". It's a computer system, not the fullfilment of some prophecy.

You can have an attack vector giving you access to a layer, without guaranteed magic access to other layers.

But even if it "just delays the inevitable", that's a very good thing, as it can be time used to patch the issue.

1 more reply

metafunctor2y ago

lordgrenville2y ago

> The only case where this might be valuable is business intelligence

EE84M3i2y ago

See also "German Tank Problem" https://en.m.wikipedia.org/wiki/German_tank_problem

remus2y ago

In general it's a defense-in-depth thing. You definitely shouldn't be relying on it, but as an attacker it just makes your life a bit harder if it's not straightforward to work out object IDs.

In gitlabs case this is much less relevant, and it's more fo a cosmetic thing.

serial_dev2y ago

> In gitlabs case this is much less relevant (...)

Why, though? GitLab is often self hosted, so being able to iterate through objects, like users, can be useful for an attacker.

yellowapple2y ago

In my experience self-hosted GitLabs are rarely publicly-accessible in the first place; they're usually behind some sort of VPN.

1 more reply

remus2y ago

You're right, fair point.

worksonmine2y ago

> What value would there be in preventing guessing?

It prevents enumeration, which may or may not be a problem depending on the data. If you want to build a database of user profiles it's much easier with incremental IDs than UUID.

Hackers are creative and security is never about any single protection.

yellowapple2y ago

> If you can still crawl all IDs and figure out if someone has an account that way it helps filter out what username and password combinations to try from previous leaks.

worksonmine2y ago

Most of the time things don't have to go "horribly horribly wrong" to get exploited. It's more common to be many simple unimportant holes cleverly combined.

Do you leave your windows open when you leave from home just because the burglar can kick the front door in instead? It's the same principle.

1 more reply

JimBlackwood2y ago

tomnipotent2y ago

> extremely useful for my employer.

Knowing how many orders is placed isn't so useful without average order value or items per cart, and the same is true for many other kinds of data gleamed from this method.

JimBlackwood2y ago

That’s not correct. Not every market is the same in it’s dynamics.

Yes, most of the time that information was purely insightful and was simply monitored. However, at some moments it definitely drove important decisions.

1 more reply

kehers2y ago

One good argument I found [^1] about not exposing primary keys is that primary keys may change (during system/db change) and you want to ensure users have a consistent way of accessing data.

[^1]: https://softwareengineering.stackexchange.com/questions/2183...

dalore2y ago

It's how the British worked out how many tanks the German army had.

SkyMarshal2y ago

> This is especially important when you use sequential auto-incrementing identifiers with type integer or bigint since they are guessable.

sgarland2y ago

strzibny2y ago

Whether you think it's a real problem or not, if you want to solve it somehow, I compiled the current options one have in Rails apps:

https://nts.strzibny.name/alternative-bigint-id-identifiers-...

cyberfart2y ago

mnahkies2y ago

It can be really handy for scraping/archiving websites if they're kind enough to use a guessable id

1 more reply

heax2y ago

It really depends but useful knowledege can be derived from this. If user accounts use sequential ids the id 1 is most likely the admin account that is created as first user.

kelnos2y ago

> For example, Github had 128 million public repositories in 2020. Even with 20 issues per repository it will cross the serial range. Also changing the type of the table is expensive.

Beyond that, there are probably a lot of small, toy projects that have no issues at all, or at most a few. Quickly-abandoned projects will suffer the same fate.

Having said that, I agree that using a 4-byte type (well, 31-bit, really) for that table is a ticking time bomb for some orgs, github.com included.

zX41ZdbW2y ago

It is still under the limit today with 362,107,148 repositories and 818,516,506 unique issues and pull requests:

https://play.clickhouse.com/play?user=play#U0VMRUNUIHVuaXEoc...

ssalka2y ago

I'm guessing this won't be including issues & PRs from private repos, which could be substantial

zx80802y ago

Elapsed: 12.618 sec, read 7.13 billion rows, 42.77 GB

This is too long, seems the ORDER BY is not set up correctly for the table.

zx80802y ago

Also,

> `repo_name` LowCardinality(String),

This is not a low cardinality:

7133122498 = 7.1B

Don't use low cardinality for such columns!

1 more reply

zX41ZdbW2y ago

nly2y ago

That query took a long time

istvanu2y ago

silasb2y ago

[0]: https://github.blog/2023-04-06-building-github-with-ruby-and...

modderation2y ago

Perhaps too late, but Rails 7.1[1] introduced composite primary key support, and there's been a third-party gem[2] offering the functionality for earlier versions of ActiveRecord.

[1] https://guides.rubyonrails.org/7_1_release_notes.html#compos...

[2] https://github.com/composite-primary-keys/composite_primary_...

rglynn2y ago

Whilst I would agree that a monolith can run into scalability issues, I am not sure your characterisation of Rails as such is proportionate.

On this very topic, I would recommend reading GitLab's own post from 2022 on why they are sticking with a Rails monolith[1].

[1] - https://about.gitlab.com/blog/2022/07/06/why-were-sticking-w...

zachahn2y ago

I can't really comment on GitHub, but Rails supports composite primary keys as of Rails 7.1, the latest released version [1].

[1] https://guides.rubyonrails.org/7_1_release_notes.html#compos...

mvdtnz2y ago

If they use a db per customer then no one will ever approach those usage limits and if they do they would be better suited to a self hosted solution.

tcnj2y ago

karolist2y ago

Not according to [1] where the author said

1. https://yorickpeterse.com/articles/what-it-was-like-working-...

Maxion2y ago

I've toyed with various SaaS designs and multi tenanted databses always come to th forefront of my mind. It seems to simplify the architecture a lot.

rapfaria2y ago

> Having said that, I agree that using a 4-byte type (well, 31-bit, really) for that table is a ticking time bomb for some orgs

A bomb defused in a migration that takes eleven seconds

aeyes2y ago

The migration has to rewrite the whole table, bigint needs 8 bytes so you have to make room for that.

Edit: But there is another option which is logical replication. Change the type on your logical replica, then switch over. This way the downtime can be reduced to minutes.

semiquaver2y ago

2 more replies

winrid2y ago

This largely depends on the disk. I wouldn't expect that to take 30mins on a modern NVME drive, but of course it depends on table size.

2 more replies

tengbretson2y ago

In JavaScript land, postgres bigints deserialize as strings. Is your application resilient to this? Are your downstream customers ready to handle that sort of schema change?

Running the db migration is the easy part.

Rapzid2y ago

Depends on the lib. Max safe int size is like 9 quadrillion. You can safely deserialize serial bigints to this without ever worrying about hitting that limit in many domains.

1 more reply

jameshart2y ago

11 seconds won't fix all your foreign keys. And all the code written against it that assumes an int type will accommodate the value.

iurisilvio2y ago

Migrating primary keys from int to bigint is feasible. Requires some preparation and custom code, but zero downtime.

I'm managing a big migration following mostly this recipe, with a few tweaks: http://zemanta.github.io/2021/08/25/column-migration-from-in...

FKs, indexes and constraints in general make the process more difficult, but possible. The data migration took some hours in my case, but no need to be fast.

AFAIK GitLab has tooling to run tasks after upgrade to make it work anywhere in a version upgrade.

golergka2y ago

Being two orders of magnitude away from running out of ids is too close for comfort anyway.

justinclift2y ago

> github.com included

Typo?

zetalyrae2y ago

The point about the storage size of UUID columns is unconvincing. 128 bits vs. 64 bits doesn't matter much when the table has five other columns.

There's an orthogonal approach which is using bigserial and encrypting the keys: https://github.com/abevoelker/gfc64

But this means 1) you can't rotate the secret and 2) if it's ever leaked everyone can now Fermi-estimate your table sizes.

Having separate public and internal IDs seems both tedious and sacrifices performance (if the public-facing ID is a UUIDv4).

I think UUIDv7 is the solution that checks the most boxes.

[0]: https://uuid7.com/

[1]: https://en.wikipedia.org/wiki/Snowflake_ID

Merad2y ago

> The point about the storage size of UUID columns is unconvincing. 128 bits vs. 64 bits doesn't matter much when the table has five other columns.

solidsnack90002y ago

...and this has not just a size impact but also a substantial performance impact.

paulddraper2y ago

> 128 bits vs. 64 bits doesn't matter much when the table has five other columns.

!!!!

But those 5 other columns are not indexed.

---

There are three levels of database performance:

1. Indices and data fit in memory.

2. Indices fits in memory, data does not.

3. Neither indices not data fit in memory.

If you can do #1 great, but if you don't have that, fight like a madman for #2.

---

Doubling your index sizes is just makes it harder.

gfody2y ago

when x86-64 cpus were new the performance impact from switching to 64-bit pointers was so bad we had to create x32/ilp32 and the reason .NET still has "prefer 32-bit" as a default even today.

using 128-bit uuids as PKs in a database is an awful mistake

neonsunset2y ago

gfody2y ago

the 32bitpref corflag isn't part of a .net core target since those are always il, it is more properly a runtime concern as it should be

it's still the default in .net as of 4.8.1 (has been since it was introduced in 4.5 roughly coinciding w/java's pointer compression feature which is also still the default today)

1 more reply

s4i2y ago

solidsnack90002y ago

> There's an orthogonal approach which is using bigserial and encrypting the keys...

Another variant of this approach: https://pgxn.org/dist/permuteseq/

It is also feasible to encrypt the value on display (when placing it in URLs, emails, &c):

https://wiki.postgresql.org/wiki/Pseudo_encrypt

nine_k2y ago

If you have a column that is used in many joins, there are performance reasons to make it as compact as possible (but not smaller).

newaccount7g2y ago

canadiantim2y ago

Would it ever make sense to have a uuidv7 as primary key but then anther slug field for a public-id, e.g. one that is shorter and better in a url or even allowing user to customize it?

Horffupolde2y ago

Yes sure but now you have to handle two ids and guaranteeing uniqueness across machines or clusters becomes hard.

vrosas2y ago

traceroute662y ago

Slight nit-pick, but I would pick up the author on the text vs varchar section.

The author effectively wastes many words trying to prove a non-existent performance difference and then concludes "there is not much performance difference between the two types".

This horse bolted a long time ago. Its not "not much", its "none".

[1] https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use... [2] https://www.postgresql.org/docs/current/datatype-character.h...

alex_smart2y ago

>The author effectively wastes many words trying to prove a non-existent performance difference and then concludes "there is not much performance difference between the two types".

FTA:

>So, as you can see, the text type with CHECK constraint allows you to evolve the schema easily compared to character varying or varchar(n) when you have length checks.

traceroute662y ago

> They then also show that there is in fact a significant performance difference when you need to migrate your schema to accodomate a change in length of strings being stored.

Which is a pointless demonstration if you RTFM and design your schema correctly, using text, just like the manual and the wiki tells you to.

> the text type with CHECK constraint allows you to evolve the schema easily compared to character varying or varchar(n) when you have length checks.

Which is exactly what the manual tells you ....

"For many purposes, character varying acts as though it were a domain over text"

alex_smart2y ago

And what is wrong with someone figuring out for themselves and explaining to others why some suggestion makes sense logically, rather than just quoting the manual?

1 more reply

exabrial2y ago

bluerooibos2y ago

Has anyone written about or noticed the performance differences between Gitlab and GitHub?

They're both Rails-based applications but I find page load times on Gitlab in general to be horrific compared to GitHub.

golergka2y ago

I used Gitlab a few years ago, but then it had severe client-side performance problems on large pull requests. Github isn't ideal with them too, but it manages to be decent.

wdb2y ago

Yeah, good reason to spit up pull requests ;) I do think it improved a lot over the last two years, though

oohffyvfg2y ago

> compared to GitHub.

this is like comparing chrome and other browsers, even chromium based.

heyoni2y ago

I mean GitHub in general has been pretty reliable minus the two outages they had last year and is usually pretty performant or I wouldn’t use their keyboard shortcuts.

There are some complaints here from a former dev about gitlab that might provide insight into its culture and lack of regard for performance: https://news.ycombinator.com/item?id=39303323

Ps: I do not use gitlab enough to notice performance issues but thought you might appreciate the article

imiric2y ago

> I mean GitHub in general has been pretty reliable minus the two outages they had last year

Huh? GitHub has had major outages practically every other week for a few years now. There are pages of HN threads[1].

There's a reason why githubstatus.com doesn't show historical metrics and uptime percentages: it would make them look incompetent. Many outages aren't even officially reported there.

I do agree that when it's up, performance is typically better than Gitlab's. But describing GH as reliable is delusional.

[1]: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...

heyoni2y ago

Delusional? Anecdotal maybe…I was describing my experience so thanks for elaborating.

I only use it as a code repository. Was it specific services within GitHub that failed a lot?

2 more replies

anoopelias2y ago

More comments on this submission: https://news.ycombinator.com/item?id=39333220

vinnymac2y ago

gfody2y ago

> 1 quintillion is equal to 1000000000 billions

it is pretty wild that we generally choose between int32 and int64. we really ought to have a 5 byte integer type which would support cardinalities of ~1T

klysm2y ago

Yeah it doesn't make sense to pick something that's not a power of 2 unless you are packing it.

postalrat2y ago

I guess that depends on how the index works.

gfody2y ago

we usually work with some page size anyways, 64 5-byte ints fit nicely into 5 512-bit registers, and ~1T is a pretty flexible cardinality limit after ~2B

azlev2y ago

Aeolun2y ago

josephg2y ago

> As I discussed in an earlier post[3] when you use Postgres native UUID v4 type instead of bigserial table size grows by 25% and insert rate drops to 25% of bigserial. This is a big difference.

Does anyone know why UUIDv4 is so much worse than bigserial? UUIDs are just 128 bit numbers. Are they super expensive to generate or something? Whats going on here?

AprilArcus2y ago

couchand2y ago

perrygeo2y ago

The 25% increase in size is true but it's 8 bytes, a small and predictable linear increase per row. Compared to the rest of the data in the row, it's not much to worry about.

stephen1232y ago

I think its because of btrees. Btrees and the pages work better if only the last page is getting lots of writes. Iuids cause lots of un ordered writes leading to page bloat.

barrkel2y ago

So you pay taxes at both insert time and later during selection.

eezing2y ago

martinald2y ago

Is it just me that thinks in general schema design and development is stuck in the stone ages?

Also the point about external IDs and internal IDs - why can't the database/ORM do this more automatically?

Are there any tools people use for this?

sjwhevvvvvsj2y ago

doctor_eval2y ago

In my experience, getting the data structures right is 99% of the battle. If you get that right, the code that follows is simple and obvious.

Some people seem to think I’m weird for starting with the database (and for using plpgsql), but I think it’s actually a superpower.

jupp0r2y ago

jimbokun2y ago

It’s part of the dirty little secret of why document databases and other NoSQL systems became popular.

Required even less up front thinking about how to model your data. Throw some blobs of JSON into Mongo or whatever, and worry about the rest later.

thom2y ago

Somewhat ironically, many modern enterprises have peeled all the way back to SQL for a huge amount of logic anyway, so I don’t think we’re done caring about database schemas quite yet.

1 more reply

sjwhevvvvvsj2y ago

Yup - and you can’t code your way to real scale either. At real scale the game is all about data structures. Code just gets them from A to B.

Or as they say at Google, the job of SWE is “moving protos”.

nly2y ago

Exactly that. Sitting down and thinking about your data structures and APIs before you start writing code seems to be a fading skill.

EvanAnderson2y ago

It absolutely shows in the final product, too.

timacles2y ago

Any abstraction you could come up with wouldnt fit 90% of the other cases

Atotalnoob2y ago

EF core doesn’t to drop/create for columns in db providers that support renaming columns. It only does it for ones that don’t like MySQL or SQLite

winrid2y ago

Not a silver bullet for every project but the Django ORM largely solves this with its migrations. You define your table classes and it just generates the migrations.

Throw in a type checker and you're in pretty good shape.

Rust also has sqlx which will type check your code against the DB.

dxdm2y ago

You still need to know what SQL the migration will run (take a look at `manage.py sqlmigrate`) and most importantly how your database will apply it.

basil-rash2y ago

1 more reply

mvdtnz2y ago

winrid2y ago

also, every company I've worked at used ORMs in some capacity. Sometimes, not always.

Also, I don't really work on any apps with only four customers. either they have almost a million or zero :P Try again. :)

winrid2y ago

I don't think insults are really necessary here.

Also ORMs can be very useful, just don't do dumb stuff, like with any technology.

I use them when appropriate.

cschmatzler2y ago

Merad2y ago

> Also the point about external IDs and internal IDs - why can't the database/ORM do this more automatically?

emodendroket2y ago

rob1372y ago

I found this post very useful. I'm wondering where I could find others like it?

aflukasz2y ago

rglynn2y ago

I like this, although a bit narrower - https://blog.mastermind.dev/indexes-in-postgresql

cosmicradiance2y ago

Here's another - https://zerodha.tech/blog/working-with-postgresql/

sidcool2y ago

Great read! And even better comments here.

firemelt2y ago

so anyone use schema.rb in production? even dhh once campfire use .sql instead schema.rb

j / k navigate · click thread line to collapse