First of all, how does it persuade you of that? The article touches a really small (though incredibly important for up-time) subject.
Secondly, in any large company, the majority is 'bloat'. It's security engineers, code reviews, data architecture, HR, internal audit teams, content moderators, scrum masters and I can keep going. In a start-up many of these roles can be ignored, because growth > stability. In a large organization, part of the bloat helps ensure a certain amount of stability that's necessary to keep an organization alive.
If a product is mature enough, like Twitter seems to be, removing engineers won't instantly crash the product. It'll happen slowly. Bugs will creep in, because less time is spent on review and overall architecture. Security issues will creep in because of about the same issues and less oversight. Then, once this causes enough issues for the product to actually crash, the right people to fix it quickly might not be there anymore. That's when fixing the issues suddenly takes a lot more time.
If the current state of affairs at Twitter keeps up, it'll probably be a slow descent into chaos. Especially with Elon pushing for new features to be implemented quickly, inevitably by people who cannot fully understand the implications of said features, because 80% of knowledge is missing.
By flowing from "many people think it's bloat, I'll tell you what's really going on" to "a tiny team of 1~3 built the whole infra for a critical component."
I'm not really trying to make commentary on whether or not Twitter engineering was bloat, or whether or not I think it'll hit problems in the future. Just commenting on the fact that the article broke my expectations a little bit as a reader.
There's no doubt that OP built a great and stable automation layer on top of Mesos for caching workloads. But there are numerous other types of workloads on top of Mesos (including, I presume, mission-critical database deployments that need well-disciplined draining protocols to shift between nodes), as well as administrative needs for the Mesos-to-infrastructure level, and things running on bare metal below the Mesos level. These things all needed dedicated SREs, and the absence of these SREs could result in a scenario like the one mentioned in the Twitter thread I linked: two obscure mutually-dependent components expire and cannot be re-provisioned using documented tools.
I also think an important meta-point is that when Twitter was bringing in substantial revenue from advertising, every minute of downtime would have significant costs - costs that could make it easily worthwhile to "over-provision" SRE talent. With advertisers pausing engagement, perhaps Twitter loses less money from a day-long outage than it would save having the right talent to turn a day-long outage into a minutes-long outage.
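To make that trade-off concrete, here is a rough back-of-the-envelope sketch. Every number in it is made up for illustration (revenue per minute, outage counts, staffing cost), and the function name is hypothetical; it only shows how the same SRE budget can pay for itself when ad revenue is healthy and stop paying for itself when advertisers have paused.

```python
def sre_staffing_pays_off(revenue_per_minute, slow_minutes, fast_minutes,
                          outages_per_year, annual_sre_cost):
    """Return True if the revenue saved by faster recovery exceeds staffing cost.

    All inputs are hypothetical planning numbers, not real Twitter figures.
    """
    minutes_saved = slow_minutes - fast_minutes
    revenue_saved = minutes_saved * revenue_per_minute * outages_per_year
    return revenue_saved > annual_sre_cost


# Assumed scenario: staffing turns a day-long outage (1440 min) into a
# 10-minute one, twice a year, at a cost of $10M per year.
healthy_ads = sre_staffing_pays_off(8_000, 1440, 10, 2, 10_000_000)
paused_ads = sre_staffing_pays_off(2_000, 1440, 10, 2, 10_000_000)

print(healthy_ads)  # True: 1430 * 8000 * 2 = 22,880,000 saved
print(paused_ads)   # False: 1430 * 2000 * 2 = 5,720,000 saved
```

With these assumed inputs the only thing that changes between the two cases is revenue per minute, which is the point of the comment above.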
Twitter is only judged by its profitability (namely, Musk's ability to service debt without selling more Tesla stock than he already has), while most other tech companies (both public and private) are judged by both profitability and revenue growth. If you want both, larger SRE teams, to say nothing of feature development and regulatory compliance teams, start to make a lot more sense.
It also (a) increases the bus factor, [1] and (b) allows people to take vacations and time off without having to watch their phones like a hawk.
It's all fun and lulz until it happens to you.
You'd be surprised, but that's not necessarily the case.
One of my friends was such a person in a shoe making company as a designer. Instead of giving her a raise, they fired her.
Cue them re-hiring her a month or two later after they found out the hard way that the less experienced subordinate really couldn't handle the job of the two of them on their own.
(I do also like the lottery version and use them basically interchangeably, though the shorthand is always "bus factor" for me)
It's not about having enough people to do the work even if someone quits, it's about having enough people that know how to do something that we aren't losing chunks of knowledge if someone quits (or dies, or gets fired, or gets sick, or etc).
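A minimal sketch of that definition, assuming you can write down who understands which component. The component and person names are invented, and `bus_factor` / `orphaned_components` are hypothetical helpers, not from any library. Under this simple model the bus factor is just the size of the smallest group of people who know any one component.

```python
def bus_factor(knowledge):
    """Smallest number of departures that leaves some component with nobody
    who understands it. `knowledge` maps component -> set of people."""
    return min((len(people) for people in knowledge.values()), default=0)


def orphaned_components(knowledge, departed):
    """Components that nobody still on the team understands."""
    gone = set(departed)
    return sorted(c for c, people in knowledge.items() if not (set(people) - gone))


# Invented example team.
knowledge = {
    "cache": {"ana", "bo"},
    "deploys": {"ana"},
    "billing": {"bo", "cy", "di"},
}

print(bus_factor(knowledge))                          # 1
print(orphaned_components(knowledge, {"ana"}))        # ['deploys']
print(orphaned_components(knowledge, {"ana", "bo"}))  # ['cache', 'deploys']
```

Note that headcount does not appear anywhere: a team of four can still have a bus factor of 1 if only one person knows how deploys work.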
It doesn't make sense to me to treat people as part of a conduit bus that are interchangeable as long as there are enough people.
It's amazing to me how many people following the Twitter saga, some familiar with or actually working in technology, thought that Twitter would crash within days of the engineers being fired. And because it didn't, the job cuts are justified.
With that said, there are differences between internal systems and something like Twitter on the public internet. I assume that Twitter is a system under constant attack. What happens when the next log4shell level vulnerability comes out?
The car analogy is amusing, but how much does it really hold up? Have we ever seen another major social media company drop this much of its staff in one go? I certainly can’t think of an example. I think we’re in somewhat uncharted waters here.
A driverless car won’t last long, we know that for a fact. I think it remains to be seen how long a bloatless twitter can last. I’m personally optimistic.
It's also really hard to define 'last' though. Does 'last' mean just for up-time? Does it mean up-time without a major security incident while maintaining the same DAU? Does it mean business as usual on all fronts except number of employees? We know that Twitter already had some security issues with their God Mode admin panel.
I really wonder, for example, whether no angry ex-employees still have access to critical systems or data. This is usually pretty well regulated in large organizations like twitter, but since they've lost the majority of their staff, who knows when the people looking after that left?
I'm not convinced Twitter had a ton of bloat. (Most of the teams actually involved don't seem to think so). Just because Elon can't understand something, doesn't make that thing "bloat".
Twitter definitely had a few weird features that could be cut (the audio podcasting thing, for example). But calling most of Twitter's microservices "bloat" is about as dumb as calling a car's seatbelt and airbag and crumple zones and that spare tire in the trunk "bloat" -- it's only "bloat" if you assume all people will always be perfect and no one will ever make a mistake anywhere, and nothing bad will ever happen.
Company soft killed the product, everybody left, they didn't hire anyone to replace. We went from 20 engs to 3. Worst codebase ever made by ex FANGs hotshots who thought they understood something about system architecture. Very "clever" and complicated. Data consistency issues happening every day, likely due to misuse of messaging queues. Chargebacks being ignored and mailed in physical letters every month. A couple of millions going through the platform every year.
My task was to run a team of mostly juniors maintaining and adding features to that mess.
I had no clue what that codebase was doing. We just left things as they were, fixing fires as they came. Nothing too bad happened. Slowly built a leaner replacement for some components. We simplified things over time and we even rebuilt some of the knowledge of the old platform, which helped with the daily outages.
The issues started happening less often. Eventually I moved on from that company, removing again a big chunk of knowledge. Over time I've heard tales of other people coming in and rebuilding that knowledge, over and over.
The platform is still standing.
It's a tricky one, because on one hand it increases my trust that their system was built robustly, but at the same time the passage of time would increase the chance of unseen "wear and tear" (both figurative and literal) that might be going unaddressed, or under-addressed. But we have no view into that.
We won't really know until they suffer a major problem whether or not they have enough staff yet to keep sufficient maintenance going that such an event doesn't cascade into something much worse and/or whether or not they will be able to recover from it in a reasonable amount of time.
Horrible systems can survive, but often they survive through sheer luck.
I think it's hard to just use time as a measure. If there are no security issues and they add no features, then things should run fine. Which, ironically, points to how solid the team was that was fired.
Of course if Twitter not only limps along, but thrives in this new setup then I'll definitely change my opinion. Being in the US, this might end up the case while Twitter for the rest of the world falls apart.
Keep in mind, that I think Twitter was bloated and needed a big shakeup. Randomly dumping people and those who tried to correct me is not the heuristic I would have used.
With that said, Twitter still has 2 huge problems. No vision and saddled with an enormous amount of debt. Right now, Musk is taking the PE approach to cut and milk what's there. The problem is, there isn't much to milk.
This is an excellently apt analogy, in light of Twitter's new owner.
Unless the company that creates it is owned by Elon ;)
Because they work for companies where the product would fail within days of them being fired themselves.
The truth in retrospect is that it was my fault (and my upper leadership's) that I wasn't replaceable. I created a knowledge silo around myself since I wanted to move fast and figured I could prevent the team from being bogged down in complexity if I just handled it myself and while that worked in regards to delivering out-sized results for the available bandwidth, it also was a risk that materialized as described above. So while I do believe that everyone should be replaceable and it's their responsibility to be, it's not always the case and products can live and die by it.
I worked at a company where everything hinged solely on one guy working from another country. When he left, loss of institutional knowledge took about three days to show real effects as things came crashing down.
I worked hard to make _myself_ replaceable for when I left, it was a pretty good exercise, but me having that degree of freedom was symptomatic of the problems of the company.
That said, you can replace people and build back that institutional knowledge -- both loss and gain take a significant amount of time.
For the type of jobs at hand here, one of the things I learned is that nobody is essential. Even that person you think is essential.
Of course it could go either way but the jury is currently out. It’s entirely possible that severe company-impairing technical breakdowns are already in progress and unrecoverable.
Or maybe not.
On the other hand, as engineers, we tend to attach way too much self importance to our roles. Like if we're not there entering the "numbers" 4 6 15 16 24 32 every 108 minutes, the entire business is going to crumble. So... this is one I'm going to watch with a keen eye.
Never have I encountered an engineer that thought that.
> Never have I encountered an engineer that thought that.
there are people all up and down this page saying effectively just that. well i guess i'm assuming most of these people are engineers in the software sense of the word.
It's not like Twitter was bug free before. How many times it annoyingly refreshed the timeline while I was reading something, or when it shows a notification that it failed to send the DM, and when you retry it says "you've already wrote this", or you open the reply dialog, but it freezes, has no send button at all, so you have to re-open it. All of this was happening to me pretty regularly long before Elon came along.
As we all know, just hiring more people is not necessarily the solution to every problem, and to me it seems it was exactly what Twitter tried to do in the past. Now they deconstructed it to the bare bones, which will clearly show what the core problems and requirements are. They basically turned Twitter back into a startup. And from that new starting point they can hire again to cover the needs as they arise. If they succeed it will be a huge success as they'll end up with a far more optimal team (and huge savings), and of course, if they fail to catch up with problems it will be a huge failure. We'll see how well Musk can manage it...
Anyone thinking "huge success" is unrealistically optimistic, IMO.
It's going to be another MySpace/AOL/Bebo, added to the list of dumbest purchases ever.
And that's still going to be true if the point was to destroy the original community and replace it with a different political orientation.
How many millions are in a billion?
the power law applies to any big organization. 20% of the people do 80% of the work, whilst 80% of the people are just there for "support".
whatsapp was run by a team of like 20 people or something when they got acquired for $20 billion. for a simple software product, you don't really need that many people. in fact, more people often means bad software. you just need a small group of very talented engineers to run the product and add new features when necessary.
big (and especially public) companies often times need to hire a lot, just to look like a real company.
now that twitter is private, elon has no responsibility to public investors and can focus less on looking like a real company and more on doing what needs to be done to cut bloat/costs and improve product
https://twitter.com/IlluminatiGanga/status/15946097904324444...
new members joining in 1970. hmmm.
That said, there are lots of bugs in Twitter now, today, when they presumably had the benefit of being in stable mode for a long time. For example, Twitter regularly refreshes and loads new tweets while I'm reading them, pushing the tweet I was in the middle of reading out of view. That seems like a pretty silly bug to exist in a mature product. I regularly reach a state where I have to kill the app and relaunch it because all of the "back" commands just minimize the app instead of taking me back to the timeline. I could go on.
But regarding the bugs, I’m totally with you. Same here. I use Twitter only in the browser. Browse long enough and the page reloads as if it ran out of memory.
Have you implemented a system which stores hundreds of billions of pieces of media content and makes different slices of them immediately available to hundreds of millions of users?
I'm currently the first and sole architect on a product that was built by only devs. I really know why I exist.
Where’s the 1000s of engineers for Postgres? Most stuff that works is made by a handful of people. Look at io_uring it’s basically one guy at Facebook…
You're comparing Postgres to Twitter?
If some of the people making comments like this actually work in tech, then yeah, maybe there is a lot of bloat to be cut.
https://www.citusdata.com/blog/2017/04/20/analyzing-postgres...
Not tens of thousands at any one organization supporting postgres and clickhouse.
No single organization needs even hundreds of people to support these apps. You just need a good architecture and a handful of DBAs, developers, and sysadmins... maybe.. depending on your scale. At many smaller orgs you can probably get away with one.
It provides a house for your clicks, then you sell the clicks from your house.
In Twitter's case I would recommend they use checkhouse which will help them monetize their checks. Tumblr is already way ahead of them.
You focused mostly on additive bloat, there's also multiplicative bloat in the form of multiple teams focused on building separate versions of the same product to increase likelihood of success and empire building where leaders don't actually have a remit large enough to support the team size they have, but they have woven a narrative that defends the necessity nonetheless. Put everything together and teams are very easily 6x+ larger than they absolutely need to be to get a product into market.
Please tell us in detail about the Twitter stack.
Because I always find it fascinating how people think they can estimate the effort to maintain it whilst having next to no understanding whatsoever of the tech stack.
1. A single person can run a mastodon instance in their spare time. Spinning up some containers for the app, a background worker and a database is quite simple.
2. Modern devops tooling makes it fairly trivial to spin up 10k instances of a container instead of 1, by just altering a number in a k8s manifest somewhere.
3. Ergo, a single person equipped with modern tooling (and sufficient funding) could spin up any number of mastodon instances.
4. Twitter is just a big mastodon instance.
5. Now that keeping everything up is sorted, add another 99 devs for feature development and you are done.
Now this is obviously faulty logic because points 3 and 4 are very false, but they look reasonable enough at first glance.
15 database admins
10 linux sys admins
5 kubernetes specialists
10 windows tech support
25 front end developers
15 back end developers w/ Scala
10 machine learning experts
Whether that makeup could or couldn't do it is a different question, or whether it would be a different mix; all of that is up for debate, but the 1/99 ratio is just one very specific, extreme, and laughable mix for anyone who has supported a system of any real size.
Yes, using open source code built and maintained by 732 contributors: https://github.com/mastodon/mastodon/graphs/contributors
These can make for a massive hairball of complexity that can swell the number of people needed to support it.
This reminds me of a talk I once saw by a Netflix SRE, who showed a crazy convoluted mess of a diagram with thousands of crisscrossing lines going everywhere, and him screaming "No one understands Netflix!!!"
Source: Wikipedia pages for both
It was also pathetically unprofitable, and had serious problems with inappropriate child photos, gore etc.
Some of those problems require manpower; there is no 10 man team who are very good at devops who can solve that.
THEY could probably do it with 100 people, YOU cannot.
100 people is most likely within the ballpark for a group of people whose sole purpose is to write and maintain twitter's tech stack. Unfortunately, that is not NEARLY the sole purpose of most people in businesses and that adds all kinds of productivity hits.
What happens is that people like yourself become convinced that's the only way to operate.
Likewise, bringing in Ad money would be a few more hundreds, because you need to chase leads in all countries.
Getting the Ads to work? That's tech and I'd be surprised if it was less than 100 people, too.
I subpoena telcos all the time. My sense is that the number is closer to 2 to 3 dozen.
Maybe the user facing site, but that's just the tip of the iceberg.
There are plenty of internal/backend/restricted systems to support and/or monetize this part.
And that's not counting the huge number of support people & moderators needed.
It is so incomparable in scope I don't know why people bring it up.
They are entirely different technological challenges.
The point I’m trying to make is that it takes some effort (beyond just the plumbing) to create an experience that folks actually want to use on an ongoing basis
And a website is easy. You could do it with 1 person.
But Elon is such a machine, he could keep it running by himself.
Making it globally available and legally compliant, that's where the next few thousand folks come in.
The people shouting loudly about how Twitter must have been so bloated are really just shouting their obvious inexperience working at global scales or their localized ambitions.
Could there be too many employees at Twitter? Sure. Most companies have dead weight. The number who were "extra" is probably not 9/10ths the employees though.
This is because you don't see the complexity. What you see as a Twitter user is a fraction of what's actually there.
You have to build a platform for ads. Not just serving ads, but allowing advertisers to prepare their collateral, preview them, get their results, and be billed. So that's an entire content and invoicing platform separate from your main feed.
And since your platform is all user generated content, you've got to build a moderation pipeline. A place for users to make reports, but also an interface for your content moderators to view content and make decisions. Oh, and while you're there you'd better build a portal for law enforcement to make data requests, along with your DMCA takedowns. Oh yeah, DMCA - that's another whole thing you've got to worry about.
Then the EU comes along and needs you to build something to support your GDPR obligations. Then India wants something similar, but only for its citizens. Your users also want verification, so better build that platform for securely verifying accounts and awarding checkmarks.
It snowballs. Was Twitter's engineering group bloated? Probably. Most large companies are. Could you run the whole Twitter tech stack as it exists today with a hundred people? Absolutely not.
Separately, some commenters here are flatly delusional about the effort to ship a site, android and ios apps, internal mod tools, help docs, support, and legal docs in 34 supported languages. Not to mention obeying laws in all the countries that implies.
Or image and video hosting! With recoding of videos, resizing of images, and the management of what is surely petabytes of images and videos with very high reliability! That is not a 1, 2, or 3 person job to do well.
Anyone who has used Twitter, have you seen any evidence they do this beyond extremely basic geographical targeting?
Like, people keep listing off all this stuff when we’ve all used the site and can see that, if it does have a team working on it, then they’re not doing it to the levels of their competitors.
It takes an army of engineers to build a resilient architecture at Twitter's scale.
And why are we even talking about "keeping the lights on"? Elon is claiming he's going to build a better video platform than YouTube, complete with better tools for creators, for crying out loud.
"20 with cloud, 40 without. So much overlap between iOS, Android, and the web, three people can do all three. More for the backend." https://twitter.com/realGeorgeHotz/status/159371372367535718...
>Their moderation
Above I assumed their moderation team was probably larger than their engineering team, and mostly contractors. Thus I kept my estimate to the size of their engineering team.
What we are watching is a massive failure event right now and the question really is if there's enough time for twitter management to fill in the gaps before there's an outage.
That's how it couldn't prove the claim.