1. What if the domain teams don't actually care about maintaining data quality, or even about sharing data in the first place? This model requires every data producer to maintain a relationship with every data consumer. That's not going to happen in a large company.
2. Who pays for query compute and data storage when you're dealing with petabytes upon petabytes of data from different domains? If you (the data platform team) bill the domain teams, then see point 1: they'll just stop sending data.
3. Just figuring out what data exists in the data mart (which is essentially what this describes) is a hassle and slows down business use cases, especially when you have thousands of datasets. You need a team to act as a sort of "reference librarian" for people querying the data. You can't easily decentralize that.
4. How do you get domain teams to produce data in a form that is easy to query? What if they write lots of small files that are computationally expensive to query; who's going to advise them? Data production is tightly coupled to query performance at TB scale, and the domain teams are not going to become experts, or care.
5. What do you do when a domain team has a lot of important data but no engineering resources? Do you just say, "Oh well, we're a self-service data platform, so no one gets access to the data"?
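To make point 4 concrete, here's a minimal sketch of the small-files problem and the usual fix, compaction: a producer emits one tiny file per event, and the platform merges them into fewer, larger files so scans stay cheap. All file names, formats, and sizes here are invented for illustration.

```python
import os
import tempfile

def write_small_files(directory, n_events):
    """Simulate a domain team writing one tiny file per event."""
    for i in range(n_events):
        with open(os.path.join(directory, f"event-{i:05d}.csv"), "w") as f:
            f.write(f"{i},click\n")

def compact(directory, out_name="compacted-00000.csv"):
    """Merge all the small files into one larger file, removing the originals.
    Returns how many small files were merged."""
    parts = sorted(p for p in os.listdir(directory) if p.startswith("event-"))
    with open(os.path.join(directory, out_name), "w") as out:
        for p in parts:
            with open(os.path.join(directory, p)) as f:
                out.write(f.read())
            os.remove(os.path.join(directory, p))
    return len(parts)

with tempfile.TemporaryDirectory() as d:
    write_small_files(d, 100)
    merged = compact(d)
    print(merged, len(os.listdir(d)))  # 100 small files become 1
```

The point is that this kind of compaction has to happen somewhere; if the domain team doesn't know to do it, the platform team ends up doing it for them anyway.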
Unless this is Google, that doesn't make any sense. That's an average of 7.5 MB per human on the planet, every day.
Thinking about what to store (and what to log) is not trivial and takes careful consideration. Plus, there's always the argument: "But what if we need something that we forgot to store or log?".
Answering that takes time and risk-acceptance that most developers in most projects don't get.
"Data lake" was just a pseudo-solution to buy peace of mind: throw everything into a big bucket and figure it out later.
Just like performance optimization, it's cheaper to buy more hardware than to pay for humans to think about it and coordinate.
Believe it or not, this is just for security data. But that fact combined with SOA leads to lots of logs. (5 PB is the uncompressed amount; we decompress incoming data, then ETL it.)
But, it's not obvious it actually makes sense for an organisation to value analytical concerns at a similar level to operational concerns, with similar resourcing. The value to the organisation of building and maintaining analytical APIs to serve data products is likely to be considerably less -- perhaps by one or two orders of magnitude -- than the value produced by actually performing and maintaining the core operational function.
If the stars align, maybe in future the analytical data could be used as input to an optimisation project that improves some key metric of the operational function by 10% or 1%. How much is it worth paying today for the possibility of an outcome like that? It's not obvious; it really comes down to how valuable a 10% or 1% lift would be and how much it would cost, and that needs some kind of business case. It's not obvious that resourcing analytical data APIs owned by arbitrary domain teams everywhere is a sound investment for the business.
This doesn't happen in small or mid-size companies, either. Or if it does, it happens begrudgingly. SWEs have too much to do.
[1]: https://martinfowler.com/articles/data-monolith-to-mesh.html
[2]: https://martinfowler.com/articles/data-monolith-to-mesh.html...
As an example, Team 1 might define the manufacturer of a Sprocket as the company that assembled it, whereas Team 2 might define the manufacturer as the company that built the Sprocket's engine. Since the purpose of a data mesh is to enable other teams to perform cross-domain data analytics, these definitions need to be reconciled, or it'll become a data mess. Where does that get resolved?
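The Sprocket example can be sketched in a few lines: two teams record a field called "manufacturer" with different meanings, so a naive cross-domain comparison silently disagrees. The records, company names, and the renamed fields below are all hypothetical; one plausible reconciliation is to replace the ambiguous field with explicit, domain-qualified ones.

```python
# Two teams' views of the same Sprocket, with conflicting semantics for
# "manufacturer" (all data invented for illustration).
team1 = {"sprocket-42": {"manufacturer": "AssemblyCo"}}   # the assembler
team2 = {"sprocket-42": {"manufacturer": "EngineWorks"}}  # the engine builder

# Without a shared definition, the "same" attribute disagrees:
assert (team1["sprocket-42"]["manufacturer"]
        != team2["sprocket-42"]["manufacturer"])

def reconcile(t1_row, t2_row):
    """Rename the ambiguous field into explicit ones, so the semantics
    travel with the data instead of living in each team's head."""
    return {
        "assembled_by": t1_row["manufacturer"],
        "engine_built_by": t2_row["manufacturer"],
    }

print(reconcile(team1["sprocket-42"], team2["sprocket-42"]))
# {'assembled_by': 'AssemblyCo', 'engine_built_by': 'EngineWorks'}
```

Someone still has to decide on and enforce those qualified names across domains, which is exactly the unanswered "where does that get resolved?" question.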
I thought a big portion of the need that data mesh fills comes from organizations that are missing resources in their core BI team.
Treat data assets like microservices and pipelines like the network. Period.
Prescribing everything else rubs me the wrong way.
So, data mesh is: an architecture in which a company's data is organized into loosely coupled data assets.
It works well, but has many issues too. Users' SQL and input data can differ, often in unpredictable ways, because they bring their own and expect the central team to handle the rest. Those edge cases break the standardization rules, fail the workflow, and confuse the user, because the platform is a black box to them, so they ask about changing it or adding a new feature. Now your standardization asks are bottlenecked by this central team, and the options are:
- to wait for the central team to fix/improve it
- find some hack around the platform
- don’t use the platform and its associated toolings, so you build it yourself and have another disjoint system for a specific use case
- the central team might build a feature that one team asked for a year ago, but now nobody needs it anymore and nobody knows why it's in the code. Repeat many times for various asks over the years, and your code base is likely a foreign mess.
- give your resources/funding to the central team to prioritize your ask. A few years after it's built, the central team owns something they themselves never wanted.
By making data the product (abstracting away all the gory details), you fundamentally engage with data through a UI or an API. As you expose these products they become accretive, while encapsulating the domain expertise within them.
In a lot of orgs this goes sideways and the infrastructure teams end up owning everything and never have time to do anything else. Usually this happens due to upper management putting on the squeeze.
In order for teams to actually own their infrastructure and data we need better tooling to help them. This is coming along nowadays but isn’t fully there.
So what kind of communication structures are good, and in what circumstances? How do we structure work so that we don't have to communicate about everything? When do we fall back to ad-hoc video chat or even in-person meetings? These are the kinds of questions that 21st-century management has to answer. It's fascinating to watch people grapple with them.
Whether that data lake is semi-operated by the team (as proposed in the article) or operated centrally, requiring the lake's ETL process to use at least some of the APIs and tools used for transactional interaction goes a long way toward keeping the data architecture sane.
Resist the temptation of things like RDBMS-level CDC/log-stream capture or database snapshots for populating data lakes. (RDS Aurora's snapshot export/restore is like methamphetamine in this area: incredibly fast and powerful, but with a very severe long-term cost to data lake uniformity and usability.)
I'm not saying "every row in the data lake must be extracted by making the exact same API hit that an internet user would make, with all of the overhead incurred by that". You can tap into the stack at a lower level than that (e.g. use the same DAOs that user APIs use when populating the data lake, but skip the whole web layer). Just don't tap into the lowest possible layer of the stack for data lake ETL--even though that lowest layer is probably the quickest to get working and most performant, it results in poor data hygiene over the medium and long term.
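The layering argument above can be sketched as follows. This is a hypothetical example, not anything from the article: the class, field names, and status encoding are all invented. The lake ETL reuses the same DAO the user-facing API uses, so the lake sees business-level semantics rather than raw internal encodings.

```python
# What a low-level snapshot/CDC tap would see: internal codes and cents
# (invented example data).
RAW_ROWS = [
    {"id": 1, "status_code": 3, "amt_cents": 1999},
]

class OrderDAO:
    """Shared data-access layer: applies the same business-level decoding
    for both the web API and the data lake ETL."""
    STATUS = {3: "shipped"}

    def get_order(self, row):
        return {
            "order_id": row["id"],
            "status": self.STATUS[row["status_code"]],
            "amount_usd": row["amt_cents"] / 100,
        }

def etl_to_lake(dao, rows):
    # Skip the whole web layer, but still go through the DAO, so the lake
    # gets the same semantics the operational API exposes.
    return [dao.get_order(r) for r in rows]

print(etl_to_lake(OrderDAO(), RAW_ROWS))
# [{'order_id': 1, 'status': 'shipped', 'amount_usd': 19.99}]
```

Tapping the raw table instead would leak `status_code` and `amt_cents` into the lake, and every consumer would have to rediscover what those mean.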
E-commerce example: the warehouse team might produce a batch of datasets, and the web site team as well. A data lake approach would have a single data team owning both sets of datasets. A data mesh would have each team responsible for maintaining their own, and for making sure they're interoperable (like having a shared order ID concept).
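The interoperability point can be shown with a tiny sketch: each team owns its own dataset, but because they agree on a shared `order_id`, a consumer can join across domains without either team knowing about the other's internals. The datasets and field names are made up for illustration.

```python
# Each team publishes its own dataset; the only contract is order_id
# (invented example data).
warehouse = [{"order_id": "A1", "shipped_at": "2024-01-03"}]
website   = [{"order_id": "A1", "placed_at": "2024-01-01", "total": 50.0}]

def join_on_order_id(left, right):
    """Inner-join two datasets on the shared order_id key."""
    right_by_id = {r["order_id"]: r for r in right}
    return [
        {**l, **right_by_id[l["order_id"]]}
        for l in left
        if l["order_id"] in right_by_id
    ]

print(join_on_order_id(website, warehouse))
# [{'order_id': 'A1', 'placed_at': '2024-01-01', 'total': 50.0,
#   'shipped_at': '2024-01-03'}]
```

Without that shared key, a consumer would be back to asking both teams how their records relate, which is the coordination cost the mesh is supposed to eliminate.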