1. What if the domain teams don't actually care about maintaining data quality, or even about sharing data in the first place? This model requires every data producer to maintain a relationship with every data consumer. That's not going to happen in a large company.
2. Who pays for query compute and data storage when you're dealing with petabytes upon petabytes of data from different domains? If you (the data platform team) bill the domain teams, then see point 1: they'll just stop sending data.
3. Just figuring out what data exists in the data mart (which is essentially what this describes) is a hassle and slows down business use cases, especially when you have thousands of datasets. You need a team to act as a sort of "reference librarian" for people querying the data. You can't easily decentralize that.
4. How do you get domain teams to produce data in a form that is easy to query? What if they write lots of small files that are computationally expensive to query; who's going to advise them? Data production is tightly coupled to query performance at TB scale, and the domain teams are not going to become experts, or care.
5. What do you do when a domain team has a lot of important data but no engineering resources? Do you just say, "Oh well, we're a self-service data platform, so no one gets access to the data"?
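To make point 4 concrete, here's a minimal sketch of the small-files problem and the usual fix, compaction: a producer emits one tiny file per event, and the platform merges them into fewer, larger files so scans stay cheap. All file names, formats, and sizes here are invented for illustration.

```python
import os
import tempfile

def write_small_files(directory, n_events):
    """Simulate a domain team writing one tiny file per event."""
    for i in range(n_events):
        with open(os.path.join(directory, f"event-{i:05d}.csv"), "w") as f:
            f.write(f"{i},click\n")

def compact(directory, out_name="compacted-00000.csv"):
    """Merge all the small files into one larger file, removing the originals.
    Returns how many small files were merged."""
    parts = sorted(p for p in os.listdir(directory) if p.startswith("event-"))
    with open(os.path.join(directory, out_name), "w") as out:
        for p in parts:
            with open(os.path.join(directory, p)) as f:
                out.write(f.read())
            os.remove(os.path.join(directory, p))
    return len(parts)

with tempfile.TemporaryDirectory() as d:
    write_small_files(d, 100)
    merged = compact(d)
    print(merged, len(os.listdir(d)))  # 100 small files become 1
```

The point is that this kind of compaction has to happen somewhere; if the domain team doesn't know to do it, the platform team ends up doing it for them anyway.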
Unless this is Google, that doesn't make any sense. That's an average of 7.5 MB per human on the planet, every day.
Thinking about what to store (and what to log) is not trivial and takes careful consideration. Plus, there's always the argument: "But what if we need something that we forgot to store or log?".
Answering that takes time and risk-acceptance that most developers in most projects don't get.
"Data lake" was just a pseudo-solution to buy peace of mind: throw everything into a big bucket and figure it out later.
Just like performance optimization, it's cheaper to buy more hardware than to pay for humans to think about it and coordinate.
Believe it or not, this is just for security data. But that fact combined with SOA leads to lots of logs. (5 PB is the uncompressed amount; we decompress incoming data, then ETL it.)
But, it's not obvious it actually makes sense for an organisation to value analytical concerns at a similar level to operational concerns, with similar resourcing. The value to the organisation of building and maintaining analytical APIs to serve data products is likely to be considerably less -- perhaps by one or two orders of magnitude -- than the value produced by actually performing and maintaining the core operational function.
If the stars align, maybe in future the analytical data could be used as input to an optimisation project that improves some key metric of the operational function by 10% or 1%. How much is it worth paying today for the possibility of an outcome like that? It's not obvious; it really comes down to how valuable a 10% or 1% lift would be and how much it would cost, and that needs some kind of business case. It's not obvious that resourcing analytical data APIs owned by arbitrary domain teams everywhere is a sound investment for the business.
This doesn't happen in small or mid-size companies, either. Or if it does, it happens begrudgingly. SWEs have too much to do.
[1]: https://martinfowler.com/articles/data-monolith-to-mesh.html
[2]: https://martinfowler.com/articles/data-monolith-to-mesh.html...
As an example, Team 1 might define the manufacturer of a Sprocket as the company that assembled it, whereas Team 2 might define the manufacturer as the company that built the Sprocket's engine. Since the purpose of a data mesh is to enable other teams to perform cross-domain data analytics, these definitions need to be reconciled, or it'll become a data mess. Where does that get resolved?
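The Sprocket example can be sketched in a few lines: two teams record a field called "manufacturer" with different meanings, so a naive cross-domain comparison silently disagrees. The records, company names, and the renamed fields below are all hypothetical; one plausible reconciliation is to replace the ambiguous field with explicit, domain-qualified ones.

```python
# Two teams' views of the same Sprocket, with conflicting semantics for
# "manufacturer" (all data invented for illustration).
team1 = {"sprocket-42": {"manufacturer": "AssemblyCo"}}   # the assembler
team2 = {"sprocket-42": {"manufacturer": "EngineWorks"}}  # the engine builder

# Without a shared definition, the "same" attribute disagrees:
assert (team1["sprocket-42"]["manufacturer"]
        != team2["sprocket-42"]["manufacturer"])

def reconcile(t1_row, t2_row):
    """Rename the ambiguous field into explicit ones, so the semantics
    travel with the data instead of living in each team's head."""
    return {
        "assembled_by": t1_row["manufacturer"],
        "engine_built_by": t2_row["manufacturer"],
    }

print(reconcile(team1["sprocket-42"], team2["sprocket-42"]))
# {'assembled_by': 'AssemblyCo', 'engine_built_by': 'EngineWorks'}
```

Someone still has to decide on and enforce those qualified names across domains, which is exactly the unanswered "where does that get resolved?" question.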
I thought a big portion of the need that data mesh fills comes from organizations that are missing resources in their core BI team.
Treat data assets like microservices and pipelines like the network. Period.
Prescribing everything else rubs me the wrong way.
So, data mesh is: an architecture in which a company's data is organized into loosely coupled data assets.
It works well, but has many issues too. Users' SQL and input data can differ, often in unpredictable ways, because they bring their own and expect the central team to handle the rest. Those edge cases break the standardization rules, fail the workflow, and confuse the user, because the platform is a black box to them, so they ask about changing it or adding a new feature. Now your standardization asks are bottlenecked by this central team, and the options are:
- to wait for the central team to fix/improve it
- find some hack around the platform
- don’t use the platform and its associated toolings, so you build it yourself and have another disjoint system for a specific use case
- the central team might build a feature that one team asked for a year ago, but now nobody needs it anymore and nobody knows why it's in the code. Repeat many times for various asks over the years, and your code base is likely a foreign mess.
- give your resources/funding to the central team to prioritize your ask. A few years after it's built, the central team owns something they themselves never wanted.
By making data the product (abstracting away all the gory details), you fundamentally engage with data through a UI or an API. As you expose these products they become accretive, while encapsulating the domain expertise within them.
In a lot of orgs this goes sideways and the infrastructure teams end up owning everything and never have time to do anything else. Usually this happens due to upper management putting on the squeeze.
In order for teams to actually own their infrastructure and data we need better tooling to help them. This is coming along nowadays but isn’t fully there.
So what kind of communication structures are good, and in what circumstances? How do we structure work so that we don't have to communicate about everything? When do we fall back to ad-hoc video chat or even in-person meetings? These are the kinds of questions that 21st-century management has to answer. It's fascinating to watch people grapple with them.
Whether that data lake is semi-operated by the team (as proposed in the article) or operated centrally, requiring the lake's ETL process to use at least some of the APIs and tools used for transactional interaction goes a long way toward keeping the data architecture sane.
Resist the temptation of things like RDBMS-level CDC/log-stream capture or database snapshots for populating data lakes. (RDS Aurora's snapshot export/restore is like methamphetamine in this area: incredibly fast and powerful, but with a very severe long-term cost to data lake uniformity and usability.)
I'm not saying "every row in the data lake must be extracted by making the exact same API hit that an internet user would make, with all of the overhead incurred by that". You can tap into the stack at a lower level than that (e.g. use the same DAOs that user APIs use when populating the data lake, but skip the whole web layer). Just don't tap into the lowest possible layer of the stack for data lake ETL--even though that lowest layer is probably the quickest to get working and most performant, it results in poor data hygiene over the medium and long term.
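The layering argument above can be sketched as follows. This is a hypothetical example, not anything from the article: the class, field names, and status encoding are all invented. The lake ETL reuses the same DAO the user-facing API uses, so the lake sees business-level semantics rather than raw internal encodings.

```python
# What a low-level snapshot/CDC tap would see: internal codes and cents
# (invented example data).
RAW_ROWS = [
    {"id": 1, "status_code": 3, "amt_cents": 1999},
]

class OrderDAO:
    """Shared data-access layer: applies the same business-level decoding
    for both the web API and the data lake ETL."""
    STATUS = {3: "shipped"}

    def get_order(self, row):
        return {
            "order_id": row["id"],
            "status": self.STATUS[row["status_code"]],
            "amount_usd": row["amt_cents"] / 100,
        }

def etl_to_lake(dao, rows):
    # Skip the whole web layer, but still go through the DAO, so the lake
    # gets the same semantics the operational API exposes.
    return [dao.get_order(r) for r in rows]

print(etl_to_lake(OrderDAO(), RAW_ROWS))
# [{'order_id': 1, 'status': 'shipped', 'amount_usd': 19.99}]
```

Tapping the raw table instead would leak `status_code` and `amt_cents` into the lake, and every consumer would have to rediscover what those mean.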
E-commerce example: the warehouse team might produce a batch of datasets, and the web site team as well. A data lake approach would have a single data team owning both sets of datasets. A data mesh would have each team responsible for maintaining their own, and for making sure they're interoperable (like having a shared order ID concept).
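The interoperability point can be shown with a tiny sketch: each team owns its own dataset, but because they agree on a shared `order_id`, a consumer can join across domains without either team knowing about the other's internals. The datasets and field names are made up for illustration.

```python
# Each team publishes its own dataset; the only contract is order_id
# (invented example data).
warehouse = [{"order_id": "A1", "shipped_at": "2024-01-03"}]
website   = [{"order_id": "A1", "placed_at": "2024-01-01", "total": 50.0}]

def join_on_order_id(left, right):
    """Inner-join two datasets on the shared order_id key."""
    right_by_id = {r["order_id"]: r for r in right}
    return [
        {**l, **right_by_id[l["order_id"]]}
        for l in left
        if l["order_id"] in right_by_id
    ]

print(join_on_order_id(website, warehouse))
# [{'order_id': 'A1', 'placed_at': '2024-01-01', 'total': 50.0,
#   'shipped_at': '2024-01-03'}]
```

Without that shared key, a consumer would be back to asking both teams how their records relate, which is the coordination cost the mesh is supposed to eliminate.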