Turning the database inside-out with Apache Samza

(blog.confluent.io)

232 points · martinkl · 11y ago · 64 comments

slashdev11y ago

Immutability is hardly a cure-all, see the discussion here for why RethinkDB moved away from it: http://www.xaprb.com/blog/2013/12/28/immutability-mvcc-and-g...

The reality is shared, mutable state is the most efficient way of working with memory-sized data. People can rant and rave all they want about the benefits of immutability vs mutability, but at the end of the day, if performance is important to you, you'd be best to ignore them.

Actually, to be more honest, reality is more complicated still. MVCC, which many databases use to get ACID semantics over a shared mutable dataset, is really a combination of mutable and immutable.

coffeemug11y ago

Slava @ rethink here.

This is a really interesting subject -- I should do a talk/blog post about this at some point. Here is a quick summary.

RethinkDB's storage engine heavily relies on the notion of immutability/append-only. We never modify blocks of data in place on disk -- all changes are recorded in new blocks. We have a concurrent, incremental compaction algorithm that goes through the old blocks, frees the ones that are outdated, and moves things around when some blocks have mostly garbage.

The system is very fast and rock solid. But...

Getting a storage engine like that to production state is an enormous amount of work and takes a very long time. Rethink's storage engine is really a work of art -- I consider it a marvel of engineering, and I don't mean that as a compliment. If we were starting from scratch, I don't think we'd use this design again. It's great now, but I'm not sure if all the work we put into it was ultimately worth the effort.
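
As a rough illustration of the append-only idea described above, here is a toy sketch in Python. It is not RethinkDB's storage engine: the class and method names are hypothetical, the "disk" is a Python list, and compaction here is a simple stop-the-world copy rather than the concurrent, incremental algorithm the comment describes.

```python
class AppendOnlyStore:
    """Toy append-only store: existing blocks are never modified in place."""

    def __init__(self):
        self.blocks = []  # stand-in for disk: a list of (key, value) blocks
        self.index = {}   # key -> position of the live block for that key

    def put(self, key, value):
        # Every change is recorded in a new block; the old block becomes garbage.
        self.blocks.append((key, value))
        self.index[key] = len(self.blocks) - 1

    def get(self, key):
        return self.blocks[self.index[key]][1]

    def garbage_ratio(self):
        # Fraction of blocks that no longer hold the live value of any key.
        if not self.blocks:
            return 0.0
        return 1 - len(self.index) / len(self.blocks)

    def compact(self):
        # Copy only live blocks, in their original order, and rebuild the index.
        live_positions = sorted(self.index.values())
        live = [self.blocks[pos] for pos in live_positions]
        self.blocks = live
        self.index = {key: i for i, (key, _) in enumerate(live)}


store = AppendOnlyStore()
for i in range(4):
    store.put("a", i)   # four versions of "a", three of them garbage
store.put("b", "x")
```

Even in this toy version the hard part is visible: `compact` has to rewrite live data and swap the index, which is exactly the step that costs extra I/O and space in a real engine.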

boredandroid11y ago

I really think there are a couple of levels of immutability that it is easy to conflate.

Specifically immutability for

1. In memory data structures...this is the contention of the functional programming people.

2. Persistent data stores. This is the lsm style of data structure that substitutes linear writes and compaction for buffered in-place mutation.

3. Distributed system internals--this is a log-centric, "state machine replication" style of data flow between nodes. This is a classic approach in distributed databases, and present in systems like PNUTS.

4. Company-wide data integration and processing around streams of immutable records between systems. This is what I have argued for (http://engineering.linkedin.com/distributed-systems/log-what...) and I think Martin is mostly talking about.

There are a lot of analogies between these but they aren't the same. Success of one of these things doesn't really imply success for any of the others. Functional programming could lose and log-structured data stores could win or vice versa. Pat Helland has made an across the board call for immutability (http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf), but that remains a pretty strong assertion. So it is worth being specific about which level you are thinking about.

For my part I am pretty bullish about stream processing and data flow between systems being built around a log or stream of immutable records as the foundational abstraction. But whether those systems internally are built in functional languages or use an LSM-style data layout on disk is kind of an implementation detail. From my point of view immutability is a lot more helpful in the large than in the small--I have never found small imperative for loops particularly hard to read, but process-wide mutable state is a big pain, and undisciplined dataflow between disparate systems, caches, and applications at the company level can be a real disaster.

eloff11y ago

It's that merge (edit: GC is a better term) step that's difficult to get right. Google screwed this up badly with LevelDB, which had (still has?) horrible performance issues caused by compaction. Even with concurrent compaction it can be difficult, because it needs additional disk space and adds read and write pressure to the storage subsystem, which in turn affects latency. I'm not sure what RethinkDB's approach was there, but I'm very curious to know.

phunge11y ago

Please do -- I've been reading the linkedin/confluent/samza writeups and thinking there's a lot of truth to their ideas. It'd be great to hear more on-the-ground experience from a different perspective.

elliptic11y ago

Is this not how most 'modern' (read 90s) relational db's work?

amelius11y ago

Isn't that just called "journaling", as opposed to "immutable"?

whatthemick11y ago

Everything obviously has trade-offs; no choice is perfect in every way.

For me the pros of having data as an immutable stream of events (eventsourcing) are that you get migrations and data modeling for free - you don't have to design the "perfect" data model in advance (or worry about schema/data migrations later on) - and you can get caching as first-level data rather than as something derived from another store.
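
A small sketch of what "data modeling for free" can look like, assuming a made-up shopping-cart event log (the event fields and function names are hypothetical). Each view is just a fold over the same immutable events, so adding a new view later needs no schema migration, only a replay.

```python
from collections import defaultdict

# The immutable log: events are only ever appended, never updated.
events = [
    {"type": "item_added", "user": "ann", "item": "book"},
    {"type": "item_added", "user": "bob", "item": "pen"},
    {"type": "item_removed", "user": "ann", "item": "book"},
    {"type": "item_added", "user": "ann", "item": "lamp"},
]


def cart_view(log):
    """Current cart contents per user, derived by replaying the log."""
    carts = defaultdict(set)
    for event in log:
        if event["type"] == "item_added":
            carts[event["user"]].add(event["item"])
        elif event["type"] == "item_removed":
            carts[event["user"]].discard(event["item"])
    return dict(carts)


def removal_count_view(log):
    """A view added later: how often each item was removed from a cart.

    A store that only kept the current cart state could not answer this.
    """
    counts = defaultdict(int)
    for event in log:
        if event["type"] == "item_removed":
            counts[event["item"]] += 1
    return dict(counts)


carts = cart_view(events)
removals = removal_count_view(events)
```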

sitkack11y ago

Actually @eloff and the OP are arguing past each other. Both are wrong in different ways.

> Databases are global, shared, mutable state. That’s the way it has been since the 1960s, and no amount of NoSQL has changed that. However, most self-respecting developers have got rid of mutable global variables in their code long ago. So why do we tolerate databases as they are?

This isn't true. Databases _can_ be those things, but that isn't the definition of a database. Most of the databases I have worked on and created do not use update or delete except to archive old data that is no longer in the working set.

And mutability isn't always faster. Most of the time when people are championing mutability, it is because it is the most expedient (esp with their mental model), not because from a whole system standpoint it is actually faster. They trot out a microbenchmark that proves their point while ignoring use cases like retrieving old state or auditing the transaction history.

eloff11y ago

It has its advantages to be sure, I really like eventsourcing. For some kinds of projects it's the obvious way to go. Finance seems like a killer application, because of the way the entire audit trail is stored, and the system can often be rebuilt to a valid state upon finding and correcting a production bug.

shin_lao11y ago

Agreed.

When designing our product it quickly became apparent that immutability isn't practical, so we opted for MVCC ACID transactions, but the most difficult part of an MVCC database is getting the purge right.

You can cheat purges with some clever optimization but at some point, you need to clean up old versions and when you do that, you are using precious I/O.

Getting purge/compaction right is hard. Update-intensive scenarios are always problematic for databases.
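
To make the purge problem concrete, here is a minimal MVCC sketch in Python. It is not the commenter's product: the class, the integer commit timestamps, and the `oldest_active_snapshot` parameter are all assumptions made for illustration. The rule it implements is the usual one: for each key, keep the newest version that the oldest active snapshot can still see, plus everything newer.

```python
class MVCCStore:
    """Toy multi-version store with a logical clock as commit timestamp."""

    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value), oldest first
        self.clock = 0

    def write(self, key, value):
        # Writers never overwrite; they append a new version.
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))
        return self.clock

    def read(self, key, snapshot_ts):
        # A reader sees the newest version committed at or before its snapshot.
        visible = [v for ts, v in self.versions.get(key, []) if ts <= snapshot_ts]
        return visible[-1] if visible else None

    def purge(self, oldest_active_snapshot):
        # Drop versions that no active or future snapshot can observe.
        removed = 0
        for key, chain in self.versions.items():
            keep_from = 0
            for i, (ts, _) in enumerate(chain):
                if ts <= oldest_active_snapshot:
                    keep_from = i  # newest version still visible to that snapshot
            removed += keep_from
            self.versions[key] = chain[keep_from:]
        return removed


db = MVCCStore()
db.write("x", 1)  # commit_ts 1
db.write("x", 2)  # commit_ts 2
db.write("x", 3)  # commit_ts 3
```

The purge walks every version chain, which is where the "precious I/O" goes in a real system; a long-running reader holding an old snapshot also prevents anything newer than its snapshot from being collected.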

eloff11y ago

As a matter of interest, what product is that?

I used some tricks to reduce the amount of data that must be cleaned up at any given point in time, but it was not possible to evade it completely. I still have to do concurrent compaction, and there are some really nasty corner cases I don't have any good solutions for. It's a hard problem.

boredandroid11y ago

I agree with this comment, but I think this blog post is really addressing data flow at the company-wide or datacenter scale. Surprisingly there immutable event data hasn't really had much of a home at all beyond data warehousing.

pavlov11y ago

> ... most self-respecting developers have got rid of mutable global variables in their code long ago.

I'm not convinced that's the case. Almost everyone has merely hidden their mutable globals under layers of abstractions. Things like "singletons", "factories", "controllers", "service objects", "dependency injection" are the vernacular of the masked-globals game.

danellis11y ago

None of those things you said imply mutability. (Okay, maybe singletons, depending on the implementation.)

pavlov11y ago

True, but in practice they tend to be used as containers or initializers for mutable variables.

bmh10011y ago

As one who works with analytics databases and ETL (extract-transform-load) processes a great deal, immutability of data stores is an incredibly valuable property. Maybe append-only does not make sense in operational databases all the time, but for non-real-time analytics, it makes a huge amount of sense. In my case, operational data is queried, optimized for storage space and quick loading, and cached to disk. Because it is an analytics database used for longer-term analysis and planning, daily queries of operational data are sufficient in many cases. Operational workload is not even a consideration. The ETL process also allows for "updating" records in the "T" (transform) part. Updates to operational data are not even necessary, and often impossible, so correcting and enhancing the data for decision making is a huge win for clients. Issues similar to "compaction time" can still occur, but an ETL approach allows for many clean ways of controlling the process and avoiding those failure scenarios.

boredandroid11y ago

Anyone in the Bay Area interested in learning more about Apache Samza should attend the meetup tonight in Mountain View: http://www.meetup.com/Bay-Area-Samza-Meetup/events/220354853...

shanemhansen11y ago

I'm not sold on Samza, but I can tell you that creating isolated services that create their datastore from a stream of events is a really useful pattern in some use cases (ad-tech).

I've made use of NSQ to stream user update events (products viewed, orders placed) to servers sitting at the network edge which cache the info in leveldb. Our request latency was something like 10 microseconds over go's json/rpc. We weren't even able to come close to that in the other nosql database servers we tried, even with aggressive caching turned on.

anonymousDan11y ago

What don't you like about Samza out of interest? Something fundamental with their model or more implementation related?

shanemhansen11y ago

I've seen organizations have lots of trouble operationally with kafka (which samza uses). I've seen NSQ be extremely reliable operationally.

However they offer very different guarantees so it's an apples to oranges comparison. NSQ isn't really designed to provide a replayable history, although you can fake it by registering a consumer which does nothing but log to file (nsq_to_file) and that works pretty well.

(disclaimer: the nsq mailing list has lots of chatter these days, nsq may be growing features I'm not aware of)

sivers11y ago

Similar interesting talk by Rich Hickey:

http://www.infoq.com/presentations/Value-Values

luddypants11y ago

I was wondering how this relates to Datomic... I'm not really familiar enough to say much about similarities and differences, but would be interested if someone who is could comment.

ludwigvan11y ago

I asked the same question at the end of his talk, see the relevant section in the video:

https://www.youtube.com/watch?v=fU9hR3kiOK0&t=2579

vkjv11y ago

You can do similar "magic" cache invalidation with Elasticsearch and the percolate feature. Each time you do a query and cache some transformation of the result, put that query in a percolate index. Then when you change a document, run the document against the percolate index and, voila, you get the queries that would have returned it and can then invalidate your cache.

This method of cache invalidation fails in a very key place though (just like in the article). What happens if you change a very core thing that invalidates a large percentage of the cache?
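
The idea can be sketched without Elasticsearch at all. The following is a plain-Python stand-in, not the percolate API: the `QueryCache` class is hypothetical, and queries are represented as Python predicates rather than stored query documents.

```python
class QueryCache:
    """Caches query results and remembers which queries produced them."""

    def __init__(self):
        self.results = {}     # query name -> cached result
        self.registered = {}  # query name -> predicate (stand-in for the percolate index)

    def run(self, name, predicate, docs):
        # On a cache miss, compute the result and register the query.
        if name not in self.results:
            self.results[name] = [d for d in docs if predicate(d)]
            self.registered[name] = predicate
        return self.results[name]

    def on_change(self, doc):
        # "Percolate" the changed document: find every registered query
        # that would have returned it, and invalidate those cache entries.
        hit = [name for name, pred in self.registered.items() if pred(doc)]
        for name in hit:
            del self.results[name]
            del self.registered[name]
        return hit


docs = [{"id": 1, "tag": "a"}, {"id": 2, "tag": "b"}]
cache = QueryCache()
cache.run("tag_a", lambda d: d["tag"] == "a", docs)
cache.run("tag_b", lambda d: d["tag"] == "b", docs)
invalidated = cache.on_change({"id": 3, "tag": "a"})
```

The failure case from the comment shows up directly: a document that matches most registered predicates makes `on_change` evict most of the cache at once.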

fizx11y ago

What you're hoping for is that some cacheable function of many documents is also a monoid.

In an example, you're hoping that when you invalidate the query "SELECT COUNT(*) FROM foo WHERE x = 1" because a new document that matched came in, you're simply incrementing the existing cached value, rather than rescanning the database index.
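
A minimal sketch of that COUNT(*) case, with hypothetical names. Counts form a monoid under addition (identity 0, associative combine), so the contribution of one new row can be folded into the cached value without rescanning.

```python
class CachedCount:
    """Cached COUNT(*) for a fixed predicate, maintained incrementally."""

    def __init__(self, predicate, rows):
        self.predicate = predicate
        # One full scan when the cache entry is first built.
        self.value = sum(1 for row in rows if predicate(row))

    def on_insert(self, row):
        # Combine the cached count with the new row's contribution (0 or 1)
        # instead of invalidating and rescanning.
        self.value += 1 if self.predicate(row) else 0
        return self.value


foo = [{"x": 1}, {"x": 2}, {"x": 1}]
count_x1 = CachedCount(lambda row: row["x"] == 1, foo)
initial = count_x1.value
after_match = count_x1.on_insert({"x": 1})
after_miss = count_x1.on_insert({"x": 5})
```

Aggregates such as a median do not combine this way, which is why the comment says you are "hoping" the cached function is a monoid.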

bonobo300011y ago

This is a cool idea - the holy grail scenario I'm envisioning is storing all data in the log i.e

1. the transaction log is a central repository for all data

2. much more detailed data is stored, enough that analytics can run off this same source of data

The amount of data generated increases in proportion to the number of updates on a row/piece of data, whereas with a mutable solution it is constant with respect to the number of updates on the same data. That is a pretty big scaling difference.

However, storing that much data translates to much higher costs for HDDs/servers, or possibly lower write performance if the log is stored on something like HDFS.

There would also be performance costs for building and updating a materialized view. Imagine a scenario like this:

Events -> A B C D E F G H I J K
Materialized view M has been computed up to item J (but not K yet)
Read/Query M

Now either writing K incurs the cost of waiting for all dependent views to materialize, or the read on M incurs the cost of updating M.

Some fusion of this would be pretty interesting though. For example, what if we just query on M without applying any updates if there have been <X updates? That translates to similar guarantees as an eventually consistent DB - the data could be stale. At least it gives us more control over this tradeoff.
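
That "fusion" could look roughly like the sketch below. Everything here is an assumption for illustration: the view is a running sum, the log is a Python list, and `max_lag` plays the role of X. Reads are served from the stale view while fewer than `max_lag` events are pending; otherwise the read pays for catching up.

```python
class LazyView:
    """Materialized view that tolerates a bounded number of unapplied events."""

    def __init__(self, log, max_lag):
        self.log = log          # shared append-only event log
        self.max_lag = max_lag  # the X in "<X updates"
        self.applied = 0        # number of log entries folded into the view
        self.total = 0          # the materialized value (here: a sum)

    def lag(self):
        return len(self.log) - self.applied

    def refresh(self):
        # Fold all pending events into the view.
        for event in self.log[self.applied:]:
            self.total += event
        self.applied = len(self.log)

    def query(self):
        # Stale reads are allowed only while the view is less than max_lag behind.
        if self.lag() >= self.max_lag:
            self.refresh()
        return self.total


log = [1, 2, 3]
view = LazyView(log, max_lag=2)
first = view.query()   # 3 pending events, so the read catches up first
log.append(4)
second = view.query()  # 1 pending event, served stale
log.append(5)
third = view.query()   # 2 pending events, catches up again
```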

swah11y ago

I really enjoyed reading about Storm too: http://nathanmarz.com/blog/history-of-apache-storm-and-lesso...

This kind of "competition" leads to analysis paralysis though. It's much better when there is a single winner...

sitkack11y ago

You mean like Hadoop? I disagree, the popular bad solution starves out innovation by sucking all the air out of the room. Easier to decide on the globally bad choice.

swah11y ago

I meant Samza and Storm. I thought both of them could run on Hadoop.

bambax11y ago

> A more promising model, used in some systems, is to think of a database as an always-growing collection of immutable facts.

That would already be huge progress over how databases are currently used; if records were in fact immutable many problems would be instantly solved.

hyc_symas11y ago

You would just be trading them for the intensely ugly problem of garbage collection. Disk space is cheap, but it's not infinitely cheap. There are plenty of append-only data stores out there now, and they all suffer from compaction-related performance issues.

steve-rodrigue11y ago

Does anyone know which app has been used to create the "handwritten" images? I draw very badly so I'm looking for such an app to explain data flows on a corporate blog/wiki.

discardorama11y ago

It looks like the free app "Paper" by FiftyThree, available on iPads: https://itunes.apple.com/us/app/paper-by-fiftythree/id506003...

agentultra11y ago

From the comments in the article, it appears to be the Paper app for iOS/iPad and a stylus.

jamii11y ago

Windows Journal is also perfectly serviceable - https://drive.google.com/file/d/0Bxjbk6tMrOKQcXhKT1dIVkQ5ZVE...

felixthehat11y ago

Also Draw by Adobe has similar results (free!) https://itunes.apple.com/us/app/adobe-draw/id911156590

hyc_symas11y ago

Streams - another reinvention of LDAP Persistent Search.

Yes, there really are protocols that handle single request/multiple response interactions, and they've been around for decades. Unlike crap built on HTTP, which was never intended for uses like this, these protocols work well with multiple concurrent requests in flight simultaneously, etc.

hyperliner11y ago

Conceptually, one of the challenges of streams as first class citizens is that humans don't do well with them. For the purposes of analysis, humans need a "snapshot" or fix on the data. This way they can derive insights from the data and act on human things. The reality is that, for many real-world scenarios, a real-time view of the data is not just a luxury, it's actually a drawback, because data changes are noisy. Many human problems deal with abstract representations of the actual data, and so imprecision is part of the problem.

I really like the talk from the point of view of simplifying the system-wide problems caused by a gigantic mutable state. But I feel that at the border of system to humans there will be other issues to discuss.

fiatjaf11y ago

This is CouchDB, right?
