The reality is shared, mutable state is the most efficient way of working with memory-sized data. People can rant and rave all they want about the benefits of immutability vs mutability, but at the end of the day, if performance is important to you, you'd be best to ignore them.
Actually, to be more honest, reality is more complicated still. MVCC that many databases use to get ACID semantics over a shared mutable dataset is really a combination of mutable and immutable.
This is a really interesting subject -- I should do a talk/blog post about this at some point. Here is a quick summary.
RethinkDB's storage engine heavily relies on the notion of immutability/append-only. We never modify blocks of data in place on disk -- all changes are recorded in new blocks. We have a concurrent, incremental compaction algorithm that goes through the old blocks, frees the ones that are outdated, and moves things around when some blocks have mostly garbage.
The system is very fast and rock solid. But...
Getting a storage engine like that to production state is an enormous amount of work and takes a very long time. Rethink's storage engine is really a work of art -- I consider it a marvel of engineering, and I don't mean that as a compliment. If we were starting from scratch, I don't think we'd use this design again. It's great now, but I'm not sure if all the work we put into it was ultimately worth the effort.
Specifically immutability for
1. In memory data structures...this is the contention of the functional programming people.
2. Persistent data stores. This is the lsm style of data structure that substitutes linear writes and compaction for buffered in-place mutation.
3. Distributed system internals--this is a log-centric, "state machine replication" style of data flow between nodes. This is a classic approach in distributed databases, and present in systems like PNUTs.
4. Company-wide data integration and processing around streams of immutable records between systems. This is what I have argued for (http://engineering.linkedin.com/distributed-systems/log-what...) and I think Martin is mostly talking about.
There are a lot of analogies between these but they aren't the same. Success of one of these things doesn't really imply success for any of the others. Functional programming could lose and log-structured data stores could win or vice versa. Pat Helland has made an across the board call for immutability (http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf), but that remains a pretty strong assertion. So it is worth being specific about which level you are thinking about.
For my part I am pretty bullish about stream processing and data flow between systems being built around a log or stream of immutable records as the foundational abstraction. But whether those systems internally are built in functional languages, use lsm style data layout on disk is kind of an implementation detail. From my point of view immutability is a lot more helpful in the large than in the small--I have never found small imperative for loops particularly hard to read, but process-wide mutable state is a big pain, and undisciplined dataflow between disparate systems, caches, and applications at the company level can be a real disaster.
For me the pros of having data as an immutable stream of events (eventsourcing) is that you get migrations and data modeling for free - You don't have to deal with having to design the "perfect" data model in advance (or worry about schema/data migrations later on) and you can get caching as first level data rather as derived from another store.
> Databases are global, shared, mutable state. That’s the way it has been since the 1960s, and no amount of NoSQL has changed that. However, most self-respecting developers have got rid of mutable global variables in their code long ago. So why do we tolerate databases as they are?
This isn't true. Databases _can_ be those things, but that isn't the definition of a database. Most of the databases I have worked and created do not use update or delete except to archive old data that is no longer in the working set.
And mutability isn't always faster. Most of the time when people are championing mutability, it is because it is the most expedient (esp with their mental model), not because from a whole system standpoint it is actually faster. They trot out a microbenchmark that proves their point while ignoring use cases like retrieving old state or auditing the transaction history.
When designing our product it became quickly apparent that immutability isn't practical, we opted for MVCC ACID transactions, but the most difficult part of MVCC database is getting the purge right.
You can cheat purges with some clever optimization but at some point, you need to clean up old versions and when you do that, you are using precious I/O.
Getting purge/compaction right is hard. Update intensive scenarii are always problematic for databases.
I used some tricks to reduce the amount of data that must be cleaned up at any given point in time, but it was not possible to evade it completely. I still have to do concurrent compaction, and there's some really nasty corner cases I don't have any good solutions for, it's a hard problem.
I'm not convinced that's the case. Almost everyone has merely hidden their mutable globals under layers of abstractions. Things like "singletons", "factories", "controllers", "service objects", "dependency injection" are the vernacular of the masked-globals game.
I've made use of NSQ to stream user update events (products viewed, orders placed) to servers sitting at the network edge which cache the info in leveldb. Our request latency was something like 10 microseconds over go's json/rpc. We weren't even able to come close to that in the other nosql database servers we tried, even with aggressive caching turned on.
However they offer very different guarantees so it's an apples to oranges comparison. NSQ isn't really designed to provide a replayable history, although you can fake it by registering a consumer which does nothing but log to file (nsq_to_file) and that works pretty well.
(disclaimer: the nsq mailing list has lots of chatter these days, nsq may be growing features I'm not aware of)
This method of cache invalidation fails in a very key place though (just like in the article). What happens if you change a very core thing that invalidates a large percentage of the cache?
In an example, you're hoping that when you invalidate the query "SELECT COUNT(*) FROM foo WHERE x = 1" because a new document that matched came in, you're simply incrementing the existing cached value, rather than rescanning the database index.
1. the transaction log is a central repository for all data 2. much more detailed data is stored, enough that analytics and can run off this same source of data
The amount of data generated increases proportional to the number of updates on a row/piece of data whereas with a mutable solution, it is constant w.r.t number of updates on the same data. That is a pretty big scaling difference.
However, storing that much data translates to much higher costs for HDDs/servers, or possibly lower write performance if the log is stored on something like HDFS.
There would also be performance costs for building and updating a materialized view. Imagine a scenario like this:
Events -> A B C D E F G H I J K Materialized view M has been computed up to item J (but not K yet) Read/Query M
Now either writing K incurs the cost of waiting for all dependent views to materialize, or the read on M incurs the cost of updating M.
Some fusion of this would be pretty interesting though. For example, what if we just query on M without applying any updates if there have been <X updates? That translates to similar guarantees as an eventually consistent DB - the data could be stale. Atleast it gives us more control over this tradeoff.
This kind of "competition" leads to analysis paralysis though. Its much better when there is a single winner...
That would already be a huge progress over how databases are currently used; if records were in fact immutable many problems would be instantly solved.
Yes, there really are protocols that handle single request/multiple response interactions, and they've been around for decades. Unlike crap built on HTTP, which was never intended for uses like this, these protocols work well with multiple concurrent requests in flight simultaneously, etc.
I really like the talk from the point of view of simplifying the system-wide problems caused by a gigantic mutable state. But I feel that at the border of system to humans there will be other issues to discuss.