"So is it crazy to do this? The answer is no, there’s nothing crazy about storing data in Kafka: it works well for this because it was designed to do it. Data in Kafka is persisted to disk, checksummed, and replicated for fault tolerance. Accumulating more stored data doesn’t make it slower. There are Kafka clusters running in production with over a petabyte of stored data."
[1] https://www.confluent.io/blog/okay-store-data-apache-kafka/
https://news.ycombinator.com/item?id=23206566
Apache Kafka also doesn’t refer to itself as a database in its own documentation:
> I think it makes more sense to think of your datacenter as a giant database, in that database Kafka is the commit log, and these various storage systems are kinds of derived indexes or views.
My company supports analytic systems and we see this pattern constantly. It's also sort of a Pat Helland view of the world, one that subsumes a large fraction of data management under a relatively simple idea. [1] What's interesting is that Pat also sees it as a way to avoid sticky coordination problems.
If you wanted to literally use Kafka as your commit log, the way Amazon Aurora uses a distributed commit log, you would find that a lot of features a commit log needs are missing from Kafka and impossible to add.
That is a valid theory when we talk about readers that look at recent data, or when you are appending data to an existing system.
But in practice, the accumulation of cold data on a local disk is where this starts to hurt, particularly if it has to serve read traffic that starts from the beginning of time (i.e. your queries don't start with a timestamp range).
KSQL transforms do help reduce the depth of the traversal by building flatter versions of the data set, but you need to repartition the same data on every lookup key you want - so if you had a video-game log trace, you'd need separate materializations for (user), (user, game), (game), etc.
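A toy sketch of what that repartitioning means in practice (event shape and numbers made up): the same log has to be folded into one materialization per lookup key you want to serve.

```python
from collections import defaultdict

# Hypothetical video-game log events: (user, game, minutes played).
events = [
    ("alice", "chess", 10),
    ("bob", "chess", 5),
    ("alice", "go", 20),
    ("alice", "chess", 15),
]

# One materialization of the same log per lookup key.
by_user = defaultdict(int)
by_user_game = defaultdict(int)
by_game = defaultdict(int)

for user, game, minutes in events:
    by_user[user] += minutes               # keyed by (user)
    by_user_game[(user, game)] += minutes  # keyed by (user, game)
    by_game[game] += minutes               # keyed by (game)

print(by_user["alice"])                  # 45
print(by_user_game[("alice", "chess")])  # 25
print(by_game["chess"])                  # 30
```

In a real deployment each of those dicts is a separately repartitioned, separately stored copy of the data, which is where the cost comes from.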
And on the local storage part, EBS is expensive just to hold cold data, and then you replicate it to maintain availability during a node recovery - EBS is only about a 1.5x-redundant store, better than a single node. I liked the Druid segment model of shoving data off to S3 while still being able to read off it (i.e. not just streaming to S3 as a dumping ground).
When Pravega came out, I liked it a lot for the same reason - but it hasn't gained enough traction.
We ran a cluster with lots of compacted topics, hundreds of terabytes of data. At the time it would make broker startup insanely slow. An unclean startup could literally take an hour to go through all the compacted partitions. It was awful.
At its core it’s electron state in hardware. So long as those limits are not incidentally exceeded, and you validate outputs, who really cares what gets loaded?
We rip rare minerals from the ground and toss all that hardware at scale every 3-5 years, yet we get stingy about how we install software.
So long as it offers the necessary order of operations to do the work, whatever.
The "event sourced" architecture they sketched is missing pieces. Normally you'd have single writer instances that are locked to the corresponding Kafka partition, which ensure strong transactional guarantees, IF you need them.
Throwing shade for marketing's sake is something that they should be above.
I mean c'mon, I'd argue that Postgres enhanced with Materialize isn't a database anymore either, but in a good sense!
It's building material. A hybrid between MQ, DB, backend logic & frontend logic.
The reduction in application logic and the increase in reliability you can get from reactive systems is insane.
SQL is declarative, reactive Materialize streams are declarative on a whole new level.
Once that tech makes it into other parts of computing like the frontend, development will be so much better: less code, fewer bugs, a lot more fun.
Imagine that your react component could simply declare all the data it needs from a db, and the system will figure out all the caching and rerendering.
So yeah, they have awesome tech with many advantages, so I don't get why they bad-mouth other architectures.
> SQL is declarative, reactive Materialize streams are declarative on a whole new level.
Thank you for the kind words about our tech, I'm flattered! That said, this dream is downstream of Kafka. Most of our quibbles with the Kafka-as-database architecture are to do with the fact that that architecture neglects the work that needs to be done _upstream_ of Kafka.
That work is best done with an OLTP database. Funnily enough, neither of us are building OLTP databases, but this piece largely is a defense of OLTP databases (if you're curious, yes, I'd recommend CockroachDB), and their virtues at that head of the data pipeline.
Kafka has its place - and when its used downstream of CDC from said OLTP database (using, e.g. Debezium), we could not be happier with it (and we say so).
The best example is foreign key checks. Kafka-as-database is not good if you ever need to enforce them (which translates to checking a denormalization of your source data _transactionally_ while deciding whether to admit or deny an event). This is something that you may not need in your data pipeline on day 1, but adding it later is a trivial schema change with an OLTP database, and exceedingly difficult with a Kafka-based event-sourced architecture.
> Normally you'd have single writer instances that are locked to the corresponding Kafka partition, which ensure strong transactional guarantees, IF you need them.
This still does not deal with the use-case of needing to add a foreign key check. You'd have to:
1. Log "intents to write" rather than writes themselves in Topic A.
2. Have a separate denormalization computed and kept in a separate Topic B, which can be read from. This denormalization needs to be read until the intent propagates from Topic A.
3. Convert those intents into commits.
4. Deal with all the failure cases in a distributed system, e.g. cleaning up abandoned intents, etc.
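A toy sketch of that intent/commit dance, with all the distributed-systems pain stripped out (every name here is made up, and step 4 is reduced to a comment):

```python
# Topic A holds "intents"; a denormalized view of committed state is
# consulted before an intent is converted into a commit or an abort.

committed_users = {"u1"}   # denormalized view: users known to exist
topic_a = []               # intents log (stands in for Kafka Topic A)
topic_commits = []         # committed events

def write_order(order_id, user_id):
    intent = {"order": order_id, "user": user_id}
    topic_a.append(intent)                 # 1. log the intent
    if intent["user"] in committed_users:  # 2. check the denormalization
        topic_commits.append(intent)       # 3. convert intent -> commit
        return "committed"
    # 4. failure path: the abandoned intent must eventually be cleaned up
    return "aborted"

print(write_order("o1", "u1"))    # committed
print(write_order("o2", "u999"))  # aborted (foreign key violation)
```

The hard part the sketch hides is that in a real system steps 1-3 race against other writers and crashes, which is exactly the machinery an OLTP database gives you for free.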
If you use an OLTP database, and generate events into Kafka via CDC, you get the best of both worlds. And hopefully, yes, have a reactive declarative stack downstream of that as well!
People do do this. I have done this. I wish I had been more principled with the error paths. It got there _eventually_.
It was a lot of code and complexity to ship a feature which in retrospect could have been nearly trivial with a transactional database. I'd say months rather than days. I won't get those years of my life back.
The products were built on top of Kafka, Cassandra, and Elasticsearch where, over time, there was a desire to maintain some amount of referential integrity. The only reason we bought into this architecture at the time was horizontal scalability (not even multi-region). Kafka, sagas, 2PC at the "application layer" can work, but you're going to spend a heck of a lot on engineering.
It was this experience that drove me to Cockroach and I've been spreading the good word ever since.
> If you use an OLTP database, and generate events into Kafka via CDC, you get the best of both worlds.
This is the next chapter in the gospel of the distributed transaction.
100% agree this is the way to go. Instead of rolling your own transaction support, you get ACID for free from the DB and use Kafka to archive changes and subscribe to them.
I think the general gist of "use an OLTP database as your write model if you don't absolutely know what you're doing" is completely sane advice, however I think there are far more nuanced (and in a sense also honest) arguments that can be made.
I think the architecture you've sketched is over-engineered for the task. So here's what I'd build for your inventory example, IFF the inventory were managed for a company with a multi-tera-item inventory that absolutely NEEDS horizontal scaling:
One event topic in Kafka, partitioned along the ID of the different stores whose inventory is to be managed. This makes the (arguably) strong assumption that inventory can never move between stores, but for e.g. a Grocery chain that's extremely realistic.
We have one writer per store-ID partition, which generates the events and enforces _serialisability_, with a hot writer failover that keeps up to date and a STONITH mechanism connecting the two. All writing REST calls / GraphQL mutations for its store-ID range go directly to that node.
The node serves all requests from memory, out of an immutable Data-structure, e.g. Immutable.js, Clojure Maps, Datascript, or an embedded DB that supports transactions and rollbacks, like SQLite.
Whenever a write occurs, the writer generates the appropriate events, applies them to its internal state, validates that all invariants are met, and then emits the events to Kafka. Kafka acknowledging the write is potentially much quicker than acknowledging an OLTP transaction, because Kafka only needs to get the events into the memory of 3+ machines, instead of written to disk on 1+ machine (I'm ignoring OLTP validation overhead here because our writer already did that). Also your default failure resistance is much higher than what most OLTP systems provide in their default configuration (e.g. equivalent to Postgres synchronous replication).
Note that the critical section doesn't actually have to be the whole "generate event -> apply -> validate -> commit to Kafka" code. You can optimistically generate events, apply them, and then invalidate and retry all other attempts once one of them commits to Kafka. However, that introduces coordination overhead, so you might be better served mindlessly working off requests one by one.
Once the write has been acknowledged by Kafka, you swap the variable/global/atom with the new immutable state or commit the transaction, and continue with the next incoming request.
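Here's a minimal, in-memory sketch of that write path (all names invented; a plain list stands in for the Kafka partition, and a dict copy stands in for the immutable data structure):

```python
log = []  # stands in for the Kafka partition

def handle_buy(state, store, item):
    # Generate the event.
    event = ("buy", store, item)
    # Apply it to a *new* state; the old snapshot stays untouched.
    new_state = dict(state)
    new_state[(store, item)] = new_state.get((store, item), 0) - 1
    # Validate invariants before committing anything.
    if new_state[(store, item)] < 0:
        return state, "rejected"  # old snapshot survives, nothing emitted
    # Commit to the log (Kafka acknowledging the write).
    log.append(event)
    # Only now swap in the new snapshot.
    return new_state, "ok"

state = {("store-1", "apples"): 1}
state, r1 = handle_buy(state, "store-1", "apples")
state, r2 = handle_buy(state, "store-1", "apples")
print(r1, r2)    # ok rejected
print(len(log))  # 1
```

The point of the ordering is that a crash between validate and commit loses nothing: the snapshot is only swapped after the log has the event.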
All the other (reading) requests are handled by various views on the Kafka topic (the one causing the inconsistencies in the article). They might be lagging behind a bit, but that's totally fine, as all writing operations have to go through the invariant-enforcing write model anyways. So they're allowed to be slow-ish, or have varying QOS in terms of freshness.
The advantage of this architecture is that you have few moving parts, but those are nicely decomplected, as Rich Hickey would say: you use the minimal state for writing, which fits into memory and caches; you get zero lock congestion on writes (no locks); you make the boundaries for transactions explicit and gain absolutely free rein on constraints within them; and you get serialisability for your events, which is super easy to reason about mentally. Plus you don't get performance penalties for recomputing table views for view models. (If you don't use change capture and Materialize already, which one should of course ;] )
The two generals problem dictates that you can't have more than a single writer in a distributed system for a single "transactional domain" anyways. All our consensus protocols are fundamentally leader election. The same is true for OLTP databases internally (threads, table/row locks, etc.), so if you can't handle the load on a single zero-overhead writer that just takes care of its small transactional domain, then your database will run into the exact same issues, probably earlier.
Another advantage of this that has so far gone unmentioned is that it allows you to provide global identifiers for the state in your partitions that can be communicated in effectful interactions with the outside world. If your external service allows you to store a tiny bit of metadata with each effectful API call, then you can include the offset of the current event, and thus the state. That way you can subsume the external state and transactional domain into the transactional domain of your partition.
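A toy sketch of that idempotency trick (the external API shape is invented): store the event offset alongside the effect, and replays of already-applied offsets become no-ops.

```python
# The external service's per-key metadata slot: key -> (offset, payload).
external_store = {}

def call_external(key, offset, payload):
    # If the external service already recorded this offset (or a later
    # one), the effect already happened: skip the replay.
    if external_store.get(key, (-1, None))[0] >= offset:
        return "skipped"
    external_store[key] = (offset, payload)
    return "applied"

print(call_external("invoice-7", 41, "charge $10"))  # applied
print(call_external("invoice-7", 41, "charge $10"))  # skipped (replay)
print(call_external("invoice-7", 42, "charge $5"))   # applied
```

This is why the partition offset makes such a good global identifier: after a crash you can replay the log from any earlier point and the external side effects dedupe themselves.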
Now, I think that's a much more reasonable architecture, that at least doesn't have any of the consistency issues. So let's take it apart and show why the general populace is much better served with an OLTP database:
- Kafka is an Ops nightmare. The setup of a cluster requires A LOT of configuration. Also ZooKeeper, urgh; they're luckily trying to get rid of it, but I think they only dropped it this year, and I'm not sure how mature that is.
- You absolutely 100% need immutable data structures, or something else that manages transactions for you inside the writer. You DO NOT want to manually roll back changes in your write model. Defensive copying is a crutch: slow and error-prone (cue JS, urgh...: {...state}).
- Your write model NEEDS to fit into memory. That thing is the needle's eye that all your data has to go through. If you run the single-event-loop variant, latency during event application WILL break you. If you do the optimistic-concurrency variant, performing validation checks might be as expensive as, or more expensive than, recomputing the events from scratch.
- Be VERY wary of communications with external services that happen in your write model. They introduce latency, and they break the transactional boundary that you set up before. To be fair, OLTPs also suffer from this, because it's distributed consistency with more than one writer and arbitrary invariants, which this universe simply doesn't allow for.
- As mentioned before, it's possible to optimistically generate and apply events thanks to the persistent data structures, but that is also logic you have to maintain, and which is essentially a very naive and simple embedded OLTP. Also be wary of what you think improves performance vs. what actually improves it. It might be better to have 1 cool core than 16 really hot ones doing wasted work.
- If you don't choose your transactional domains well, or the requirements change, you're potentially in deep trouble. You can't transact across domains/partitions, if you do, they're the same domain, and potentially overload a single writer.
- Transactional domains are actually not as simple as they're often portrayed. They can nest and intersect. You'll always need a single writer, but that writer can delegate responsibility, which might be a much cheaper operation than the work itself. Take bank accounts as an example. You still need a single writer/leader to decide which account IDs are currently in a transaction with each other, but if two accounts are currently free, that single writer can tie them into a single transactional domain and delegate it to a different node, which will perform the transaction and write, then return control to the "transaction manager". A different name for such a transaction manager is an OLTP database (with row-level locking).
- You won't find as many tutorials. If you're not comfortable reading scientific papers, or at least academic-grade books like Martin Kleppmann's "Designing Data-Intensive Applications", don't go there.
- You probably won't scale beyond what a single OLTP DB can provide anyways. Choose tech that is easy to use and gives you as many guarantees as possible if you can. With change capture you can also do retroactive event analytics and views, but you don't have to code up a write-model (and associated framework, because let's be honest this stuff is still cutting edge, and really shines in bespoke custom solutions for bespoke custom problems).
Having such an architecture under your belt is a super useful tool that can be applied to a lot of really interesting and hard problems, but that doesn't mean that it should be used indiscriminately.
Afterthought: I've seen so many people use an OLTP database but then perform multiple transactions inside a single request handler, just because that's what their ORM was set to. So I'm just happy about any thinking people spend on transactions and consistency in their systems, in whatever shape or form, and I think making the concepts explicit instead of hiding them in a complex (and for many, unapproachable) OLTP/RDBMS monster helps with that (whether Kafka is less of a monster is another story).
I think it's also important to not underestimate the programmer convenience that working with language native (persistent) data-structures has. The writer itself in its naive implementation is something that one can understand in full, and not relying on opaque and transaction breaking ORMs is a huge win.
PS: Please start something collaborative with ObservableHQ; having reactive notebook-based dashboards over reactive Postgres queries would be so, so, so, so awesome!
To date, I've never written code that reads from a Kafka topic that wasn't taking data and transforming + materializing it into domain intelligence.
Little-known fun fact: the Rust type- and borrow-checker uses a Datalog engine internally to express typing rules, and that engine was written and improved by Frank McSherry. So whenever you hit compile on a Rust program, you're using a tiny bit of Materialize tech.
It serves as anti-marketing, at least to me.
What the authors mean is that kafka is not a traditional database and doesn't solve the same problems that traditional databases solve. Which is a useful distinction to make but is not the distinction they make.
The reality is that "database" is now a very general term, and for many use cases you can choose special-purpose databases for what you need.
I'd argue that a filesystem is a data store, rather than a database.
Whether we want it to be so or not, the term "database" is much more encompassing than it used to be. You can try to fight that change if you want to, but it means you'll be speaking a different language than most of the rest of us as a result.
You should be able to post "buy" messages to a topic without fear that it messes up your data integrity. Who cares if two people are fighting over the last item? You have a durable log. Post both "buys" and wait for the "confirm" message from a consumer that's reading the log at that point in time, validates, and confirms or rejects the buys. At the point that the buy reaches a consumer there is enough information to know for sure whether it's valid or not. Both of the buy events happened and should be recorded whether they can be fulfilled or not.
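A toy sketch of that flow (all names made up): both buys land in the durable log either way, and a single consumer reading the log in order is what decides who actually gets the item.

```python
topic = []  # durable, ordered log of buy events
inventory = {"last-item": 1}

def post_buy(user, item):
    # Posting never fails on contention: the event is just recorded.
    topic.append({"user": user, "item": item})

def run_consumer():
    # Reads the log in order, so confirm/reject decisions are serial.
    results = []
    for buy in topic:
        if inventory.get(buy["item"], 0) > 0:
            inventory[buy["item"]] -= 1
            results.append((buy["user"], "confirmed"))
        else:
            results.append((buy["user"], "rejected"))
    return results

post_buy("alice", "last-item")
post_buy("bob", "last-item")
print(run_consumer())  # [('alice', 'confirmed'), ('bob', 'rejected')]
print(len(topic))      # 2: both buy events are recorded either way
```

Note the trade this makes: the client has to wait for the consumer's confirm/reject message instead of getting an answer inside a transaction.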
The two people, at least. Customers tend to be a bit underwhelmed by "well, the CAP theorem..." as a customer service script.
Kafka:
User clicks buy and it shows “processing” which behind the scenes posts the buy message and waits for a “confirmed” message. When it’s confirmed user is directed to success! If someone else posts the buy before them they get back a “failed: sold out” message.
Relational:
User clicks buy and it shows “processing” which behind the scenes tries to get a lock on the db, looks at inventory, updates it if there’s still one available, and creates a row in the purchases table. If all this works the user is directed to success. If by the time the lock was acquired the inventory was zero the server returns “failure: sold out”.
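The relational flow can be sketched with sqlite3, with a single transaction playing the role of the lock (the schema is invented for the example):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE inventory (item TEXT PRIMARY KEY, qty INTEGER)")
db.execute("CREATE TABLE purchases (user TEXT, item TEXT)")
db.execute("INSERT INTO inventory VALUES ('last-item', 1)")
db.commit()

def buy(user, item):
    # One transaction: check inventory, decrement, record the purchase.
    with db:
        qty = db.execute(
            "SELECT qty FROM inventory WHERE item = ?", (item,)
        ).fetchone()[0]
        if qty < 1:
            return "failure: sold out"
        db.execute(
            "UPDATE inventory SET qty = qty - 1 WHERE item = ?", (item,))
        db.execute("INSERT INTO purchases VALUES (?, ?)", (user, item))
        return "success"

print(buy("alice", "last-item"))  # success
print(buy("bob", "last-item"))    # failure: sold out
```

Same outcome for the user as the Kafka version; the difference is where the serialization point lives (the database transaction vs. the ordered log consumer).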
The example they give is very simplistic. With the correct design of kafka topics and events the problem of the example can be fixed.
And according to oracle https://www.oracle.com/database/what-is-database/ :
> A database is an organized collection of structured information, or data, typically stored electronically in a computer system.
So Kafka clearly fits that definition.
Honestly as long as you don't use it as a general purpose database, it might very well be the best choice for your use-case.
Also, a good starter for learning how to use a streaming store like Kafka as a DB is the video "Database inside-out": https://www.youtube.com/watch?v=fU9hR3kiOK0
Everything abstracted to the highest level is the same, but problems aren't solved at the highest level.
The devil, as they say, is in the details.
Funnily enough a list of events is pretty much what a transaction log is in a standard db. Although the events have more of a business meaning. In many ways event sourcing is removing a lot of abstraction databases give you.
Yes, ACID works for one database. Many databases? Not so much.
> Funnily enough a list of events is pretty much what a transaction log is in a standard db.
When I "SELECT Balance WHERE user = 12345", I usually just get back a balance, I don't get back the transaction log.
If nothing else, adopting the Kafka model gets your teammates to append updates to your ledger, rather than changing values in-place.
Normally you only want concurrency control within certain boundaries.
By figuring out the minimum number of transaction and concurrency boundaries, you can eke out quite a bit of performance.
I understand that the feature set of timely dataflow is more flexible than Spark's - I just don't understand why (I couldn't figure it out from the paper; academic papers really go over my head).
So, streaming one new record in and seeing how this changes the results of a multi-way join with many other large relations can happen in milliseconds in TD, vs batch systems which will re-read the large inputs as well.
This isn't a fundamentally new difference; Flink had this difference from Spark as far back as 2014. There are other differences between Flink and TD that have to do with state sharing and iteration, but I'd crack open the papers and check out the obligatory "related work" sections each should have.
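A toy sketch of the incremental-join idea (not TD's actual implementation): each side of the join keeps an index, and a newly streamed record only probes the *other* side's index instead of re-reading both inputs from scratch.

```python
from collections import defaultdict

left_index = defaultdict(list)   # join key -> values seen on the left
right_index = defaultdict(list)  # join key -> values seen on the right
join_output = []

def insert_left(key, value):
    left_index[key].append(value)
    # Work is proportional to the number of matches, not relation sizes.
    for rv in right_index[key]:
        join_output.append((key, value, rv))

def insert_right(key, value):
    right_index[key].append(value)
    for lv in left_index[key]:
        join_output.append((key, lv, value))

insert_left("k1", "a")
insert_right("k1", "x")
insert_right("k1", "y")
insert_left("k2", "b")   # no right-side match yet: no output
print(join_output)       # [('k1', 'a', 'x'), ('k1', 'a', 'y')]
```

A batch system would instead re-scan both relations on each update, which is the latency gap the Naiad paper quantifies (seconds vs. tens of milliseconds).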
For example, here's the first para of the Related Work section from the Naiad paper:
> Dataflow Recent systems such as CIEL [30], Spark [42], Spark Streaming [43], and Optimus [19] extend acyclic batch dataflow [15, 18] to allow dynamic modification of the dataflow graph, and thus support iteration and incremental computation without adding cycles to the dataflow. By adopting a batch-computation model, these systems inherit powerful existing techniques including fault tolerance with parallel recovery; in exchange each requires centralized modifications to the dataflow graph, which introduce substantial overhead that Naiad avoids. For example, Spark Streaming can process incremental updates in around one second, while in Section 6 we show that Naiad can iterate and perform incremental updates in tens of milliseconds.
The example I was taught with was a booking system, where the inventory management system-of-record was separate from the search system. Search does not need 100% up-to-date inventory. A delay between the last item being booked and it being removed from the search results is acceptable. In fact, it has to be acceptable, because it can happen anyway. If someone books the last item after another hit the search button... There's nothing the system can do about that.
When actually committing a booking, however, then that must be atomically done within the inventory management system.
So, to bring it home, it's OK for the search system to be eventually consistent against bookings, and read bookings off of an event stream to update its internal tracking. However, the bookings themselves cannot be eventually consistent without risking a double-booking.
On one hand the ability to connect multiple microservices to a central message broker is convenient, but on the other hand this goes against the microservice philosophy of not sharing subcomponents (databases, etc). I wonder where the lines should be drawn.
It's all about framing and perspective, of course. But that's how I'd want to try and frame it from a system architecture point of view.
With enough framing everything is possible, and in some contexts it will even make sense.
Depending on the scenario I can see the point. If the microservices are all part of the larger overall solution, having a single cluster is perfectly fine. Using the same cluster for multiple "products" is a little like having one central database server for a number of different solutions. You can do it, but it potentially becomes a bottleneck, or a central point where your different solutions impact each other's performance.
you either don't allow microservices to consume from others' topics, or you publish event schemas so they can still iterate independently.
the move to a 2nd kafka cluster in my experience has always been driven by isolation and fault tolerance concerns, not scalability.
If ACID is a prerequisite, then a lot of things won't qualify as databases - none of Mongo, Cassandra, ElasticSearch, etc. Not even many data warehouses.
Predictably, this setup ran into an interesting assortment of issues. There were no real transactions, no ensured consistency, and no referential integrity. There was also no authentication or authorization, because a default-configured deployment of Kafka from Confluent happily neglects such trivial details.
To say this was a vast mess would be to put it lightly. It was a nightmare to code against once you left the fantasy world of functional programming nirvana and encountered real requirements. It meant pushing a whole series of concerns that isolation addresses into application code... or not addressing them at all. Teams routinely relied on one another's internal kafka streams. It was a GDPR nightmare.
Kafka Connect was deployed to bridge between Kafka and some real databases. This was its own mess.
Kafka, I have learned, is a very powerful tool. And like all shiny new tools, deeply prone to misuse.
Expecting one person to make 100% correct decisions all the time is expecting too much. People go down rabbit holes and come away with weird takeaways, like "replace all the databases with queues".
In this particular company, the Chief Architect in theory had a group around him. They did nothing to check his poor decisions, and from the outside seemed primarily interested in living in the functional programming nirvana he promised them.
The idea on the face of it is not per se a bad one, quite interesting, but the implementations are perhaps not there yet to back such an idea.
It's important to know as an architect when your vision for the architecture is outpacing reality, to know your time horizon, and match the vision with the tools that help you implement the actual use cases you have in your hand right now.
It sounds like this person might have had an interesting idea but not a working system. In another light, this could have been a good idea if all the technology was in place to support it.. but the timing and implementation doesn't sound like it was right, perhaps.
The old saying "use the right tool for the job" comes to mind, but that can be hard to see when the tools are changing so fast, and there is a risk to going too far onto the bleeding edge. Perhaps the saying should have been, "use the rightest tool you can find at the time, that gives you some room to grow, for the job"...
Several of these points fell apart when credit card handling and PCI-DSS entered the picture.
Which is the right way to do it, because transactions don't extend into the real world. If you need to wait for the consequences of a given event, wait for the consequences of that event. Otherwise, all you really care about is all events happening in a consistent order. It's a much more practical consistency model.
> and no referential integrity
The problem with enforcing referential integrity is how you handle violations of it. Usually you don't really want to outright reject something because it refers to something else that doesn't exist yet, so you end up solving the same problem either way.
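A toy sketch of the usual workaround (structure invented): park events whose parent hasn't arrived yet instead of rejecting them, then replay them once it does.

```python
known_parents = set()  # parents we've seen so far
parked = {}            # parent_id -> events waiting for that parent
applied = []

def on_event(parent_id, event):
    if parent_id in known_parents:
        applied.append(event)
    else:
        # Don't outright reject: the parent may just be late.
        parked.setdefault(parent_id, []).append(event)

def on_parent_created(parent_id):
    known_parents.add(parent_id)
    # Replay any events that arrived before their parent.
    for event in parked.pop(parent_id, []):
        applied.append(event)

on_event("user-1", "order-A")  # parent unknown: parked, not rejected
on_parent_created("user-1")
on_event("user-1", "order-B")
print(applied)  # ['order-A', 'order-B']
```

Which is the point above: whether the database rejects the violation or you park and replay, somebody has to handle the out-of-order case.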
> There was also no authentication or authorization, because a default-configured deployment of Kafka from Confluent happily neglects such trivial details.
Pretty common in the database world - both MySQL and PostgreSQL use plaintext protocols by default. Properly configured Kafka uses TLS and/or SASL, has a good ACL system, and is as secure as anything else.
> It was a nightmare to code against once you left the fantasy world of functional programming nirvana and encountered real requirements. It meant pushing a whole series of concerns that isolation addresses into application code... or not addressing them at all.
My experience is just the opposite - ACID isolation sounds great until you actually use it in the real world, and then you find it doesn't address your problems and doesn't give you enough control to fix it yourself. It's like when you use one of those magical do-everything frameworks - it works great until you need to customise something slightly, then it's a nightmare. Kafka pushes more of the work onto you upfront - you have to understand your dataflow and design it explicitly - but that pays off immensely.
> It was a GDPR nightmare.
Really? I've found the exact opposite - teams that used an RDBMS had to throw away their customer data under GDPR, because even though they had an entry in their database saying that the customer had agreed, they couldn't tell you what the customer had agreed to or when. Whereas teams using Kafka in the way you describe had an event record for the original agreement, and could tell you where any given piece of data came from.
This is absolutely wonderful! Unfortunately, this team decided to store data subject to GDPR deletion requests in Kafka, where deletion is quite difficult. It was a problem, when trying to do deletion programmatically, across many teams using the same set of topics.
The real nightmare came when this team, obsessed with the power of infinite retention periods, encountered PCI-DSS. You see, the business wanted to move away from Stripe and similar to dealing with a processor directly, in order to save on transaction fees. So obviously they could just put credit card data into Kafka...
Updating balances using an RDBMS is like managing your finances with pencils and erasers. Unless you somehow ban the UPDATE statement.
Updating balances with Kafka is like working in pen. You can't[1] change the ledger lines, you can only add corrections after the fact.
[1] Yes, Kafka records can be updated/deleted depending on configuration. But when your codebase is written around append-only operations, in-place mutations are hard and corrections are easy, so your fellow programmers fall into the 'pit of success'.
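A minimal sketch of the "pen" model: the ledger is append-only, the balance is a fold over it, and a mistake gets a correcting entry rather than an in-place edit.

```python
ledger = []  # append-only list of (account, delta) entries

def append_entry(account, delta):
    ledger.append((account, delta))

def balance(account):
    # The balance is derived by folding over the ledger, never stored.
    return sum(delta for acct, delta in ledger if acct == account)

append_entry("acct-1", 100)
append_entry("acct-1", -30)
append_entry("acct-1", 30)  # correction: the -30 was a mistake
print(balance("acct-1"))    # 100
print(len(ledger))          # 3: the history of the mistake is preserved
```

That preserved history is the 'pit of success' mentioned above: the easy operation (append) is also the auditable one.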
The original sin of Kafka is that it begins with only memory.
To me the right middle way is relational temporal tables[0]. You get both the memoryless query/update convenience and the ability to travel through time.
[0] SQL:2011 introduced temporal data, but in a slightly janky just-so-happens-to-fit-Oracle's-preference kind of way.
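A minimal sketch of the temporal-table idea using sqlite3 (schema invented for the example, not the SQL:2011 syntax): an update closes the current row and inserts a new one, so you keep both the convenient "current value" query and the ability to travel through time.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE balance_history (
    account TEXT, amount INTEGER,
    valid_from INTEGER, valid_to INTEGER)""")  # valid_to NULL = current row

def set_balance(account, amount, now):
    # Close out the currently valid row, then insert the new one.
    db.execute("UPDATE balance_history SET valid_to = ? "
               "WHERE account = ? AND valid_to IS NULL", (now, account))
    db.execute("INSERT INTO balance_history VALUES (?, ?, ?, NULL)",
               (account, amount, now))

def balance_as_of(account, t):
    # Time travel: find the row whose validity interval contains t.
    row = db.execute(
        "SELECT amount FROM balance_history WHERE account = ? "
        "AND valid_from <= ? AND (valid_to IS NULL OR valid_to > ?)",
        (account, t, t)).fetchone()
    return row[0] if row else None

set_balance("acct-1", 100, 1)
set_balance("acct-1", 70, 5)
print(balance_as_of("acct-1", 3))  # 100 (time travel)
print(balance_as_of("acct-1", 9))  # 70  (current)
```

The nice property is that the write path still looks like an update to callers, while the log-like history accumulates underneath.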
I'm not selling any particular tool, but Amazon's QLDB is an interesting example of a database built on a cryptographically verifiable, append-only ledger. I am interested to see how things like Kafka and this might come together somehow.
In my opinion, we have evolved to a point where storage is not a concern for temporal use cases - i.e. we can now store every change in an immutable fashion. When you think about this being an append only transaction log that you never have to purge, and you make observable events on that log (which is what most CDC systems do)... yeah it works. Now you have every change, cryptographically secure, with events to trigger downstream consumers and you can really rethink the whole architecture of monolithic databases vs. data platforms.
It is an exciting time we are in, in my opinion.
Just my $0.02.
1) Write an event recording a _desire_ to checkout.
2) Build a view of checkout decisions, which compares requests against inventory levels and produces checkout _results_. This is a stateful stream/stream join.
3) Read out the checkout decision to respond to the user, or send them an email, or whatever.
CDC is great and all, too, but there are architectures where ^ makes more sense than sticking a database in front.
Admittedly, working up highly available, stateful stream-stream joins that aren't challenging to operate in production is... hard, but getting better.
Money quote: "Event-sourced architectures like these suffer many such isolation anomalies, which constantly gaslight users with “time travel” behavior that we’re all familiar with."
> The fundamental problem with using Kafka as your primary data store is it provides no isolation.
This is false. I can only assume the author doesn't know about the Kafka transactions feature?
To be specific, Kafka's transaction machinery offers read-committed isolation, and you get read-uncommitted by default if you don't opt-in to use that transaction machinery (the docs: https://kafka.apache.org/0110/javadoc/index.html?org/apache/...). Depending on your workload, read-committed might be sufficient for correctness, in which case you can absolutely use Kafka as your database.
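The difference between the two isolation levels can be shown with a toy model of a partition log containing transactional records and commit markers: a read-committed consumer only sees records from transactions that committed, while read-uncommitted sees everything. This is a simplified sketch (it ignores offsets and the last-stable-offset mechanics), not Kafka's actual protocol.

```python
def visible_records(log, isolation_level="read_uncommitted"):
    """Return record values a consumer would see under each isolation level.

    `log` is a list of dicts: {"type": "record", "txn": ..., "value": ...}
    or {"type": "commit", "txn": ...}. Shapes are made up for illustration.
    """
    if isolation_level == "read_uncommitted":
        # Default behaviour: every produced record is visible immediately.
        return [r["value"] for r in log if r["type"] == "record"]
    # read_committed: only records whose transaction has a commit marker.
    committed = {r["txn"] for r in log if r["type"] == "commit"}
    return [r["value"] for r in log
            if r["type"] == "record" and r["txn"] in committed]
```

An open or aborted transaction's records simply never surface to a read-committed consumer, which is the guarantee the parent comment is referring to.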
Of course, proving that your application is sound with just read-committed isolation can be challenging, not to mention testing that it continues to be sound as new features are added.
Because of that, in general I think that the underlying point of this article is probably correct, in that you probably shouldn't use Kafka as your database -- but for certain applications / use-cases it's a completely valid system design choice.
More generally this is an area that many applications get wrong by using the wrong isolation levels, because most frameworks encourage incorrect implementations by their unsafe defaults; e.g. see the classic "Feral concurrency control" paper http://www.bailis.org/papers/feral-sigmod2015.pdf. So I think the general message of "don't use Kafka as your DB unless you know enough about consistency to convince yourself that read-committed isolation is and will always be sufficient for your usecase" would be more appropriate (though it's certainly a less snappy title).
https://www.postgresql.org/docs/9.5/transaction-iso.html
If you're arguing that in practice this isn't enough isolation, then sure, that's what I said in my post; most applications need more than the default isolation levels. I feel like you're making an absolutist point (just like the original article) where my point was that the domain is actually more nuanced, and absolutes just obscure the technical complexity.
And you'll have exactly the same problem if you're using a traditional ACID database: the user saw the item as being available, clicked buy, but it was unavailable by the time they went to get it. Using an ACID database doesn't gain you anything; you might as well just use Kafka for everything.
The user having stale data in their browser and finding an item has already been purchased is very different from the database, within a transaction, allowing an item which has already been purchased to be purchased again.
This is like "the operation was a success, but the patient died". What matters is the user-facing behaviour of your whole system; a transaction that can't actually cover the parts the user cares about is pointless.
If you open an ACID transaction when the user adds something to the cart and don't close it until they check out, you'll find your database gets locked up pretty quickly. So you can't actually use the ACID transactions to implement the behaviour you want - you have to implement some kind of reserve/commit semantics in userspace, whether you're using an ACID database or not.
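A minimal sketch of those userspace reserve/commit semantics: stock is reserved with an expiry instead of holding a transaction open while the user shops, and an abandoned reservation returns the stock to the pool. All names and the TTL are illustrative assumptions.

```python
import time

class Inventory:
    """Sketch of reserve/commit semantics implemented in application code."""

    def __init__(self, stock):
        self.stock = stock
        self.reservations = {}   # reservation_id -> (qty, expires_at)
        self._next_id = 0

    def reserve(self, qty, ttl_seconds=600):
        self._expire(time.monotonic())
        if self.stock < qty:
            return None          # someone else got there first
        self.stock -= qty
        self._next_id += 1
        self.reservations[self._next_id] = (qty, time.monotonic() + ttl_seconds)
        return self._next_id

    def commit(self, reservation_id):
        # Reserved stock becomes a real sale; nothing returns to the pool.
        return self.reservations.pop(reservation_id, None) is not None

    def _expire(self, now):
        for rid, (qty, expires_at) in list(self.reservations.items()):
            if now >= expires_at:
                del self.reservations[rid]
                self.stock += qty   # abandoned cart: stock goes back
```

Nothing here needs a long-lived database transaction, which is the point: the same logic works whether the state lives in Postgres or in a Kafka-backed view.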
[0] https://www.oreilly.com/library/view/strata-hadoop/978149194...
Eventually, my buffer ran out of memory and I couldn't write anything else to it, and it was dropping lots of messages. I was bummed. Is there a way to avoid this in Kafka?
RabbitMQ is the most widely used open source implementation of the AMQP protocol. It is slower but can support complex routing scenarios internally and handle situations where at-least-once-delivery guarantees are important. RabbitMQ supports on-disk persistent queues, which you can tune if you like. Compared to Kafka, RabbitMQ is slower in terms of the volume that can be managed per queue.
Kafka is fast because it is horizontally scalable, with parallel producers and consumers per topic. You can tune the speed and move the needle between consistency and availability where you need to. However, if you want things like at-least-once delivery, Kafka gives you the building blocks, but ultimately you'll have to handle it on the application side.
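"Handle it on the application side" usually means making the consumer idempotent, since at-least-once delivery implies the broker may redeliver a message. A common pattern is deduplicating on a message key the producer attaches; this sketch keeps the seen-key set in memory, where a real system would persist it alongside the processing results.

```python
def consume_idempotently(messages, handler):
    """Apply `handler` at most once per message id, tolerating redelivery.

    `messages` and the "id" field are illustrative shapes, not a real
    client API.
    """
    seen = set()
    for msg in messages:
        if msg["id"] in seen:
            continue            # duplicate redelivery: skip side effects
        handler(msg)
        seen.add(msg["id"])     # record the id only after handler succeeds
    return seen
```

Recording the id only after the handler succeeds means a crash mid-message leads to a retry, not a lost message - which is the at-least-once contract.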
Regarding storage, by default Kafka retains data for 7 days. IIRC the NY Times stores articles from 1970 onwards on Kafka clusters. The storage is horizontally scalable and durable. This is a common use case. As many have pointed out, the cluster setup depends highly on your needs. We store data for 7 days in Kafka as well, and it's on the order of 500GB or more per node.
Looks like you have a configuration issue. You can configure RabbitMQ to store queues on disk, and with a quick calculation you can make sure you have enough space for 10 or 150 hours of data. I don't see any reason to switch to Kafka, a different tool with different characteristics, just because you need more storage.
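That "quick calculation" is just throughput times retention, plus some headroom. The numbers and the 1.3x overhead factor below are made-up examples for illustration, not measured values.

```python
def required_disk_gb(msgs_per_sec, avg_msg_bytes, hours, overhead=1.3):
    """Back-of-the-envelope sizing for an on-disk queue.

    `overhead` loosely accounts for per-message metadata and index
    structures (assumed value, measure for your own broker).
    """
    raw_bytes = msgs_per_sec * avg_msg_bytes * hours * 3600
    return raw_bytes * overhead / 1e9
```

For example, 1,000 msgs/sec at ~1 KB each over 10 hours comes out to roughly 47 GB, well within a single node's disk.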
Not a relational one. And should not replace a typical CRUD OLTP DB.
But it sure seems like a NoSQL DB to me.
For example: try to retrieve a list of all 10,000 movies you stored in an Elasticsearch index. You will get the first 100 results easily, but as you scroll through the results, you will notice that Elasticsearch becomes very slow.
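The slowdown comes from how offset-based paging works: to serve page N, the engine has to collect and sort the top (offset + size) hits, then discard everything before the offset, so the work grows with how deep you page. This toy function mimics that cost model (Elasticsearch's `scroll` and `search_after` mechanisms exist precisely to avoid it).

```python
def page(docs, offset, size, key):
    """Naive offset/size paging: materialize the top offset+size hits,
    then throw most of them away. Purely an illustration of the cost."""
    candidates = sorted(docs, key=key)[: offset + size]  # work grows with offset
    return candidates[offset : offset + size]
```

At offset 9,900 the engine is sorting and buffering ~10,000 candidates per shard just to return 100 of them, which is why deep scrolling with from/size crawls.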
It is not optimized for that use case. Otherwise it would be called ElasticDB.
Martin Kleppmann | Kafka Summit SF 2018 Keynote (Is Kafka a Database?) [1]
Has anyone successfully implemented an actor model framework on top of Kafka?
Interested in learning about others' experiences.
Other things that are not a database: Apache Traffic Server, Apache Mahout, Apache Jakarta, Apache ActiveMQ... hundreds of these exist.