This makes it possible to easily update your event processing logic or add new components to the mix while maintaining a clean architecture.
I gave a talk about this at PostgresConf US a few weeks ago. The talk isn't available as an article yet, but you can get the slides from https://aiven.io/blog/aiven-talks-in-postgresconf-us-2018/ if you're interested.
The concept of "stream-table duality" tells us a table can be thought of as a materialized view over a stream of change-data operations (INSERT/UPDATE/DELETE). Kafka can be used as a buffer for streaming data that can be materialized into a relational table at any time.
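To make the duality concrete, here's a minimal sketch (names and event format are made up for illustration) that folds a change stream into a table:

```python
# A table as a materialized view of a change stream: replaying the stream
# from the beginning reproduces the table's current state.

def materialize(change_stream):
    """Fold a stream of (op, key, value) events into a table (a dict here)."""
    table = {}
    for op, key, value in change_stream:
        if op in ("INSERT", "UPDATE"):
            table[key] = value
        elif op == "DELETE":
            table.pop(key, None)
    return table

events = [
    ("INSERT", "user:1", {"name": "Ada"}),
    ("INSERT", "user:2", {"name": "Grace"}),
    ("UPDATE", "user:1", {"name": "Ada L."}),
    ("DELETE", "user:2", None),
]

print(materialize(events))  # {'user:1': {'name': 'Ada L.'}}
```

The stream is the source of truth; the table is just a cache of its latest state.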
One of the more interesting use-cases is multi-target replication: feed your change-data-capture data into Kafka, and replay it on any other backend data store (SQL, Graph Database, NoSQL, etc.)[1]
Conceptually this lets you ingest data into a stream, write it to multiple backends, and keep everything in sync. Martin Kleppmann has written a simple (PoC-quality) tool for doing this with Postgres databases.
[1] https://www.confluent.io/blog/bottled-water-real-time-integr...
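The fan-out idea is simple enough to sketch; the `KeyValueStore` class below is a stand-in for a real backend client, not an actual API:

```python
# Multi-target replication sketch: replay one change stream against several
# backend "stores" so they all converge to the same state.

class KeyValueStore:
    """Stand-in for any backend (SQL, NoSQL, graph, ...)."""
    def __init__(self):
        self.data = {}

    def apply(self, op, key, value):
        if op == "DELETE":
            self.data.pop(key, None)
        else:  # INSERT or UPDATE
            self.data[key] = value

def replicate(change_stream, backends):
    for op, key, value in change_stream:
        for backend in backends:
            backend.apply(op, key, value)

sql_like, nosql_like = KeyValueStore(), KeyValueStore()
events = [("INSERT", "k1", "v1"), ("UPDATE", "k1", "v2"), ("INSERT", "k2", "x")]
replicate(events, [sql_like, nosql_like])
assert sql_like.data == nosql_like.data == {"k1": "v2", "k2": "x"}
```

Because the stream is replayable, you can also bootstrap a brand-new backend later by replaying from offset zero.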
Consumers read from the topic (the underlying partitions) and maintain the last-read offset themselves, allowing for easy replay, strict ordering within a single partition, and pacing completely decoupled from producers. The optional "key" on each message allows for compaction, so only the latest value for each key is kept in the topic.
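A rough sketch of what compaction does to a log, assuming `None` values act as delete markers (tombstones), which Kafka eventually removes as well:

```python
# Log compaction sketch: keep only the latest value per key.

def compact(log, drop_tombstones=True):
    latest = {}
    for key, value in log:
        latest.pop(key, None)   # re-insert so the key sits at its latest position
        latest[key] = value
    if drop_tombstones:         # a None value is a delete marker for that key
        latest = {k: v for k, v in latest.items() if v is not None}
    return list(latest.items())

log = [("a", 1), ("b", 2), ("a", 3), ("b", None), ("c", 4)]
print(compact(log))  # [('a', 3), ('c', 4)]
```

Compacted topics are what make the "table materialized from a stream" pattern cheap: you only replay the latest state per key, not the full history.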
It's definitely not a database but works well for replicating them, as well as doing most of the work of a typical message queue / service bus system.
We have been hearing more and more about people using Kafka to support streaming analytics. I haven't spent a ton of time studying it, and I was under the impression that it was just a giant queue, but ordering + retention is basically a database in the way I think about it.
Also, consumers can join a "consumer group", so you can have multiple groups of consumers, with each group reading the entire log independently while the partitions are divided among the members within a group.
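Partition assignment within a group can be sketched like this; the round-robin scheme here is illustrative (Kafka's actual assignor strategies are configurable and more involved):

```python
# Consumer-group sketch: every group sees all partitions, but within one
# group each partition is handled by exactly one member.

def assign_partitions(partitions, members):
    assignment = {m: [] for m in members}
    for i, p in enumerate(partitions):
        assignment[members[i % len(members)]].append(p)
    return assignment

partitions = [0, 1, 2, 3, 4, 5]
print(assign_partitions(partitions, ["c1", "c2", "c3"]))
# {'c1': [0, 3], 'c2': [1, 4], 'c3': [2, 5]}
```

Two different groups would each get their own full assignment over all six partitions, which is how separate "clusters" of consumers each read the whole log.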
Also I'd recommend looking at Apache Pulsar for a next-generation architecture that combines Kafka's log semantics with the low-latency routing and individual message acknowledgements of message queues: https://pulsar.incubator.apache.org/
If Kafka supported data processing (queries or whatever) then it would be much closer to a database. Also, databases are normally aware of the structure of the data (for example, columns).
Therefore, Kafka can hardly be viewed as a DBMS because it explicitly separates two major concerns:
* data management - how to represent data (Kafka)
* data processing - how to derive/infer new data (Kafka Streams - a separate library)
Theoretically, if they could combine these two layers of functionality in one system then it would be a database.
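A toy illustration of that separation (all names are made up for the example): a bare log that only stores records and knows nothing about their structure, and a separate processing function that derives new data from it.

```python
# Data management vs. data processing as two separate layers.

class Log:
    """Data management: append-only storage, no columns, no queries."""
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1   # the record's offset

def word_count(log):
    """Data processing: derive a table (word -> count) from the log.
    This layer, not the log, understands the records' structure."""
    counts = {}
    for line in log.records:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

log = Log()
log.append("the log is the database")
log.append("the database is a log")
print(word_count(log))
```

In Kafka terms, `Log` plays the broker's role and `word_count` plays the Kafka Streams role; a DBMS fuses both layers behind one query interface.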
https://engineering.linkedin.com/distributed-systems/log-wha...