Streaming Search on Tweets: Storm, Elasticsearch, and Redis

(insightdataengineering.com)

59 points · dallas-stuart · 10y ago · 7 comments

atonse · 10y ago

Question for people in the know: as I read more about Erlang and Elixir, I'm starting to wonder how it holds up in such an architecture. Would it have similar performance characteristics?

Would something like spark/storm/etc be simpler to implement on top of Elixir? Possibly more performant? Or make better use of CPU cores?

thibaut_barrere · 10y ago

I'm currently working on implementing all types of ETL & data extraction processes with Elixir, and so far I must say I like it (although I cannot share numbers at the moment, just yet).

My own impression (as the author of http://www.kiba-etl.org, which is a Ruby ETL framework) is that Elixir is going to be a great fit for high-quality, decent-throughput data processing of all kinds (streaming, batching, or others such as websocket-api-endpoint types).

The immutable data + lightweight processes + ability to go distributed + concise code (testable, composable etc) is truly appealing in my opinion.

I will share more stuff online (at http://thibautbarrere.com) when I can, but for now I can only say this is very promising so far.

atonse · 10y ago

Great, thanks for your reply!

lovelearning · 10y ago

I didn't understand why the author used Redis for the final delivery when a Kafka cluster was already available. Is there a reason to use Redis here?

rwalk333 · 10y ago

The author here. I used Redis because I needed a simple way to have users subscribe to and follow many different search queries. In Kafka, this would mean maintaining a very large number of lightly populated topics. As I understand it, that isn't a good way to use Kafka.
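To make the "many lightly populated channels" pattern concrete, here is a minimal in-memory sketch of one channel per search query with fan-out to subscribers. It is an illustration only, not the author's code: the `QueryChannels` class, the `query:` channel prefix, and the user names are all hypothetical, and a real deployment would presumably map `subscribe` and `publish` onto Redis pub/sub rather than Python dictionaries.

```python
from collections import defaultdict


class QueryChannels:
    """In-memory stand-in for per-query pub/sub channels (hypothetical helper)."""

    def __init__(self):
        self._subscribers = defaultdict(set)  # channel name -> user ids
        self.inboxes = defaultdict(list)      # user id -> delivered (channel, tweet)

    @staticmethod
    def channel_for(query):
        # Normalize case and whitespace so equivalent queries share one channel.
        return "query:" + " ".join(query.lower().split())

    def subscribe(self, user, query):
        self._subscribers[self.channel_for(query)].add(user)

    def unsubscribe(self, user, query):
        channel = self.channel_for(query)
        self._subscribers[channel].discard(user)
        if not self._subscribers[channel]:
            # Channels are cheap: drop them as soon as nobody is listening.
            del self._subscribers[channel]

    def active_channels(self):
        return sorted(self._subscribers)

    def publish(self, query, tweet):
        """Deliver a matching tweet to every subscriber; return the receiver count."""
        channel = self.channel_for(query)
        receivers = self._subscribers.get(channel, ())
        for user in receivers:
            self.inboxes[user].append((channel, tweet))
        return len(receivers)


hub = QueryChannels()
hub.subscribe("alice", "storm redis")
hub.subscribe("bob", "Storm  Redis")   # same channel after normalization
hub.subscribe("bob", "elasticsearch")
delivered = hub.publish("storm redis", "tweet 1")
```

Each distinct query gets its own short-lived channel that disappears when the last subscriber leaves, which is the usage the comment above describes as a poor match for Kafka topics.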

latenightcoding · 10y ago

I had the same doubt, thanks for clarifying.

otoolep · 10y ago

Thanks for sharing.

>In the streaming case, documents arrive at a very fast rate (e.g. average of 6000 per second in the case of Twitter) and with this kind of velocity and volume it is impractical to build the inverted document index in real-time.

Actually, it's completely practical. It's expensive, but it can be done. When I was at Loggly we built a big ES cluster, pumped tens of thousands of log messages into it per second, and served queries. Ingest-to-query latency was on the order of seconds, and it still runs today. The key, of course, is not to build a single inverted index.

I don't want to underestimate the work, but it is practical. And expensive.

http://www.slideshare.net/AmazonWebServices/infrastructure-a...
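One way to read "not a single inverted index" is sharding: each incoming document is indexed into one of several small indexes, and a query fans out to all of them and merges the results. The sketch below illustrates that idea under stated assumptions. It is not Loggly's design or Elasticsearch's implementation; the `ShardedIndex` class, the round-robin routing, and the whitespace tokenizer are all hypothetical simplifications.

```python
from collections import defaultdict


class ShardedIndex:
    """Toy inverted index split across several shards (hypothetical sketch)."""

    def __init__(self, num_shards=4):
        # Each shard is its own small inverted index: term -> set of doc ids.
        self.shards = [defaultdict(set) for _ in range(num_shards)]
        self._next = 0

    def add(self, doc_id, text):
        # Round-robin routing: a write touches exactly one shard, so shards
        # could be built in parallel on separate workers or machines.
        shard = self.shards[self._next % len(self.shards)]
        self._next += 1
        for term in set(text.lower().split()):
            shard[term].add(doc_id)

    def search(self, term):
        # Scatter the query to every shard, then gather and merge the hits.
        term = term.lower()
        hits = set()
        for shard in self.shards:
            hits |= shard.get(term, set())
        return sorted(hits)


index = ShardedIndex(num_shards=2)
index.add(1, "storm is fast")
index.add(2, "Redis and Storm")
index.add(3, "kafka topics")
storm_hits = index.search("storm")
```

Writes stay cheap because each document updates only one small index, while reads pay the cost of querying every shard, which fits the "practical but expensive" trade-off described above.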
