Haskell improves log processing 4x over Python

(devblog.bu.mp)

114 points · jmintz · 15y ago · 41 comments

andrewcooke 15y ago

The work sounds very cool (and they are hiring), but (only) a factor of 4 speedup over Python is (to repeat a phrase from elsewhere today) like boasting that you're the tallest midget ;o)

jamwt 15y ago

Hi, article author here.

It's important to note that this particular job is largely bound on a.) I/O and b.) format serialization tasks. Both Python's BSON and JSON libraries are mature and have their critical sections written in C, so a speedup of 4x is still noteworthy. The Haskell version, on the other hand, is pure Haskell.

andrewcooke 15y ago

Neat - thanks.

jbellis 15y ago

Agreed. Even where you can optimize the hot code in C, Python is no speed demon. Cassandra's Java stress test can push out about 10x as many ops/s as the Python one, even though the Thrift C extension for Python is quite good.

/still a Python fan

jamwt 15y ago

Yeah, Haskell is roughly as fast as Java. I imagine, with the tuning I allude to in the blog post, we could restore about a 10x improvement, even on a single core. After all, 10x was about what we had with raw Redis ops/s before the serialization libraries got involved.

/also still a python fan :-)

Peaker 15y ago

Sounds great. I'm a very big Haskell fan.

I'd love to point people to this when trying to convey some advantages of Haskell. To make it more compelling, can you expand some on the downsides and maybe obstacles you encountered?

The thing I'm unsure about is how difficult it would be for (very) talented developers to just jump in. We have really talented developers, and everyone is super time-constrained, so many are wary of diving into a language as different as Haskell. Was it hard for your developers to figure Haskell out? Did your previous use of Scala help? How long did it take them to dive into Scala?

jamwt 15y ago

I would say the two real barriers to writing effective Haskell projects are a.) "getting" monads, and b.) understanding the implications of laziness, especially with regard to space leaks and unconsumed thunks. Everything else isn't that big of a deal.

It's all much easier to digest, though, even for "really talented developers", if they have some experience with another functional language first. OCaml is a nice stepping stone before digging into the abstractions involved in understanding Haskell's powerful type system. Scala is good too, but having the object stuff mixed in there can lead you to rely on some patterns that aren't going to be available in a non-OOP language. I think the Scheme/Clojure path isn't bad either, but it's probably ideal to spend some time in the "statically typed" wing of the functional universe before going to Haskell.
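For readers who have not run into "unconsumed thunks" before, here is a rough analogy in Python rather than Haskell. Python is strict, so the laziness is emulated with closures; `lazy_sum` and `strict_sum` are hypothetical names made up for this sketch, not anything from the blog post. The lazy version defers every addition, so the pending work grows with the input until something finally forces it.

```python
def lazy_sum(numbers):
    """Return a thunk (zero-argument function) that computes the sum when called."""
    total = lambda: 0
    for n in numbers:
        # Wrap the previous thunk instead of evaluating it: nothing is added yet.
        total = (lambda prev, value: lambda: prev() + value)(total, n)
    return total


def strict_sum(numbers):
    """Evaluate each addition immediately, so only one running total is kept."""
    total = 0
    for n in numbers:
        total += n
    return total


small = lazy_sum(range(100))       # 100 pending additions, none performed yet
print(small())                     # forcing the chain performs them all

try:
    lazy_sum(range(10_000))()      # forcing a long chain nests one call per element
except RecursionError:
    print("thunk chain too deep to force")

print(strict_sum(range(10_000)))   # the strict loop never builds such a chain
```

The Haskell counterpart of this contrast is usually described as a lazy left fold (`foldl`) versus its strict variant (`foldl'`); the Python above only imitates the shape of the problem.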

samstokes 15y ago

Could you say more about why "getting" monads was needed?

I came to Haskell with no understanding of monads, started writing code, and eventually used my knowledge of Haskell to learn about monads. Not understanding monads just meant I was lacking a useful design pattern, and found certain API docs confusing, but it didn't stop me from writing reasonable code in most circumstances.

On the other hand what you describe in your (awesome) blog post is a more significant Haskell project than any I've worked on, so I'd be interested to hear your experience.

I've not really written my own monad, or properly looked into monad transformer stacks, and I'm aware that I could probably clean up a lot of code using them - is that the sort of thing you mean?

grav1tas 15y ago

I agree. I dove into Haskell without any of that background, and it was like running into a brick wall. Persistence paid off in my case, but I do wonder how I would have handled it if I had spent time with OCaml beforehand.

microtonal 15y ago

From personal experience: I didn't make much progress in Haskell until I stopped using Scala. The problem is that Scala allows you to mix and match different paradigms and if you come from a mostly-imperative/OO background, you tend to use Scala as an OO language with some functional constructs.

To learn to program in a purely functional style, it's best to jump into Haskell cold turkey, since you will have to learn to think in FP.

When I was learning Haskell, optimization in a lazy world was the most difficult task. I often still have problems predicting how efficient particular code will be. The complexity of monads is somewhat overstated, though it doesn't help that some tutorials make something big and esoteric out of it. It is nothing more than a type class that specifies how to combine computations that result in some 'boxed value'.
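To make the "boxed value" description concrete, here is a minimal sketch in Python. Python has no type classes, so this hard-codes a single instance (an optional value, loosely modelled on Haskell's `Maybe`); the `Maybe` class, its `bind` method, and the `parse_int` and `reciprocal` helpers are all illustrative names invented for this example.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Maybe:
    """A box that either holds a value or is empty."""
    value: Any = None
    present: bool = False

    @staticmethod
    def just(value):
        return Maybe(value, True)

    @staticmethod
    def nothing():
        return Maybe()

    def bind(self, step):
        """Feed the boxed value to the next step, or stay empty if there is none."""
        return step(self.value) if self.present else self


def parse_int(text):
    """A computation that may fail: returns a boxed int or an empty box."""
    try:
        return Maybe.just(int(text))
    except ValueError:
        return Maybe.nothing()


def reciprocal(n):
    """Another computation that may fail, this time on zero."""
    return Maybe.nothing() if n == 0 else Maybe.just(1 / n)


print(parse_int("4").bind(reciprocal))     # both steps succeed
print(parse_int("0").bind(reciprocal))     # second step fails
print(parse_int("oops").bind(reciprocal))  # first step fails, second is skipped
```

The only "monadic" part is `bind`: it is the rule for chaining two computations that each return a box, which is roughly the role `>>=` plays in Haskell.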

Locke1689 15y ago

The author is mostly right about the use cases of Haskell, but simply "systems" is a bit misleading because there are certain performance characteristics of lazy programs which make them bad choices for some systems programs. Any type of real-time system, for example, can suffer unpredictable performance in critical sections, which is pretty undesirable.

dons 15y ago

Hard real time systems are probably the primary thing for which Haskell-as-is is directly unsuitable.

Haskell as an EDSL for generating hard real time, however, is very viable: http://corp.galois.com/blog/2010/9/22/copilot-a-dsl-for-moni...

awj 15y ago

Not to argue the example, but Python's garbage collection disqualifies it for real-time systems as well. In fact, I'm having a hard time finding a "system" task for which Python (as a language) is qualified but Haskell is not.

Locke1689 15y ago

Python is not a systems programming language.

jamwt 15y ago

While I agree with you that Haskell (or, really, any GC'd language) is unsuitable for real-time systems, I disagree that my statement about its excellent suitability for systems programming in general is misleading. There are many, many domains (read: most) that, in my experience, are called "systems programming" that have nothing to do with hard or soft real-time requirements.

Now, if I had stated that all conceivable systems programming domains are addressable with Haskell, that would have indeed been foolish.

Locke1689 15y ago

Hm, good point -- I agree.

ynniv 15y ago

Are the logs being read from disk? In my experience, Python is highly optimized for reading (possibly compressed) files from disk. If your infrastructure keeps logs in memory, Python will lose this advantage and compete on computational performance, where Haskell has the advantage. This is important for those of us who grind logs on disk and might be considering a language switch.

enneff 15y ago

What do you mean by optimized? Python makes the same read and write syscalls everyone else does.

What you're probably observing is Python's slow code generation being masked by the inherent slowness of I/O.

ynniv 15y ago

> Python makes the same read and write syscalls everyone else does

Except, when python's pants are on, it makes gold records.

I haven't looked to see if there are any explicit optimizations, but your statement is ridiculous; an effective IO strategy can have an enormous effect on performance.
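One way to see how both points above can hold at once: the underlying read call is the same, but how reads are batched changes how often it is made. The sketch below is an illustration only, not a measurement of how Python actually reads files; `CountingRaw` and `count_reads` are hypothetical helpers, and the call counter stands in for syscalls.

```python
import io


class CountingRaw(io.RawIOBase):
    """Unbuffered byte source that counts how often it is asked for data."""

    def __init__(self, data):
        super().__init__()
        self._source = io.BytesIO(data)
        self.calls = 0

    def readable(self):
        return True

    def readinto(self, buffer):
        self.calls += 1
        chunk = self._source.read(len(buffer))
        buffer[:len(chunk)] = chunk
        return len(chunk)


def count_reads(data, buffered):
    """Iterate over all lines and report (line count, underlying read calls)."""
    raw = CountingRaw(data)
    stream = io.BufferedReader(raw, buffer_size=64 * 1024) if buffered else raw
    lines = sum(1 for _ in stream)
    return lines, raw.calls


log = b"".join(b"event %d\n" % i for i in range(1000))
print("unbuffered:", count_reads(log, buffered=False))
print("buffered:  ", count_reads(log, buffered=True))
```

Both runs see the same lines, but the buffered reader asks the underlying source for data far less often, which is the kind of difference an I/O strategy makes.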

jamwt 15y ago

Nope, this is a process that BLPOPs logs from some Redis queue, does some processing on them, then writes them to disk.
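The shape of such a worker can be sketched roughly as follows. This is a guess at the general pattern, not the author's code: the function name `drain_queue`, the queue name, the JSON payload format, and the "processing" step are all assumptions. The only thing it expects from the client is a `blpop(keys, timeout)` method that returns a `(key, value)` pair or `None` on timeout, which is how redis-py's `blpop` behaves; `FakeQueue` is an in-memory stand-in so the loop can be tried without a Redis server.

```python
import io
import json


def drain_queue(client, queue_name, sink, timeout=1):
    """Pop entries until the queue stays empty for `timeout` seconds."""
    written = 0
    while True:
        item = client.blpop(queue_name, timeout=timeout)
        if item is None:            # timed out: nothing left to process
            break
        _key, raw = item
        event = json.loads(raw)     # assumed payload format: one JSON object
        event["processed"] = True   # placeholder for the real processing step
        sink.write(json.dumps(event, sort_keys=True) + "\n")
        written += 1
    return written


class FakeQueue:
    """In-memory stand-in exposing only a blpop method."""

    def __init__(self, items):
        self._items = list(items)

    def blpop(self, keys, timeout=0):
        if not self._items:
            return None
        return (keys, self._items.pop(0))


queue = FakeQueue(['{"id": 1}', '{"id": 2}'])
out = io.StringIO()
print(drain_queue(queue, "logs", out), "events written")
print(out.getvalue(), end="")
```

A long-running worker would presumably loop forever instead of stopping on the first timeout; the early exit here just keeps the example finite.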

kordless 15y ago

I'd be interested in hearing more about how the author is using the resulting data set. Doing extractions at event generation time can be very useful if you know what you are after in advance, but not so good for ad hoc analysis.

Any reason why you didn't use Hadoop for this, then run batch jobs to extract summaries?

jamwt 15y ago

Yeah, the whole pipeline is actually quite a bit more multifaceted than can be deduced from this summary. This stage actually just persists the events into a consolidated transaction log. Then, there are secondary processes that scan these transaction logs (in batch) and distribute data into various databases for system, business, and user analytics. I can't go into too much detail there, but the actual digesting and reporting side is more involved.

kordless 15y ago

I'd like to hear more about the use case if you have time, and can talk about it. I'm kordless at loggly dot com.

aristus 15y ago

Awesome work. If you haven't heard of Tim Bray's WideFinder challenge, it's worth a look; it was really interesting.

http://tartarus.org/james/diary/2008/06/17/widefinder-final-...
