
foobarian 3y ago

> while somebody comfortable in shell writes a parallelized one-liner, rips through GBs of data, and delivers the answer in 15 minutes.

This only works up to the point where those GBs turn into hundreds of GBs, or even PBs; beyond that, a proper distributed setup can return results in seconds.


henrydark 3y ago

I often find that downloading lots of data from S3 using `xargs aws sync`, and then running xargs over some crunching pipeline, is much faster than a 100-core Spark cluster.
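A minimal sketch of the "xargs on some crunching pipeline" half of that workflow, assuming the data has already been downloaded to a local directory. The function name `count_matches`, the `*.log` file pattern, and the line-counting task are all hypothetical stand-ins for whatever the real pipeline does; the S3 download step is left out on purpose.

```shell
#!/usr/bin/env bash

# Hypothetical helper: count lines matching PATTERN across every *.log file
# under DIR, running up to JOBS grep processes in parallel.
count_matches() {
  local dir="$1" pattern="$2" jobs="${3:-4}"

  # find emits NUL-separated paths so odd file names survive;
  # xargs -P fans them out to parallel grep processes, 16 files per process;
  # grep -c -h prints one match count per file, without file names;
  # awk sums the per-file counts into a single total.
  find "$dir" -type f -name '*.log' -print0 \
    | xargs -0 -P "$jobs" -n 16 grep -c -h -- "$pattern" \
    | awk '{ total += $1 } END { print total + 0 }'
}
```

Something like `count_matches ./data 'ERROR' 8` would then scan `./data` with eight workers. The useful knob is `-P`: set it near the number of cores when the crunching step is CPU-bound, and higher when it mostly waits on I/O.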

snidane 3y ago

That's a hardware management question. If you orchestrate 100 machines for it, the optimized binary used in my shell script still runs orders of magnitude faster and cheaper than any Hadoop, Spark, Beam, Snowflake, Redshift, BigQuery, or what have you.

That's not to say I'd do everything in shell. Most stuff fits well into SQL, but when it comes to optimizing processing at TB or PB scale, you won't beat shell + massive hw orchestration.

ekianjo 3y ago

Usually you use specific frameworks for that, not pure Python.

foobarian (OP) 3y ago

I suppose the Python side is a strawman then - who would do that for a small dataset that fits on a machine? Or have I been using shell for too long :-)

ekianjo 3y ago

I thought the above comment was about datasets that do not fit on one's machine?
