Ask HN: Tool for doing embarrassingly parallel processing on cluster
I'm looking for a tool that surely exists, but I can't seem to find it. I have thousands of small video files, and I'd like to process each one independently across a cluster. The processing can definitely benefit from a GPU for some parts of the workload (e.g. object detection); it is all PyTorch code. Ideally I'd like to use cloud compute for this (so EKS would be preferable, but I can also spin up machines and set up Slurm, etc. if need be).

I've looked into GNU Parallel, Dask, and PySpark, but I'm a bit frustrated that most examples assume you start with NumPy arrays or CSVs. Surely there exists a tool that works at the level of "files" and "scripts"? The closest match I've seen is Hadoop, but again, I'd want one where I can chain scripts together.
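To clarify what I mean by the "files and scripts" level, here's a minimal single-machine sketch of the pattern I want to scale out across a cluster. `process_video` and the file list are placeholders for the real per-file PyTorch pipeline:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def process_video(path: str) -> str:
    # Placeholder for the real per-file work (decode frames,
    # run object detection, write results somewhere).
    return f"processed {Path(path).name}"

def run_all(paths):
    # Embarrassingly parallel: each file is handled independently,
    # so no shuffles, joins, or shared state are needed.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(process_video, paths))

if __name__ == "__main__":
    print(run_all(["a.mp4", "b.mp4"]))
```

Essentially I want `pool.map` here to become "scheduler that fans each file out to a cluster node (with a GPU where needed) and runs my script on it", without first converting everything into dataframes or arrays.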