I understand Kubernetes has a fair number of frameworks for training and serving, but I assume it's not the best tool for running large-scale GPU clusters (at least not out of the box). Many cloud providers have started offering ultrascale but low-pod-density Kubernetes clusters for this. I also assume orchestrators like Slurm are still widely used for these kinds of jobs, and I remember OpenAI building their own orchestrator for training jobs.
I also assume spatial locality between servers and InfiniBand/RDMA matter a lot more than native Kubernetes support accounts for, and the server-health story must be completely different, since GPUs fail far more often and expose a lot of interesting metrics to monitor on top of standard OS metrics.
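To make the "interesting metrics" point concrete, here's a minimal sketch of the kind of node-health check I imagine an orchestrator's agent running. The `nvidia-smi` query fields are real ones from its `--query-gpu` interface, but the thresholds and the helper function are illustrative assumptions on my part, not vendor guidance:

```python
# Sketch: flag unhealthy GPUs from `nvidia-smi` CSV output.
# In production the input would come from something like:
#   nvidia-smi --query-gpu=index,temperature.gpu,\
#     ecc.errors.uncorrected.volatile.total,utilization.gpu \
#     --format=csv,noheader,nounits
# The thresholds below are made-up examples, not NVIDIA recommendations.

def parse_gpu_health(csv_text, max_temp_c=85, max_ecc_errors=0):
    """Return a list of (gpu_index, reasons) for GPUs failing the checks."""
    unhealthy = []
    for line in csv_text.strip().splitlines():
        idx, temp, ecc, util = [f.strip().rstrip(" %") for f in line.split(",")]
        reasons = []
        if int(temp) > max_temp_c:
            reasons.append(f"temp {temp}C > {max_temp_c}C")
        if int(ecc) > max_ecc_errors:
            reasons.append(f"{ecc} uncorrected ECC errors")
        if reasons:
            unhealthy.append((int(idx), reasons))
    return unhealthy

# Canned sample: GPU 0 looks fine, GPU 1 is hot and has ECC errors.
sample = "0, 64, 0, 98\n1, 91, 2, 97\n"
print(parse_gpu_health(sample))
# → [(1, ['temp 91C > 85C', '2 uncorrected ECC errors'])]
```

Real systems seem to use NVML (or DCGM) rather than shelling out and parsing CSV, but the shape of the check, plus the fact that uncorrectable ECC errors or thermal throttling can silently wreck a multi-week training job, is what makes this so different from ordinary server health monitoring.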
What are some articles or blogs to read in this space to come up to speed on how GPU/ML compute orchestration happens in the state of the art today?