I understand Kubernetes has a fair number of frameworks for training and serving, but I assume it's not the best tool for running large-scale GPU clusters (at least not out of the box). Many cloud providers have started offering ultrascale but low-pod-density Kubernetes clusters for this. I also assume orchestrators like Slurm are still widely used for these kinds of jobs, and I remember OpenAI building their own orchestrator for training jobs.
I also assume spatial locality between servers and InfiniBand/RDMA matter a lot more than native Kubernetes support accounts for, and the server-health story must be completely different, since GPUs fail far more often and expose a lot of interesting metrics to monitor on top of standard OS metrics.
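To make the "interesting metrics" point concrete, here's a minimal sketch of the kind of node-health check I imagine an orchestrator's agent running. The `nvidia-smi` query fields are real ones from its `--query-gpu` interface, but the thresholds and the helper function are illustrative assumptions on my part, not vendor guidance:

```python
# Sketch: flag unhealthy GPUs from `nvidia-smi` CSV output.
# In production the input would come from something like:
#   nvidia-smi --query-gpu=index,temperature.gpu,\
#     ecc.errors.uncorrected.volatile.total,utilization.gpu \
#     --format=csv,noheader,nounits
# The thresholds below are made-up examples, not NVIDIA recommendations.

def parse_gpu_health(csv_text, max_temp_c=85, max_ecc_errors=0):
    """Return a list of (gpu_index, reasons) for GPUs failing the checks."""
    unhealthy = []
    for line in csv_text.strip().splitlines():
        idx, temp, ecc, util = [f.strip().rstrip(" %") for f in line.split(",")]
        reasons = []
        if int(temp) > max_temp_c:
            reasons.append(f"temp {temp}C > {max_temp_c}C")
        if int(ecc) > max_ecc_errors:
            reasons.append(f"{ecc} uncorrected ECC errors")
        if reasons:
            unhealthy.append((int(idx), reasons))
    return unhealthy

# Canned sample: GPU 0 looks fine, GPU 1 is hot and has ECC errors.
sample = "0, 64, 0, 98\n1, 91, 2, 97\n"
print(parse_gpu_health(sample))
# → [(1, ['temp 91C > 85C', '2 uncorrected ECC errors'])]
```

Real systems seem to use NVML (or DCGM) rather than shelling out and parsing CSV, but the shape of the check, plus the fact that uncorrectable ECC errors or thermal throttling can silently wreck a multi-week training job, is what makes this so different from ordinary server health monitoring.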
What are some articles or blogs to read in this space to come up to speed on how GPU/ML compute orchestration happens in the state of the art today?