Replace YAML with a proper, featureful "general purpose" configuration language like Nix (or a typed take on it: Nickel). You could do Dhall or Jsonnet, but IMO Nix is the way to go and keeps gaining traction.
Abandon Dockerfiles, and let containers be described in the same general-purpose language (see https://nixos.wiki/wiki/NixOS_Containers ), which the cluster can build and handle the same way.
Have the cluster itself be described in the same configuration language.
Wrap it all in properly documented library APIs, so the most common use cases are a handful of lines of code, with the ability to drop down to lower-level details when needed. Hey, did I mention that we have a general-purpose configuration language that makes this possible?
In the end it should be possible to `git init .`, write 10 lines of Nix code (import the stdlib, import a one-node standard cluster, run a service with a hello-world application), call `deploy-this` with your cloud access key, and get a k8s cluster with a hello-world application behind TLS.
Then when you decide you need more nodes or more services, you just add new lines, or rewrite existing high-level calls into more detailed ones.
All of this currently requires gluing tons of things together (Docker, Terraform, kops, Helm, Kustomize, ArgoCD and what not), all using different programming languages, XMLs, JSONs, YAMLs, HCLs, Dockerfiles and so on.
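To make the idea concrete, the hypothetical ten-line file described above might look something like this. To be clear: no such `deploy-this` stdlib exists today; every name here (`mkCluster`, `buildImage`, `ingress.tls`) is invented for illustration.

```nix
# Entirely hypothetical API -- a sketch of what a one-file cluster
# definition could look like, not any existing library.
{ cluster, pkgs, ... }:

cluster.mkCluster {
  nodes = 1;                        # one standard node
  services.hello = {
    image = cluster.buildImage {    # container built by Nix, no Dockerfile
      entrypoint = "${pkgs.hello}/bin/hello";
    };
    ingress.tls = true;             # hello-world served behind TLS
  };
}
```

Scaling up would then mean overriding `nodes` or adding entries under `services`, rather than touching a second tool with a second syntax.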
Hard to take this seriously, since nix-lang is the largest (but not the only) roadblock to Nix adoption in my observation. Besides, it hardly provides any benefit other than an even more exotic syntax.
I have experience with two forms of deployment. The first was an on-prem installation two years ago, on which one of 13 Java 8 applications had very large latencies when accessing an Oracle DB, yet worked fine when deployed on a simple VM. All of those applications were built with the same DB logic, and we couldn't find the issue on our own, so we asked a third party to debug it for us. They couldn't pinpoint the problem, even with commercial tools; their answer was just that something was off with our Kubernetes installation, and that was it.
My second, ongoing experience is my current assignment with a Fortune 500 company that uses GKE to run hundreds of nodes, after migrating from on-prem VMs. Almost every other week (99% reliability - yeah right), some part of the system just dies and leaves services unreachable or unresponsive. There is a continuous effort to solve these issues, and even Google support was contacted, with the answer boiling down to: shit happens, deal with it. The only solution in those situations is either to have alarms go off so that Ops can restart something, or just to wait until everything comes back up again on its own.
The whole ecosystem was a good idea that lacks the tooling and stability to provide substantial benefits over a bunch of VMs, IMO.
The entire point of Kubernetes is redundancy through multiple stateless service instances that can and will be killed at any moment. If you take an application that doesn't work in such an environment, for instance a highly stateful application, that will cause pain. If you want a simple 'lift and shift' to the cloud, avoid Kubernetes.
Editing Helm charts is the ultimate form of throwing spaghetti at the wall and praying it works, without being able to actually test anything locally.
Example: https://github.com/dhall-lang/dhall-kubernetes
Haven’t used Dhall myself, but I’d definitely prefer a DSL on top of YAML.
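For a sense of what the linked dhall-kubernetes gives you, here's a minimal Deployment sketch paraphrased from memory of that repo's README. The record-completion syntax (`Deployment::{...}`) and field names should match the upstream schema, but check the repo for the exact versioned import URL before relying on this.

```dhall
-- Sketch based on the dhall-kubernetes README; pin the import to the
-- package version documented in the repo rather than master.
let kubernetes =
      https://raw.githubusercontent.com/dhall-lang/dhall-kubernetes/master/package.dhall

in  kubernetes.Deployment::{
    , metadata = kubernetes.ObjectMeta::{ name = Some "nginx" }
    , spec = Some kubernetes.DeploymentSpec::{
      , replicas = Some 2
      , selector = kubernetes.LabelSelector::{
        , matchLabels = Some (toMap { name = "nginx" })
        }
      , template = kubernetes.PodTemplateSpec::{
        , metadata = Some kubernetes.ObjectMeta::{ name = Some "nginx" }
        , spec = Some kubernetes.PodSpec::{
          , containers =
            [ kubernetes.Container::{
              , name = "nginx"
              , image = Some "nginx:1.15.3"
              }
            ]
          }
        }
      }
    }
```

The payoff over raw YAML is that the type checker rejects malformed manifests before anything reaches the cluster, and defaults come from the schema instead of copy-paste.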
YAML files do have a schema; editors can autocomplete against it.
I've heard that helm is painful. Never used it.
Inner peace arises from looking at all of the possibility and limiting yourself to what you actually need.
Constraints are the only way you will ever escape from complexity jail.
Trouble is that some understand this equation and don't want out. Getting this stuff solved means you have to start doing actual work and dealing with customer needs again. Equity is the only thing I think can solve this at scale. There aren't many who are willing to push through "kill my ego" hell just so they can sit in line at "help the customer" hell. You need a pretty big carrot (or stick) to sell this one.
The flexibility of allowing many different CRIs, CNIs and storage providers does let you find something that suits your use case, but that generality also means the error messages can't be as specific as those of a fully integrated solution.
The issue is exacerbated when the OS and cluster are hardened to CIS benchmarks with policy managers - trying to figure out what is preventing your app from running and why is just so much harder than it needs to be.
Cluster API has no integration with Karpenter, and there are further limitations [1], rendering it unusable for my use case.