This is still perfectly analogous to regular build tooling, like triggering a rebuild in Jenkins or using ‘Build with Parameters.’ The point is to treat builds and runs of an experiment setup exactly the same way, with the same tooling, monitoring, and data capture, as any other deployed program. There is nothing special about a one-off job that trains a model or computes an experimental result compared with a job that runs a database-tuning experiment, load-tests a web service, or does any other deployed work. You monitor and probe the key stats and health of the experiment, you can reproduce the exact run or rerun it with modified parameters, and the run produces output artifacts or writes data. It’s all exactly the same.
Basically, if someone shows me a supposed ML experiment tracking system, my first question is, “If I replace the phrase ‘ML experiment’ with ‘generic computing task,’ does the tool still handle everything exactly the same?”
If not, the idea has failed: you’re breaking model training and tuning jobs out of the regular deployment model instead of using consistent tooling to manage experiment runs alongside every other kind of “job” you can “run.”
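To make the argument concrete, here is a minimal sketch of that test in code. All names here (`JobSpec`, `run_job`, and the two entrypoints) are hypothetical illustrations, not any real tool's API: the point is that a "model training" run and a "load test" flow through the exact same parameterized, reproducible job machinery.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class JobSpec:
    """A generic, parameterized job description -- nothing ML-specific."""
    name: str
    params: tuple  # sorted (key, value) pairs, so the spec is hashable

    def run_id(self) -> str:
        # Same spec -> same id: any run can be reproduced exactly,
        # or rerun with modified parameters under a new id.
        blob = json.dumps({"name": self.name, "params": self.params},
                          sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()[:12]


def run_job(spec: JobSpec, entrypoint: Callable[[dict], dict]) -> dict:
    """Run any job the same way: record params, id, and output artifacts."""
    output = entrypoint(dict(spec.params))
    return {"run_id": spec.run_id(),
            "params": dict(spec.params),
            "output": output}


# Stand-in entrypoints: a 'training' job and a 'load test' job are
# handled identically by the runner above.
def train_model(params: dict) -> dict:
    return {"epochs_done": params["epochs"]}  # placeholder for real training


def load_test(params: dict) -> dict:
    return {"requests_sent": params["rps"] * 10}  # placeholder for a real test


train = run_job(JobSpec("train-model", (("epochs", 3),)), train_model)
load = run_job(JobSpec("load-test", (("rps", 100),)), load_test)
```

If swapping `train_model` for `load_test` required changing anything other than the entrypoint and its parameters, the tracking system would fail the test above.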