The skill we do splits it into multiple passes to divide and conquer on task dimension and on files for that exact reason. Likewise, it loops (Ralph-ish) until it converges. It maintains a task queue and work log to stay on track. We are growing it over time, but now more about per-repo customization, while the bones are good cross-repo.
I would only trust frontier-grade harnesses to do this kind of skill run, and guilty-until-proven-innocent various harnesses x prompt combos because of that.
My point isn't that our 1 page skill eliminates the need for your startup, but that is a normal flow for more serious ai-augmented coders so you are picking a blatantly known-bad starting point for serious coders. That makes it unclear what value your tool brings and calls into question why you are refusing to measure yourself in a post about measurement.