undefined | Better HN

0 pointsOvbiousError9mo ago0 comments

I don't think the human is the problem here, but the time it takes to run the full testing suite.

0 comments

tlb9mo ago

Yes, and (some near-future) AI is also more patient and better at multitasking than a reasonable human. It can make a change, submit for full fuzzing, and if there's a problem it can continue with the saved context it had when making the change. It can work on 100s of such changes in parallel, while a human trying to do this would mix up the reasons for the change with all the other changes they'd done by the time the fuzzing result came back.

LLMs are worse at many things than human programmers, so you have to try to compensate by leveraging the things they're better at. Don't give up with "they're bad at such and such" until you've tried using their strengths.

HappMacDonald9mo ago

You can't run N bots in parallel with testing between each attempt unless you're also running N tests in parallel.

If you could run N tests in parallel, then you could probably also run the components of one test in parallel and keep it from taking 2 hours in the first place.

To me this all sounds like snake oil to convince people to do something they were already doing, but by also spinning up N times as many compute instances and run a burn endless tokens along the way. And by the time it's demonstrated that it doesn't really offer anything more than doing it yourself, well you've already given them all of your money so their job is done.

abdullin9mo ago

Running tests is already an engineering problem.

In one of the systems (supply chain SaaS) we invested so much effort in having good tests in a simulated environment, that we could run full-stack tests at kHz. Roughly ~5k tests per second or so on a laptop.

abdullin9mo ago

Humans tend to lack inhumane patience.

diggan9mo ago

It is kind of a human problem too, although that the full testing suite takes X hours to run is also not fun, but it makes the human problem larger.

Say you're Human A, working on a feature. Running the full testing suite takes 2 hours from start to finish. Every change you do to existing code needs to be confirmed to not break existing stuff with the full testing suite, so some changes it takes 2 hours before you have 100% understanding that it doesn't break other things. How quickly do you lose interest, and at what point do you give up to either improve the testing suite, or just skip that feature/implement it some other way?

Now say you're Robot A working on the same task. The robot doesn't care if each change takes 2 hours to appear on their screen, the context is exactly the same, and they're still "a helpful assistant" 48 hours later when they still try to get the feature put together without breaking anything.

If you're feeling brave, you start Robot B and C at the same time.

abdullin9mo ago

This is the workflow that ChatGPT Codex demonstrates nicely. Launch any number of «robotic» tasks in parallel, then go on your own. Come back later to review the results and pick good ones.

diggan9mo ago

Well, they're demonstrating it somewhat, it's more of a prototype today. First tell is the low limit, I think the longest task for me been 15 minutes before it gives up. Second tell is still using a chat UI which is simple to implement, easy to implement and familiar, but also kind of lazy. There should be a better UX, especially with the new variations they just added. From the top of my head, some graph-like UX might have been better.

1 more reply

TeMPOraL9mo ago

Worked in such a codebase for about 5 years.

No one really cares about improving test times. Everyone either suffers in private or gets convinced it's all normal and look at you weird when you suggest something needs to be done.

diggan9mo ago

There a few of us around, but it's not a lot, agree. It really is an uphill battle trying to get development teams to design and implement test suites the same way they do with other "more important" code.

londons_explore9mo ago

The full test suite is probably tens of thousands of tests.

But AI will do a pretty decent job of telling you which tests are most likely to fail on a given PR. Just run those ones, then commit. Cuts your test time from hours down to seconds.

Then run the full test suite only periodically and automatically bisect to find out the cause of any regressions.

Dramatically cuts the compute costs of tests too, which in big codebase can easily become whole-engineers worth of costs.

tele_ski9mo ago

It's an interesting idea, but reactive, and could cause big delays due to bisecting and testing on those regressions. There's the 'old' saying that the sooner the bug is found the cheaper it is to fix, seems weird to intentionally push finding side effect bugs later in the process because faster CI runs. Maybe AI will get there but it seems too aggressive right now to me. But yeah, put the automation slider where you're comfortable.

Byamarro9mo ago

I work in web dev, so people sometimes hook code formatting as a git commit hook or sometimes even upon file save. The tests are problematic tho. If you work at huge project it's a no go idea at all. If you work at medium then the tests are long enough to block you, but short enough for you not to be able to focus on anything else in the meantime.

9rx9mo ago

Unless you are doing something crazy like letting the fuzzer run on every change (cache that shit), the full test suite taking a long time suggests that either your isolation points are way too large or you are letting the LLM cross isolated boundaries and "full testing suite" here actually means "multiple full testing suites". The latter is an easy fix: Don't let it. Force it stay within a single isolation zone just like you'd expect of a human. The former is a lot harder to fix, but I suppose ending up there is a strong indicator that you can't trust the human picking the best LLM result in the first place and that maybe this whole thing isn't a good idea for the people in your organization.

j / k navigate · click thread line to collapse

0 comments

tlb9mo ago

HappMacDonald9mo ago

You can't run N bots in parallel with testing between each attempt unless you're also running N tests in parallel.

If you could run N tests in parallel, then you could probably also run the components of one test in parallel and keep it from taking 2 hours in the first place.

abdullin9mo ago

Running tests is already an engineering problem.

abdullin9mo ago

Humans tend to lack inhumane patience.

diggan9mo ago

It is kind of a human problem too, although that the full testing suite takes X hours to run is also not fun, but it makes the human problem larger.

If you're feeling brave, you start Robot B and C at the same time.

abdullin9mo ago

This is the workflow that ChatGPT Codex demonstrates nicely. Launch any number of «robotic» tasks in parallel, then go on your own. Come back later to review the results and pick good ones.

diggan9mo ago

1 more reply

TeMPOraL9mo ago

Worked in such a codebase for about 5 years.

No one really cares about improving test times. Everyone either suffers in private or gets convinced it's all normal and look at you weird when you suggest something needs to be done.

diggan9mo ago

londons_explore9mo ago

The full test suite is probably tens of thousands of tests.

But AI will do a pretty decent job of telling you which tests are most likely to fail on a given PR. Just run those ones, then commit. Cuts your test time from hours down to seconds.

Then run the full test suite only periodically and automatically bisect to find out the cause of any regressions.

Dramatically cuts the compute costs of tests too, which in big codebase can easily become whole-engineers worth of costs.

tele_ski9mo ago

Byamarro9mo ago

9rx9mo ago

j / k navigate · click thread line to collapse