
stingraycharles 6d ago

Am I the only one who wasn’t particularly impressed by AutoResearch? If you looked at what the agent was actually doing, it was just tuning parameters mostly, not really trying different novel approaches.

I couldn’t help but see this as mostly a very inefficient variant of hyperparameter optimization. Someone correct me if I’m wrong; I may be looking at this too pessimistically.

lacker 6d ago

I didn't dig into what the actual repository was doing, but personally, I took some inspiration from the idea after reading about it and realizing that I might have been underestimating the ability of LLMs. I put a bit more work into a performance harness I was using locally and just set some agents to brainstorming and they did seem to find some great stuff. So I don't really have a stance one way or another on this specific repo, but the general idea seems like a really good one.

delis-thumbs-7e 5d ago

Could you elaborate on how specifically you had been underestimating models? Do you mean just using tighter harnessing to make them work in a structured agentic way, or something else?

lacker 5d ago

For the specific code I was working on, I had a general idea of the sort of performance improvement that would be possible. I just thought that it would be too hard for the models to figure out without a lot of hand-holding.

But it turned out to be not "too hard ever" but more like: in 1 out of every 5 tries, the model did in fact manage to get a large refactoring to the point where it improved performance. So I set it up to try something, run the perf test, see if it worked, and if not, throw it away and repeat. Then it started, slowly, finding some useful things.
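The try / measure / keep-or-discard loop described above could be sketched roughly like this. It is only a minimal illustration, not the actual setup: `accept_reject_loop`, `toy_propose`, and `toy_measure` are hypothetical names, the "candidate" is a single number standing in for a refactored codebase, and the "perf test" is a toy function where lower means faster.

```python
import random
from typing import Callable, Tuple


def accept_reject_loop(
    baseline: float,
    propose: Callable[[float], float],
    measure: Callable[[float], float],
    attempts: int,
) -> Tuple[float, float, int]:
    """Keep a candidate only when it measures strictly better than the best so far."""
    best = baseline
    best_time = measure(best)
    kept = 0
    for _ in range(attempts):
        candidate = propose(best)            # stand-in for "ask the agent to try something"
        candidate_time = measure(candidate)  # stand-in for running the perf test
        if candidate_time < best_time:
            best, best_time = candidate, candidate_time
            kept += 1
        # Otherwise the candidate is thrown away and the loop repeats.
    return best, best_time, kept


# Toy stand-ins (assumptions for illustration only).
rng = random.Random(0)


def toy_propose(current: float) -> float:
    """Randomly perturb the current best candidate."""
    return current + rng.uniform(-1.0, 1.0)


def toy_measure(x: float) -> float:
    """Pretend runtime: fastest (1.0) at x == 3.0."""
    return (x - 3.0) ** 2 + 1.0


best, best_time, kept = accept_reject_loop(0.0, toy_propose, toy_measure, attempts=200)
print(f"kept {kept} of 200 attempts, pretend runtime now {best_time:.3f}")
```

Most attempts get discarded, which matches the "1 out of every 5 tries" experience: the loop only needs an occasional win, as long as the measurement is trustworthy.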

inciampati 5d ago

Just remember that they will do clever but useless things to improve the metric, like changing the random seed, as per autoresearch's hero image. lol! imo, out-of-the-box thinking is needed.

druub 5d ago

Ever since AlphaEvolve, the idea has been that if you build a harness which can evaluate solutions and give LLMs a database where they can keep storing their work and then sample from it, they do find non-trivial solutions over time, learning from their own past ideas.

It is the ultimate manifestation of test-time scaling. I think Karpathy just popularised it.
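The "evaluate, store, sample" pattern can be sketched as follows. This is a toy under stated assumptions, not AlphaEvolve's actual algorithm: `evolve`, `toy_evaluate`, and `toy_mutate` are hypothetical names, the solutions are short lists of numbers rather than programs, and a random perturbation stands in for the LLM proposing a variation of a past idea.

```python
import random
from typing import Callable, List, Tuple

Entry = Tuple[float, List[float]]  # (score, solution); higher score is better


def evolve(
    seed_solution: List[float],
    evaluate: Callable[[List[float]], float],
    mutate: Callable[[List[float], random.Random], List[float]],
    rounds: int,
    rng: random.Random,
    sample_size: int = 3,
) -> List[Entry]:
    """Grow an archive by sampling past entries and storing scored variations."""
    archive: List[Entry] = [(evaluate(seed_solution), seed_solution)]
    for _ in range(rounds):
        # Sample a few past entries and pick the best one as the parent.
        contenders = rng.sample(archive, k=min(sample_size, len(archive)))
        _, parent = max(contenders, key=lambda entry: entry[0])
        child = mutate(parent, rng)               # stand-in for the LLM's new attempt
        archive.append((evaluate(child), child))  # every attempt is stored, good or bad
    return archive


TARGET = [1.0, 2.0]  # assumed optimum of the toy problem


def toy_evaluate(solution: List[float]) -> float:
    """Negative squared distance to TARGET, so 0.0 is the best possible score."""
    return -sum((a - b) ** 2 for a, b in zip(solution, TARGET))


def toy_mutate(parent: List[float], rng: random.Random) -> List[float]:
    """Copy the parent and nudge one coordinate."""
    child = list(parent)
    index = rng.randrange(len(child))
    child[index] += rng.gauss(0.0, 0.5)
    return child


archive = evolve([0.0, 0.0], toy_evaluate, toy_mutate, rounds=100, rng=random.Random(0))
best_score, best_solution = max(archive, key=lambda entry: entry[0])
print(f"{len(archive)} stored attempts, best score {best_score:.4f}")
```

Unlike a plain keep-or-discard loop, weaker attempts stay in the archive and can still be sampled later, which is one way to read "learning from their own past ideas".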

clbrmbr 6d ago

Karpathy embedded within an organization is way more impressive than him out on his own with hot takes and little projects. I hope he does great things for Anthropic.

stingraycharles (OP) 5d ago

Absolutely, I wasn’t saying that him being at Anthropic wasn’t going to be effective, I just think his little projects wouldn’t be very interesting if his name wasn’t attached to them.

vdelpuerto 5d ago

I was looking for options outside the box (everything is more context or RAG) and have been using this approach for about a month with good results. https://github.com/VDP89/fscars

latentsea 5d ago

I was impressed that I was able to take the same basic idea and apply it to anything that a Claude could construct a metric for. It's nice being able to just run /autoresearch and speed up your test suites, shave time off your builds, etc.

It's a decent tool to have in the toolbox.

teravor 6d ago

    > Am I the only one who wasn’t particularly impressed by AutoResearch?

isn't it just a nerfed AlphaEvolve? https://arxiv.org/abs/2506.13131

DesaiAshu 6d ago

Inefficient variants with $100m+ worth of compute will still probably outperform the best team of researchers

godelski 5d ago

That's not the question. The question is how much you need to give the best team of researchers to beat $100m+ worth of compute. $1m of compute? $10m? Clearly giving the best team $100m is going to beat out giving an inefficient group $100m. It does in fact matter who you throw your money at...
