undefined | Better HN

0 pointsthrowaway0123_51y ago0 comments

> Some very simple explanations, the models were probably overfitted to a degree on Python given its popularity in AI/ML work, and SWE-bench is all Python. Also, the underlying Github issues are quite old, so they probably contaminated the training data and the models have simply memorized the answers.

My understanding is that it works by checking if the proposed solution passes test-cases included in the original (human) PR. This seems to present some problems too, because there are surely ways to write code that passes the tests but would fail human review for one reason or another. It would be interesting to not only see the pass rate but also the rate at which the proposed solutions are preferred to the original ones (preferably evaluated by a human but even an LLM comparing the two solutions would be interesting).

0 comments

Bjorkbat1y ago

If I recall correctly the authors of the benchmark did mention on Twitter that for certain issues models will submit an answer that technically passes the test but is kind of questionable, so yeah, good point.

j / k navigate · click thread line to collapse

0 pointsthrowaway0123_51y ago0 comments

0 comments

Bjorkbat1y ago

j / k navigate · click thread line to collapse