bisonbear | Better HN

bisonbear

39 karma | Joined September 17, 2025 | 57 submissions

Building evals for AI coding agents, on your repo. Tests pass. Nobody's measuring the rest. http://stet.sh email ben@stet.sh

Recent submissions

1. A brief investigation into the GPT-5.5 regression claims (stet.sh)
   1 point by bisonbear 5d ago | 0 comments

2. The Opus 4.7 reasoning curve - Medium is the best default? (stet.sh)
   1 point by bisonbear 12d ago | 0 comments

3. GPT-5.5 low vs. medium vs. high vs. xhigh: the reasoning curve on 26 real tasks (stet.sh)
   2 points by bisonbear 17d ago | 0 comments

4. GPT-5.5 vs. GPT-5.4 vs. Opus 4.7 on 56 real coding tasks from 2 open source repo (stet.sh)
   4 points by bisonbear 24d ago | 0 comments

5. I ran Opus 4.7 vs. Old Opus 4.6 vs. New Opus 4.6 on 28 Zod tasks (stet.sh)
   2 points by bisonbear 1mo ago | 0 comments

6. Coding evals are broken. CI is green while AI code quality goes unmeasured (stet.sh)
   1 point by bisonbear 1mo ago | 0 comments

7. Agents.md is the highest-leverage code you're not testing (stet.sh)
   1 point by bisonbear 1mo ago | 0 comments