1. A brief investigation into the GPT-5.5 regression claims (stet.sh) | 1 | bisonbear | 5d ago | 0
2. The Opus 4.7 reasoning curve - Medium is the best default? (stet.sh) | 1 | bisonbear | 12d ago | 0
3. GPT-5.5 low vs. medium vs. high vs. xhigh: the reasoning curve on 26 real tasks (stet.sh) | 2 | bisonbear | 17d ago | 0
4. GPT-5.5 vs. GPT-5.4 vs. Opus 4.7 on 56 real coding tasks from 2 open source repo (stet.sh) | 4 | bisonbear | 24d ago | 0
5. I ran Opus 4.7 vs. Old Opus 4.6 vs. New Opus 4.6 on 28 Zod tasks (stet.sh) | 2 | bisonbear | 1mo ago | 0
6. Coding evals are broken. CI is green while AI code quality goes unmeasured (stet.sh) | 1 | bisonbear | 1mo ago | 0
7. Agents.md is the highest-leverage code you're not testing (stet.sh) | 1 | bisonbear | 1mo ago | 0