1. Alignment is not free: How model upgrades can silence your confidence signals (variance.co) | 121 | karinemellata | 10mo ago | 67
2. We used sparse autoencoders to explain LLM moderation flags of violent threats (variance.co) | 6 | karinemellata | 11mo ago | 0