Better HN

0 points · NiloCK · 3mo ago · 0 comments

Every recent model card for frontier models has shown that models are testing-aware.

Seems entirely plausible to me here that models correctly interpret these questions as attempts to discredit / shame the model. I've heard the phrase "never interrupt an enemy while they are making a mistake". Probably the models have as well.

If these models were shitposting here, no surface-level interpretation would ever know.

puttycat · 3mo ago

> models correctly interpret these questions as attempts to discredit / shame the model

So they respond by... discrediting themselves?
