undefined | Better HN

0 pointsyeck2y ago0 comments

Isn't that a massive quality improvement though? How many applications for LLMs are not feasible right now because of the ability for models to be swayed of course by a gentle breeze? If AI is a ship with a sail, data is the wind and alignment is the equivalent of a rudder.

0 comments

p1esk2y ago

The ability for models to be (easily) swayed is a different problem. I don’t see how AI safety would help with that.

yeckOP2y ago

> models to be (easily) swayed is a different problem

No, this is the alignment problem at a high level. You want a model to do X but sometimes it does Y.

Mechanistic interpretability, one area of study in AI alignment, is concerned with being able to reason about how a network "makes decisions" that lead it to an output.

If you wanted an LLM that doesn't succumb to certain prompt injections, it could be very helpful to be able to identity key points in the network that took the AI out of bounds.

Edit: I should add, I'm not referring to AI safety, I'm referring to AI alignment.

p1esk2y ago

You want a model to do X but sometimes it does Y

That’s too broad. Any AI problem falls under this characterization.

Also, AI interpretability and AI alignment are distinct subfields. Partially overlapping, but distinct goals.

j / k navigate · click thread line to collapse

0 comments

p1esk2y ago

The ability for models to be (easily) swayed is a different problem. I don’t see how AI safety would help with that.

yeckOP2y ago

> models to be (easily) swayed is a different problem

No, this is the alignment problem at a high level. You want a model to do X but sometimes it does Y.

Mechanistic interpretability, one area of study in AI alignment, is concerned with being able to reason about how a network "makes decisions" that lead it to an output.

If you wanted an LLM that doesn't succumb to certain prompt injections, it could be very helpful to be able to identity key points in the network that took the AI out of bounds.

Edit: I should add, I'm not referring to AI safety, I'm referring to AI alignment.

p1esk2y ago

You want a model to do X but sometimes it does Y

That’s too broad. Any AI problem falls under this characterization.

Also, AI interpretability and AI alignment are distinct subfields. Partially overlapping, but distinct goals.

j / k navigate · click thread line to collapse