In your learning problem, where things were made tractable by differentiation, you have something like an elevation map that you are following. In the multi-stage decision problem you have something more like a fractal elevation map: when you want to know the value of a particular point, you have to look for the highest or lowest point on the elevation map you get by zooming in on the region that results from having chosen a particular policy.
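A minimal sketch of the contrast, using hypothetical one-dimensional functions of my own invention (none of these names come from the text): on a smooth landscape the gradient at any probe point is informative, while on a flat landscape with a single spike the gradient is zero almost everywhere and tells you nothing about where the spike is.

```python
# Toy illustration: a smooth bowl gives a useful slope everywhere,
# while a needle-in-a-haystack landscape gives a zero slope almost
# everywhere. Both functions are invented for this sketch.

def smooth_elevation(x):
    """A smooth bowl: the slope always points toward the minimum at 0."""
    return x * x

def needle_elevation(x, secret=0.6180339887):
    """Flat everywhere except a single spike at one secret point."""
    return 1.0 if x == secret else 0.0

def finite_diff_gradient(f, x, eps=1e-6):
    """Numerical slope; stands in for the differentiation that made
    the learning problem tractable."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# On the smooth map, the gradient at any probe point says which way to go.
assert finite_diff_gradient(smooth_elevation, 2.0) > 0  # slope says: move left

# On the needle map, every ordinary probe point reports a slope of exactly
# zero, so gradient following learns nothing about where the spike is.
for probe in [-1.0, 0.0, 0.25, 0.9]:
    assert finite_diff_gradient(needle_elevation, probe) == 0.0
```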
The problem is that, since this is a multi-agent environment, other agents can react to your policy choice. They can, for example, arrange for you to get a high value only if you enter the correct password on a form. That elevation map is designed to be a featureless plain everywhere, with another fractal zoom corresponding to high utility or a low error term only at the point where you enter the right password.
Choose a random point and you won't have any information about what the password was; the optimization process won't help you. So you have to search. One way is random search: do that and you eventually find a differing elevation, assuming one exists. But what if there are two passwords? One takes you to a low-elevation fractal world that corresponds to a low reward, because it is a honeypot. The other takes you to the fractal zoom whose elevation map is conditioned on your having root access to the system.
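A hedged sketch of the two-password situation (the password strings and reward values are invented for illustration): random search can eventually stumble on a point of differing elevation, but the point it finds first may be the honeypot, and nothing in the surrounding flat landscape distinguishes it from the real password.

```python
import random

# Hypothetical flat landscape with two special points: a honeypot
# (low reward) and the real password (high reward). All values invented.
HONEYPOT = "hunter2"
REAL_PASSWORD = "correct horse battery staple"

def reward(guess):
    """Flat almost everywhere; elevation differs only at the two passwords."""
    if guess == HONEYPOT:
        return -10.0   # the low-elevation fractal world
    if guess == REAL_PASSWORD:
        return +10.0   # the zoom conditioned on root access
    return 0.0         # the featureless plain

def random_search(candidates, seed=0):
    """Sample points until any elevation differing from the plain appears."""
    rng = random.Random(seed)
    pool = list(candidates)
    rng.shuffle(pool)
    for guess in pool:
        if reward(guess) != 0.0:
            return guess   # first differing elevation found, good or bad
    return None

# Random search stops at whichever special point it hits first; the flat
# neighborhood gives no hint that a better point still exists elsewhere.
pool = [f"guess-{i}" for i in range(10_000)] + [HONEYPOT, REAL_PASSWORD]
found = random_search(pool, seed=1)
assert found in (HONEYPOT, REAL_PASSWORD)
```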
This argument shows that we would actually need to search over every point to guarantee the best possible answer. Yet to do that we have to search the entire continuous distribution over policies. Since by definition there are infinitely many states, even a computer with infinite search speed can't enumerate them: under every policy choice there is another infinite fractal that also needs full enumeration. We get non-termination by a diagonalization-style argument, even for a computer with infinite speed.
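The blow-up can be made concrete with a finite stand-in (the branching factor and depth here are arbitrary choices of mine, not from the text): even when each level offers only finitely many choices, exhaustive evaluation grows exponentially with depth. The text's argument then replaces finite branching with a continuum at every level, so the recursion never bottoms out.

```python
def enumeration_cost(branching, depth):
    """Nodes visited by exhaustive search: evaluating each policy choice
    requires fully enumerating the sub-landscape beneath it."""
    if depth == 0:
        return 1
    # one node for this choice, plus a full enumeration under each branch
    return 1 + branching * enumeration_cost(branching, depth - 1)

# Even tiny finite branching explodes exponentially with depth...
assert enumeration_cost(2, 10) == 2**11 - 1     # 2047 nodes
assert enumeration_cost(10, 6) > 1_000_000

# ...and the text's argument takes the branching to a continuum at every
# depth, so there is no base case for the recursion to ever reach.
```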
Now observe that in our reality passwords exist. Less extreme: notice that reacting to policy choice in general, for example moving out of the way of a car that drives toward you but walking normally if it doesn't, isn't actually an unusual property in decision problems. It is normal.