In your learning problem, where things were made tractable by differentiation, you have something like an elevation map that you are following. In the multi-stage decision problem you have something more like a fractal elevation map: when you want to know the value of a particular point, you have to look for the highest or lowest point on the elevation map you get by zooming in on the region that results from having chosen a particular policy.
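A minimal sketch of the contrast, using hypothetical one-dimensional functions of my own invention (none of these names come from the text): on a smooth landscape the gradient at any probe point is informative, while on a flat landscape with a single spike the gradient is zero almost everywhere and tells you nothing about where the spike is.

```python
# Toy illustration: a smooth bowl gives a useful slope everywhere,
# while a needle-in-a-haystack landscape gives a zero slope almost
# everywhere. Both functions are invented for this sketch.

def smooth_elevation(x):
    """A smooth bowl: the slope always points toward the minimum at 0."""
    return x * x

def needle_elevation(x, secret=0.6180339887):
    """Flat everywhere except a single spike at one secret point."""
    return 1.0 if x == secret else 0.0

def finite_diff_gradient(f, x, eps=1e-6):
    """Numerical slope; stands in for the differentiation that made
    the learning problem tractable."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# On the smooth map, the gradient at any probe point says which way to go.
assert finite_diff_gradient(smooth_elevation, 2.0) > 0  # slope says: move left

# On the needle map, every ordinary probe point reports a slope of exactly
# zero, so gradient following learns nothing about where the spike is.
for probe in [-1.0, 0.0, 0.25, 0.9]:
    assert finite_diff_gradient(needle_elevation, probe) == 0.0
```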
The problem is that, since this is a multi-agent environment, other agents can react to your policy choice. They can, for example, arrange for you to get a high value only if you enter the correct password on a form. That elevation map is designed to be a featureless plain everywhere, with another fractal zoom corresponding to high utility or a low error term only at the point where you enter the right password.
Choose a random point and you won't have any information about what the password was; the optimization process won't help you. So you have to search. One way is random search: do that and you eventually find a differing elevation, assuming one exists. But what if there are two passwords? One takes you to a low-elevation fractal world that corresponds to a low reward, because it is a honeypot. The other takes you to the fractal zoom whose elevation map is conditioned on your having root access to the system.
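A hedged sketch of the two-password situation (the password strings and reward values are invented for illustration): random search can eventually stumble on a point of differing elevation, but the point it finds first may be the honeypot, and nothing in the surrounding flat landscape distinguishes it from the real password.

```python
import random

# Hypothetical flat landscape with two special points: a honeypot
# (low reward) and the real password (high reward). All values invented.
HONEYPOT = "hunter2"
REAL_PASSWORD = "correct horse battery staple"

def reward(guess):
    """Flat almost everywhere; elevation differs only at the two passwords."""
    if guess == HONEYPOT:
        return -10.0   # the low-elevation fractal world
    if guess == REAL_PASSWORD:
        return +10.0   # the zoom conditioned on root access
    return 0.0         # the featureless plain

def random_search(candidates, seed=0):
    """Sample points until any elevation differing from the plain appears."""
    rng = random.Random(seed)
    pool = list(candidates)
    rng.shuffle(pool)
    for guess in pool:
        if reward(guess) != 0.0:
            return guess   # first differing elevation found, good or bad
    return None

# Random search stops at whichever special point it hits first; the flat
# neighborhood gives no hint that a better point still exists elsewhere.
pool = [f"guess-{i}" for i in range(10_000)] + [HONEYPOT, REAL_PASSWORD]
found = random_search(pool, seed=1)
assert found in (HONEYPOT, REAL_PASSWORD)
```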
This argument shows that we would actually need to search over every point to guarantee the best possible answer. Yet to do that we have to search the entire continuous distribution over policies. Since by definition there are infinitely many states, even a computer with infinite search speed can't enumerate them: under every policy choice there is another infinite fractal that also needs full enumeration. We get non-termination by a diagonalization-style argument, even for a computer with infinite speed.
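The blow-up can be made concrete with a finite stand-in (the branching factor and depth here are arbitrary choices of mine, not from the text): even when each level offers only finitely many choices, exhaustive evaluation grows exponentially with depth. The text's argument then replaces finite branching with a continuum at every level, so the recursion never bottoms out.

```python
def enumeration_cost(branching, depth):
    """Nodes visited by exhaustive search: evaluating each policy choice
    requires fully enumerating the sub-landscape beneath it."""
    if depth == 0:
        return 1
    # one node for this choice, plus a full enumeration under each branch
    return 1 + branching * enumeration_cost(branching, depth - 1)

# Even tiny finite branching explodes exponentially with depth...
assert enumeration_cost(2, 10) == 2**11 - 1     # 2047 nodes
assert enumeration_cost(10, 6) > 1_000_000

# ...and the text's argument takes the branching to a continuum at every
# depth, so there is no base case for the recursion to ever reach.
```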
Now observe that in our reality passwords exist. Less extreme: notice that reacting to policy choice in general, for example moving out of the way of a car that drives toward you but walking normally if it doesn't, isn't actually an unusual property in decision problems. It is normal.