If you are not familiar with RL, I recommend first reading the two articles that the author links to:
- https://www.alexirpan.com/2018/02/14/rl-hard.html
- https://himanshusahni.github.io/2018/02/23/reinforcement-lea...
They're not so recent anymore, but they still capture the problem well.
Long story short: RL doesn't work yet. We're not sure it'll ever work. Some big companies are betting that it will.
> My own hypothesis is that the reward function for learning organisms is really driven from maintaining homeostasis and minimizing surprise.
Both directions are actively researched: maximizing surprise (to improve exploration), and minimizing surprise (to improve exploitation).
See e.g. "Exploration by Random Network Distillation" for the former and "Surprise Minimizing RL in Dynamic Environments" for the latter.