I'm in the process of attempting just that, with limited success. In my case, I trained a classifier that takes the current surroundings of the player unit and tries to predict whether we'll gain an advantage in this segment of the game. I split the game into segments based on when the HP relationship between the teams changes. Gaining an advantage then means that you take more HP from the enemy team than you and your teammates lose.
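A minimal sketch of that labeling rule, assuming each segment is summarized by the total HP of both teams at its start and end. The `SegmentHP` type and its field names are made up for illustration, and how segments are actually cut is left out:

```python
from dataclasses import dataclass


@dataclass
class SegmentHP:
    """Hypothetical summary of one game segment (total HP per team)."""
    own_start: int
    own_end: int
    enemy_start: int
    enemy_end: int


def gained_advantage(seg: SegmentHP) -> bool:
    """True if the enemy team lost more HP in this segment than our team did."""
    own_lost = seg.own_start - seg.own_end
    enemy_lost = seg.enemy_start - seg.enemy_end
    return enemy_lost > own_lost


# We lose 10 HP, the enemy loses 30 HP: this segment counts as an advantage.
print(gained_advantage(SegmentHP(100, 90, 100, 70)))
```

An even trade is labeled as "no advantage" here; whether ties should count is a design choice this sketch does not settle.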
The classifier has about 90% accuracy on average, which seems good. I then use the likelihood predicted by this classifier to compute the weight with which I train each action, and to decide whether to train it positively (pulling its likelihood of being chosen up) or negatively (pushing the likelihood of that action down).
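One possible way to turn the predicted likelihood into a signed training weight is sketched below. The linear mapping `2p - 1` and the log-likelihood loss are assumptions chosen for illustration, not necessarily the exact formula used here:

```python
import math


def signed_weight(p_advantage: float) -> float:
    """Map a predicted advantage probability in [0, 1] to a weight in [-1, 1].

    Above 0.5 the weight is positive (reinforce the action), below 0.5 it is
    negative (discourage the action), and 0.5 gives no training signal.
    """
    if not 0.0 <= p_advantage <= 1.0:
        raise ValueError("p_advantage must be in [0, 1]")
    return 2.0 * p_advantage - 1.0


def weighted_action_loss(action_prob: float, p_advantage: float) -> float:
    """Loss term for one chosen action.

    Minimizing -w * log(pi(a)) raises pi(a) when w > 0 and lowers it when w < 0.
    """
    return -signed_weight(p_advantage) * math.log(action_prob)


# A promising situation (p = 0.9): the loss drops as the action gets more likely.
print(weighted_action_loss(0.5, 0.9), weighted_action_loss(0.6, 0.9))
```

With this shape, situations the classifier is unsure about (p near 0.5) contribute almost nothing, which limits how much a wrong prediction can hurt.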
However, what this model cannot correctly represent is that whether a given situation turns out to be good or bad in the long term depends heavily on how you play. So if I train this with replay data, I will score the situations by how well those (outdated) AIs could take advantage of them.
Next up, I'll try to fix this issue by introducing a graph-like stochastic structure. The basic idea is that I encode "from this state S if I take action A, then I can reach state T with P percent likelihood" into yet another neural network. If I then identify a state which is really beneficial in the sense that I can reliably convert it into an advantage, then I can use this graph to back-propagate that knowledge so that I get "from this state S, action A takes me to state T, then action B takes me to state U, and U is great".
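To make the "back-propagate that knowledge" step concrete, here is a toy sketch using plain value iteration over a hand-written transition table. The table stands in for the neural network; the states S, T, U and actions A, B follow the example above, while the probabilities and the discount factor are invented for illustration:

```python
# transitions[state][action] = list of (next_state, probability)
TRANSITIONS = {
    "S": {"A": [("T", 0.8), ("S", 0.2)]},
    "T": {"B": [("U", 0.9), ("S", 0.1)]},
    "U": {},  # "U is great": treated as a terminal state with a known value
}
KNOWN_VALUE = {"U": 1.0}


def back_up_values(transitions, known_value, sweeps=50, discount=0.95):
    """Spread the value of known-good states backwards through the graph."""
    values = {state: known_value.get(state, 0.0) for state in transitions}
    for _ in range(sweeps):
        for state, actions in transitions.items():
            if state in known_value or not actions:
                continue  # keep fixed values; skip dead ends
            values[state] = max(
                discount * sum(p * values[nxt] for nxt, p in outcomes)
                for outcomes in actions.values()
            )
    return values


def best_action(transitions, values, state):
    """Pick the action with the highest expected next-state value."""
    actions = transitions[state]
    return max(
        actions,
        key=lambda a: sum(p * values[nxt] for nxt, p in actions[a]),
    )


values = back_up_values(TRANSITIONS, KNOWN_VALUE)
print(values, best_action(TRANSITIONS, values, "S"))
```

After a few sweeps, T scores higher than S because it is one step closer to U, which is exactly the "S, then T, then U" chain described above.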
That should allow me to train with historical data to identify which transitions are possible, and then I can combine that with realtime data about the desirability of each state. So basically I'd do A* pathfinding over the graph of possible states to identify which actions are needed to bring me from my current situation into the closest "I will surely win" situation. Except that the graph is memorized by an AI because the real state-space is huge: 15x15 fields with 6 units + 5 environment states => roughly 11^(15*15) = 11^225 states, which is on the order of 10^234.
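On a graph small enough to write down, the search could look roughly like the sketch below. It uses uniform-cost search (A* with a zero heuristic) with edge cost -log(P), so the cheapest path is the most likely chain of transitions. The graph, state names, and probabilities are invented, and the sketch simplifies by scoring only the single most likely outcome chain instead of handling every possible outcome of each action:

```python
import heapq
import math

# graph[state][action] = list of (next_state, probability); toy data only
GRAPH = {
    "S": {"A": [("T", 0.8), ("S", 0.2)], "C": [("U", 0.5), ("S", 0.5)]},
    "T": {"B": [("U", 0.9), ("S", 0.1)]},
    "U": {},
}


def most_reliable_plan(graph, start, goals):
    """Return (actions, probability) for the likeliest path to a goal, or None."""
    frontier = [(0.0, 0, start, [])]  # (cost, tie-breaker, state, actions so far)
    best_cost = {start: 0.0}
    counter = 0
    while frontier:
        cost, _, state, plan = heapq.heappop(frontier)
        if state in goals:
            return plan, math.exp(-cost)
        if cost > best_cost.get(state, math.inf):
            continue  # stale queue entry
        for action, outcomes in graph.get(state, {}).items():
            for nxt, p in outcomes:
                if p <= 0.0:
                    continue
                new_cost = cost - math.log(p)  # multiplying probs = adding -log
                if new_cost < best_cost.get(nxt, math.inf):
                    best_cost[nxt] = new_cost
                    counter += 1
                    heapq.heappush(
                        frontier, (new_cost, counter, nxt, plan + [action])
                    )
    return None


plan, prob = most_reliable_plan(GRAPH, "S", {"U"})
print(plan, round(prob, 2))
```

Here the two-step route A then B (0.8 * 0.9 = 0.72) beats the direct but riskier action C (0.5). A real A* would additionally need an admissible heuristic, for example one derived from the learned desirability of each state.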