Given the complexity of GPU driver stacks, what you're asking for isn't firmware but a set of multi-GHz CPUs dedicated to that purpose.
Plus, you'd need RPC calls all the time, and their latency would tank performance.
It'd also not be tinkerable at all, unlike what we have today. It's advocating for exactly the opposite of open.