HBM won't be on-die, but it will be on-package: HBM relies on chip stacking to get the desired throughput in a small surface area. Regardless, the throughput would stomp system DRAM something awful (latency probably less so, since HBM is still DRAM underneath), and if it's a proper L4 cache then the CPU would benefit as well.
IBM does something similar (though not for graphics) in recent POWER CPUs with the Centaur memory controllers. They are off-chip memory controllers with a bunch of eDRAM that acts as an L4 cache (the difference here is that each system has multiple Centaur controllers to handle different DIMM slots). They're able to burst to ~96GB/sec to system memory using this, so having a good amount of on-package HBM would probably yield similar gains :)
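
To put a rough number on the L4 cache point, here's a back-of-the-envelope sketch. It assumes a simple model where each byte is served either from the L4 or from system DRAM, so the time per byte is a hit-rate-weighted average. The function name and the figures (128 GB/s for one HBM stack, 25.6 GB/s for dual-channel DDR3-1600, an 80% hit rate) are illustrative assumptions, not measurements from any real part.

```python
def effective_bandwidth_gbs(hit_rate, l4_gbs, dram_gbs):
    """Rough average bandwidth with an L4 cache in front of system DRAM.

    Assumed model: a fraction `hit_rate` of bytes comes from the L4 at
    `l4_gbs`, the rest from DRAM at `dram_gbs`. Time per byte is averaged,
    which makes the result a weighted harmonic mean of the two bandwidths.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    if l4_gbs <= 0 or dram_gbs <= 0:
        raise ValueError("bandwidths must be positive")
    seconds_per_gb = hit_rate / l4_gbs + (1.0 - hit_rate) / dram_gbs
    return 1.0 / seconds_per_gb


# Illustrative numbers only: one HBM stack vs. dual-channel DDR3-1600.
print(round(effective_bandwidth_gbs(0.8, 128.0, 25.6), 1))  # 71.1
```

Even at an 80% hit rate the slow path dominates the average, which is why the size of the on-package cache matters as much as its raw speed.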