On-package interconnect latency and power are far lower than going off package, and AMD also doubled the L3 cache size to compensate for the increased memory latency. The problem with putting too much IO on a single die is that the perimeter of the die/package has to fit all the IO traces on the substrate/motherboard, which means more layers and higher cost. Everything is a performance/cost/power trade-off. But I would say that off-package controllers serving multiple CPU chips are probably less viable now than they were before the current core count increase. That's because synchronization traffic would require insanely wide buses to those controllers if you wanted lots of sockets, and then you would need a lot more pins for them.
On the other hand, if you had those controllers on package, you would get a lot more bandwidth at much lower power and latency, which is what AMD has done.
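To put rough numbers on the bus width argument, here is a back-of-envelope sketch. It assumes a simple broadcast-snoop model in which every coherence request goes to a central controller and is forwarded to every other socket. That model, the function names and all the parameters are my own illustrative assumptions, not a description of any real AMD or Intel design.

```python
import math


def controller_traffic(sockets, requests_per_socket, bytes_per_message):
    """Bytes/s through a hypothetical central controller.

    Assumed model: each socket sends its requests to the controller
    (inbound), and the controller forwards each one to all other
    sockets (outbound).
    """
    inbound = sockets * requests_per_socket * bytes_per_message
    outbound = sockets * (sockets - 1) * requests_per_socket * bytes_per_message
    return inbound + outbound


def pins_needed(bytes_per_second, gbit_per_pin):
    """Signal pins needed to carry the traffic at an assumed per-pin rate."""
    bits_per_second = bytes_per_second * 8
    return math.ceil(bits_per_second / (gbit_per_pin * 1e9))


if __name__ == "__main__":
    # All inputs below are made-up placeholders; plug in your own estimates.
    for n in (2, 4, 8):
        traffic = controller_traffic(n, requests_per_socket=10**9, bytes_per_message=16)
        print(n, "sockets:", pins_needed(traffic, gbit_per_pin=16), "pins")
```

Under this model the total traffic works out to sockets² × requests × message size, so doubling the socket count quadruples what the controller has to carry. Real designs use directories and snoop filters to cut this down, so treat it as an upper-bound illustration only.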