You could argue game engines are notoriously complex, but the Linux kernel would like a word.
Software modules are commoditized at this point. Just pick your building blocks and spend all your time tying them together in ever-increasing complexity. Curl (and similar tools) is clearly a commodity, but the web scraper that lets you compare Apple to Orange in real time may be what gives you the edge. And the first-order analysis of that data is even more likely to be what lifts you above the competition.
Thanks for the effort you're putting into this!
The constraint is largely hardware. The incremental post-training done via transfer learning generally isn't broadly applicable across use cases.
MoEs would work well with this paradigm. The whole point is to have discrete, fully separate experts, so if you train on one task and I train on another, our patches likely won't touch the same experts even without any special tricks. You could even go so far as to patch the dispatch layer and plug in brand-new experts. MoEs would be able to accumulate lots of patches and merge them with little difficulty. If this paradigm catches on, it might well justify MoEs on its own, regardless of the touted benefits of more efficient training and much cheaper forward passes.
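To make that concrete, here's a toy sketch of disjoint-expert patching, assuming a patch is stored as a dict of per-expert weight deltas. The names (MoELayer, apply_patch) are made up for illustration, not from any real library:

    import torch
    import torch.nn as nn

    class MoELayer(nn.Module):
        def __init__(self, dim=64, num_experts=8):
            super().__init__()
            self.router = nn.Linear(dim, num_experts)      # the dispatch layer
            self.experts = nn.ModuleList(
                nn.Linear(dim, dim) for _ in range(num_experts)
            )

        def forward(self, x):                              # x: (tokens, dim)
            expert_idx = self.router(x).argmax(dim=-1)     # top-1 routing
            return torch.stack(
                [self.experts[int(i)](t) for t, i in zip(x, expert_idx)]
            )

    def apply_patch(layer, patch):
        # patch: {expert_index: weight_delta}; a task-specific patch
        # only touches the few experts that task actually exercised
        with torch.no_grad():
            for i, delta in patch.items():
                layer.experts[i].weight += delta

    layer = MoELayer()
    # two patches trained independently on different tasks: they hit
    # disjoint experts, so "merging" is just applying both, no conflicts
    patch_a = {0: 0.01 * torch.randn(64, 64), 3: 0.01 * torch.randn(64, 64)}
    patch_b = {5: 0.01 * torch.randn(64, 64)}
    apply_patch(layer, patch_a)
    apply_patch(layer, patch_b)
    out = layer(torch.randn(10, 64))                       # (10, 64)

Since each delta lands on a different expert's weights, merging is trivial; a conflict only arises if two patches hit the same expert index, and even then you could plug in a fresh expert instead.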
Perceiver would have more trouble. Perceiver is like an RNN for Transformers: relatively few weights, applied repeatedly and intensively to a small latent that encodes the knowledge about the input. Even with tricks, patches are going to fight over how to change those shared weights and the knowledge they encode. A few patches might work, but a lot of them will get ugly.
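For contrast, a toy sketch of Perceiver-style weight sharing (hypothetical class, plain PyTorch) shows where the collision happens:

    import torch
    import torch.nn as nn

    class TinyPerceiver(nn.Module):
        def __init__(self, dim=64, latent_len=16, steps=8):
            super().__init__()
            self.latent = nn.Parameter(torch.randn(latent_len, dim))
            # ONE attention block, reused at every step -- this is the
            # RNN-like part, and it's where every patch must land
            self.attend = nn.MultiheadAttention(dim, num_heads=4,
                                                batch_first=True)
            self.steps = steps

        def forward(self, inputs):                 # inputs: (batch, seq, dim)
            z = self.latent.expand(inputs.size(0), -1, -1)
            for _ in range(self.steps):
                # every pass reuses self.attend's weights, so a patch to
                # them changes all steps at once; two task patches can't
                # avoid each other the way disjoint experts can
                z, _ = self.attend(z, inputs, inputs)
            return z

    model = TinyPerceiver()
    latents = model(torch.randn(2, 100, 64))       # (2, 16, 64)

Every refinement step runs through the same attention weights, so independently trained deltas have to edit the exact same tensors and will interfere.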