Model merging is usually done with different fine-tunes of the same base model. It generally doesn’t work if the base models are different, since their weights don’t live in a compatible parameter space.
One of the more surprising things is that you can actually repeat layers to improve model performance, i.e. stacking layers 1-1-2-2 instead of 1-2. That’s how you get merged models with higher parameter counts than the original.
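A minimal sketch of that kind of layer repetition, treating layers as opaque items in a list (the function name and layer labels are placeholders, not from any merging tool):

```python
# Sketch of depth-expansion by repeating layer ranges of a base model.
# "Layers" here are just labels; in a real merge they would be weight blocks.

def interleave_layers(layers, pattern):
    """Build a new, deeper stack by indexing into an existing one.

    `pattern` is a list of 0-based indices into `layers`; repeating an
    index duplicates that layer in the new stack.
    """
    return [layers[i] for i in pattern]

base = ["L0", "L1", "L2", "L3"]  # a 4-layer base model
# 1-1-2-2 style repetition: each layer appears twice in sequence,
# yielding an 8-layer model from a 4-layer base
deeper = interleave_layers(base, [0, 0, 1, 1, 2, 2, 3, 3])
```

The same idea with overlapping ranges (e.g. layers 0-20, then 10-30) is how community merges stretch, say, a 70B model into a 120B one.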
Cf. also the Universal Transformer: the same layer applied repeatedly, rather than a stack of distinct layers.
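The weight-tying idea reduces to applying one layer function in a loop; a toy sketch (the affine "layer" here is a stand-in for a real transformer block):

```python
# Universal-Transformer-style weight tying: one shared layer applied
# `depth` times, instead of `depth` distinct layers.

def universal_forward(x, layer, depth):
    """Apply the same `layer` function `depth` times."""
    for _ in range(depth):
        x = layer(x)
    return x

# Toy "layer": a fixed affine update standing in for a transformer block.
step = lambda x: 0.5 * x + 1.0
out = universal_forward(0.0, step, depth=4)
```

Parameter count stays that of a single layer no matter how deep you unroll, which is the mirror image of the layer-repetition trick above.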
The sparse version of that is basically MoE plus a stick-breaking halting mechanism: the halting weights keep gradients flowing to every layer (avoiding vanishing gradients) while letting the model decide, per token, to terminate its layer count early (of course with a training penalty favoring fewer layers, to reflect the compute savings).
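A sketch of the stick-breaking part, in the ACT-style formulation (the per-layer halting scores are assumed inputs; in practice they would come from a learned head): the probability of stopping at layer i is its halting score times the "remaining stick", so the weights over layers sum to one and every layer's output gets gradient.

```python
# Stick-breaking halting over layers: given per-layer halting scores
# h_i in (0, 1), the weight on layer i is h_i * prod_{j<i}(1 - h_j).

def stick_breaking_weights(halt_probs):
    weights, remaining = [], 1.0
    for h in halt_probs:
        weights.append(remaining * h)
        remaining *= (1.0 - h)
    # leftover mass is assigned to the final layer (the ACT "remainder"),
    # so the weights form a proper distribution over layers
    weights[-1] += remaining
    return weights

w = stick_breaking_weights([0.2, 0.5, 0.9])
# Expected depth under this distribution; penalizing this term during
# training rewards halting early, i.e. the compute savings.
expected_depth = sum((i + 1) * wi for i, wi in enumerate(w))
```

Because the loss touches every layer through its weight, early layers still receive gradient even when the model usually halts late.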