Model merging is usually done with different fine-tunes of the same base model. It generally doesn’t work if the base models are different, since their weights don’t live in a compatible parameter space.
One of the more surprising things is that you can actually repeat layers to improve model performance, i.e. stacking layers 1-1-2-2 instead of 1-2. That’s how you get merged models with higher parameter counts than the original.
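A minimal sketch of that kind of layer repetition, treating layers as opaque items in a list (the function name and layer labels are placeholders, not from any merging tool):

```python
# Sketch of depth-expansion by repeating layer ranges of a base model.
# "Layers" here are just labels; in a real merge they would be weight blocks.

def interleave_layers(layers, pattern):
    """Build a new, deeper stack by indexing into an existing one.

    `pattern` is a list of 0-based indices into `layers`; repeating an
    index duplicates that layer in the new stack.
    """
    return [layers[i] for i in pattern]

base = ["L0", "L1", "L2", "L3"]  # a 4-layer base model
# 1-1-2-2 style repetition: each layer appears twice in sequence,
# yielding an 8-layer model from a 4-layer base
deeper = interleave_layers(base, [0, 0, 1, 1, 2, 2, 3, 3])
```

The same idea with overlapping ranges (e.g. layers 0-20, then 10-30) is how community merges stretch, say, a 70B model into a 120B one.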
Cf. also the Universal Transformer: the same layer applied repeatedly, rather than a stack of distinct layers.
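The weight-tying idea reduces to applying one layer function in a loop; a toy sketch (the affine "layer" here is a stand-in for a real transformer block):

```python
# Universal-Transformer-style weight tying: one shared layer applied
# `depth` times, instead of `depth` distinct layers.

def universal_forward(x, layer, depth):
    """Apply the same `layer` function `depth` times."""
    for _ in range(depth):
        x = layer(x)
    return x

# Toy "layer": a fixed affine update standing in for a transformer block.
step = lambda x: 0.5 * x + 1.0
out = universal_forward(0.0, step, depth=4)
```

Parameter count stays that of a single layer no matter how deep you unroll, which is the mirror image of the layer-repetition trick above.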
The sparse version of that is basically MoE plus a stick-breaking halting mechanism: the halting weights keep gradients flowing to every layer (avoiding vanishing gradients) while letting the model decide, per token, to terminate its layer count early (of course with a training penalty favoring fewer layers, to reflect the compute savings).
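A sketch of the stick-breaking part, in the ACT-style formulation (the per-layer halting scores are assumed inputs; in practice they would come from a learned head): the probability of stopping at layer i is its halting score times the "remaining stick", so the weights over layers sum to one and every layer's output gets gradient.

```python
# Stick-breaking halting over layers: given per-layer halting scores
# h_i in (0, 1), the weight on layer i is h_i * prod_{j<i}(1 - h_j).

def stick_breaking_weights(halt_probs):
    weights, remaining = [], 1.0
    for h in halt_probs:
        weights.append(remaining * h)
        remaining *= (1.0 - h)
    # leftover mass is assigned to the final layer (the ACT "remainder"),
    # so the weights form a proper distribution over layers
    weights[-1] += remaining
    return weights

w = stick_breaking_weights([0.2, 0.5, 0.9])
# Expected depth under this distribution; penalizing this term during
# training rewards halting early, i.e. the compute savings.
expected_depth = sum((i + 1) * wi for i, wi in enumerate(w))
```

Because the loss touches every layer through its weight, early layers still receive gradient even when the model usually halts late.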