stingraycharles 2mo ago

One expert is 17B, but more than one expert can be active at any time. I believe it’s actually more like 80B active.

zozbot234 2mo ago

I don't think this is correct. "Active parameters" is quite unambiguous: it means the sum of all active experts plus shared parameters.

fouc 2mo ago

Looks like they meant “effective dense size”, which is the square root of total params × active params, so in this case sqrt(397 × 17) ≈ 82.
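
The arithmetic above can be checked with a short sketch. It assumes the figures quoted in this thread (397B total, 17B active) and treats "effective dense size" as the geometric mean of the two, which is a rule of thumb rather than an exact equivalence. The function name is made up for illustration.

```python
import math


def effective_dense_size(total_params_b: float, active_params_b: float) -> float:
    """Geometric mean of total and active parameter counts (both in billions)."""
    return math.sqrt(total_params_b * active_params_b)


# Figures quoted in the thread: 397B total parameters, 17B active.
size = effective_dense_size(397, 17)
print(f"effective dense size: roughly {size:.0f}B")
```

The result is a heuristic for comparing a mixture-of-experts model to a dense one; it is not the number of parameters actually used per token.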

zozbot234 2mo ago

But the claim that "one expert is 17B" is incorrect. Experts are picked with per-layer granularity (expert 1 for layer X may well be entirely unrelated to expert 1 for layer Y), and the individual layer-experts are tiny. The writeup for the original experiment is very clear on this.
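
A minimal sketch of how per-layer routing feeds into the "active parameters" figure, assuming a toy model where every layer has the same number of equally sized experts and a fixed top-k. The class, field names, and numbers are all hypothetical and do not describe the model discussed in this thread.

```python
from dataclasses import dataclass


@dataclass
class ToyMoEConfig:
    # All values are hypothetical; they only illustrate the bookkeeping.
    num_layers: int
    experts_per_layer: int
    top_k: int              # experts routed to per token, chosen separately in each layer
    params_per_expert: int  # size of one small layer-level expert
    shared_params: int      # attention, embeddings, and anything always used


def total_params(cfg: ToyMoEConfig) -> int:
    """Every expert in every layer, plus the shared parameters."""
    return cfg.shared_params + cfg.num_layers * cfg.experts_per_layer * cfg.params_per_expert


def active_params(cfg: ToyMoEConfig) -> int:
    """Only the top-k experts in each layer, plus the shared parameters."""
    return cfg.shared_params + cfg.num_layers * cfg.top_k * cfg.params_per_expert


toy = ToyMoEConfig(num_layers=4, experts_per_layer=8, top_k=2,
                   params_per_expert=1_000, shared_params=5_000)
print(total_params(toy), active_params(toy))
```

In this toy setup the active count already sums every expert selected across all layers, so there is no single large "expert" to multiply by the number of active experts.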
