But the claim that "one expert is 17B" is incorrect. Experts are selected independently at each layer (expert 1 in layer X may well be entirely unrelated to expert 1 in layer Y), and each individual layer-expert is tiny by comparison. The writeup for the original experiment is very clear on this.
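
To make the per-layer point concrete, here is a rough toy sketch of how routing could look. Everything in it is an assumption for illustration: the dimensions, the top-1 dot-product router, and the two-matrix feed-forward expert are hypothetical and not taken from any real model or from the writeup.

```python
import random

# Hypothetical toy dimensions, chosen only for illustration.
NUM_LAYERS = 4
NUM_EXPERTS = 8
TOP_K = 1
D_MODEL = 16
D_FF = 32


def make_router(rng):
    # One router per layer: a NUM_EXPERTS x D_MODEL weight matrix.
    return [[rng.gauss(0.0, 1.0) for _ in range(D_MODEL)] for _ in range(NUM_EXPERTS)]


def route(router, token, top_k=TOP_K):
    # Score every expert with a dot product and keep the top_k indices.
    scores = [sum(w * x for w, x in zip(row, token)) for row in router]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


def expert_path(routers, token):
    # Each layer decides with its own router, so index 1 in one layer
    # has no connection to index 1 in another layer.
    # Simplification: a real model would transform the hidden state
    # between layers; the same vector is reused here for brevity.
    return [route(router, token) for router in routers]


def params_per_layer_expert(d_model=D_MODEL, d_ff=D_FF):
    # Assumes a plain feed-forward expert: up and down projection, no biases.
    return 2 * d_model * d_ff


rng = random.Random(0)
routers = [make_router(rng) for _ in range(NUM_LAYERS)]
token = [rng.gauss(0.0, 1.0) for _ in range(D_MODEL)]

path = expert_path(routers, token)
print("expert chosen at each layer:", path)
print("params in one layer-expert:", params_per_layer_expert())
print("expert params active for this token:",
      NUM_LAYERS * TOP_K * params_per_layer_expert())
```

The thing to notice is that the "active" count is a sum over many small per-layer picks, not the size of one monolithic expert.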