ma2rten 5y ago

Fair enough, sparse usually means weights are sparse and not activations.

Obviously you can compare parameter counts if you really want to, but from a technical point of view training a densely activated model is a much bigger feat. Also, I have personally spoken to one of the authors of this paper, and they said sparsely activated models tend to do better on tasks that require knowledge, but not on tasks that require intelligence (e.g. GLUE).

cs702 5y ago

I agree, training a dense model with the same number of parameters would be a much bigger feat.

Otherwise, as I mentioned elsewhere on this page, we routinely describe the size of the human brain in terms of numbers of synapses (connections), even though they are sparsely activated. Only a small subset of your brain 'lights up' for a given input. Number of parameters (connections) is a perfectly sensible way to measure model size.

Anyway, I expect we will see both much larger sparsely and densely activated models going forward. We live in interesting times :-)
