Better HN

0 points · mike_hearn · 2y ago · 0 comments

Right, I get that: it increases learning capacity, but doesn't introduce more parameters? Like the GPU requirements would be the same beyond the extra cost of the ReLU operation itself, yes?

spi · 2y ago

Yes, of course. Sorry, my write-up was confusing: I meant that "adding a ReLU between the two linear layers" (the second option) would result in more parameters than "directly removing the second linear layer" (the first option). My message just meant "I don't know which of the two options achieves the best trade-off between speed and quality". I didn't consider the option "leave it as it is in the blog post" because it is essentially equivalent to the first option (removing the linear layer) but slower, since, as you say, it has exactly the same number of parameters as the second option. So it definitely shouldn't be a "best" option.
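
To make the comparison concrete, here is a minimal sketch in plain Python. The layer sizes, the random weights, and the helper names (`linear`, `relu`, `collapse`, `n_params`) are all assumptions for illustration and are not taken from the blog post. It shows that two linear layers stacked with no activation compute the same function as one folded linear layer, and that inserting a ReLU leaves the parameter count unchanged.

```python
import random

def linear(x, W, b):
    """Apply y = W x + b to one input vector (plain Python lists)."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(v):
    """Element-wise ReLU; it has no trainable parameters."""
    return [max(0.0, a) for a in v]

def n_params(W, b):
    """Number of trainable values in one linear layer."""
    return len(W) * len(W[0]) + len(b)

def rand_layer(n_out, n_in, rng):
    """Hypothetical random layer, standing in for trained weights."""
    W = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    b = [rng.uniform(-1, 1) for _ in range(n_out)]
    return W, b

def collapse(W1, b1, W2, b2):
    """Fold two stacked linear layers into one: W = W2 W1, b = W2 b1 + b2."""
    hidden, n_in, n_out = len(W1), len(W1[0]), len(W2)
    W = [[sum(W2[i][k] * W1[k][j] for k in range(hidden)) for j in range(n_in)]
         for i in range(n_out)]
    b = [sum(W2[i][k] * b1[k] for k in range(hidden)) + b2[i] for i in range(n_out)]
    return W, b

rng = random.Random(0)
d = 4  # assumed width; all layers are d x d to keep the counts simple
W1, b1 = rand_layer(d, d, rng)
W2, b2 = rand_layer(d, d, rng)
x = [rng.uniform(-1, 1) for _ in range(d)]

# "Leave it as it is": two linear layers with nothing in between.
stacked = linear(linear(x, W1, b1), W2, b2)

# Equivalent single layer: same function, one matrix multiply.
Wc, bc = collapse(W1, b1, W2, b2)
single = linear(x, Wc, bc)

# Second option: ReLU between the layers, same weights, no longer collapsible.
with_relu = linear(relu(linear(x, W1, b1)), W2, b2)

params_two_layers = n_params(W1, b1) + n_params(W2, b2)  # with or without ReLU
params_one_layer = n_params(Wc, bc)
```

With these assumed sizes, the two-layer variants hold 40 parameters each and the single layer holds 20, while `stacked` and `single` agree up to floating-point error. When the layers are not all the same width the folded matrix can be rank-limited, which is one reason to say "essentially" rather than "exactly" equivalent.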

mike_hearn (OP) · 2y ago

Thank you!
