
spi 2y ago

Well, it depends what you mean by “best” :-) Removing the linear layer is the easiest solution. (Indeed, you can’t remove the embedding one. In theory you could replace embedding + linear with one-hot encoding + linear, adapting the input dimension of the linear layer to match your vocabulary size, but that would be identical to the embedding layer, just much slower and more memory hungry.)
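To make the one-hot point concrete, here is a minimal pure-Python sketch. The sizes, the `table` values, and the helper names `embed`, `one_hot`, and `linear` are all made up for illustration and do not come from the article under discussion. It only checks that a row lookup and a one-hot vector multiplied by the same weight matrix give the same vector, while the one-hot route touches every row of the matrix.

```python
# Hypothetical toy sizes, not taken from the article being discussed.
VOCAB_SIZE = 5
EMBED_DIM = 3

# Embedding table: one row of EMBED_DIM weights per token id.
table = [[float(10 * tok + j) for j in range(EMBED_DIM)]
         for tok in range(VOCAB_SIZE)]


def embed(token_id):
    """Embedding layer: a plain row lookup, no arithmetic."""
    return table[token_id]


def one_hot(token_id):
    """Vector of length VOCAB_SIZE with a single 1.0 at token_id."""
    return [1.0 if i == token_id else 0.0 for i in range(VOCAB_SIZE)]


def linear(x, weights):
    """Bias-free linear layer: x (length n) times weights (n rows, m columns)."""
    n_out = len(weights[0])
    return [sum(x[i] * weights[i][j] for i in range(len(x)))
            for j in range(n_out)]


# Same result for every token, but linear() does VOCAB_SIZE * EMBED_DIM
# multiplications where embed() does none.
same = all(embed(t) == linear(one_hot(t), table) for t in range(VOCAB_SIZE))
print(same)
```

The lookup is just the cheap way of computing that product, which is why swapping in one-hot + linear buys nothing.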

Alternatively, you could indeed put a ReLU or another nonlinearity between the embedding and the linear layer. You get a different model with more layers and more parameters; as the given dataset is pretty large, I’m quite sure this would improve accuracy, but without testing it’s impossible to know. Normalisation also acts as a kind of nonlinearity, but when the author adds it, it barely helps accuracy at all, so who knows. Sometimes (often) neural networks are counterintuitive…


mike_hearn 2y ago

Why does adding a ReLU create more layers and parameters? Isn't the total number of neurons the same?

hansvm 2y ago

The representational capacity of two consecutive linear layers is the same as that of a single, slightly different linear layer. Once you introduce a ReLU into the mix, the capacity is (up to a complexity set by the number of parameters) any "nice" function, including things like e^sin(x), not just linear functions. With two consecutive linear layers, many of the weights and computations are redundant.
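A small sketch of the collapse, under stated assumptions: the weights `W1` and `W2` and the inputs are hypothetical integers picked so the equality checks are exact, and biases are left out for brevity (with biases the two layers still merge into a single affine layer). It shows that two stacked linear layers equal one layer with weights `W1 @ W2`, and that putting a ReLU between them breaks that equivalence.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def relu(m):
    """Elementwise max(0, x)."""
    return [[max(0, v) for v in row] for row in m]


# Hypothetical integer weights, chosen only for illustration.
W1 = [[1, -2], [3, 1]]      # first linear layer, 2 inputs -> 2 outputs
W2 = [[2], [-1]]            # second linear layer, 2 inputs -> 1 output
W_merged = matmul(W1, W2)   # one layer that replaces both, 2 -> 1

inputs = [[1, 0], [0, 1], [1, 1], [-1, 2]]

stacked = matmul(matmul(inputs, W1), W2)          # linear, then linear
merged = matmul(inputs, W_merged)                 # single merged layer
with_relu = matmul(relu(matmul(inputs, W1)), W2)  # linear, ReLU, linear

print(stacked == merged)    # the two-layer stack is just one linear map
print(with_relu == merged)  # the ReLU version is a different function
```

In this toy case the ReLU version is not even additive: the outputs for `[1, 0]` and `[0, 1]` sum to 7, while the output for `[1, 1]` is 8, so no single weight matrix can reproduce it.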

mike_hearn 2y ago

Right, I get that: it increases learning capacity, but doesn't introduce more parameters? Like the GPU requirements would be the same beyond the extra cost of the ReLU operation itself, yes?
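For what it's worth, a quick way to check the parameter question is to count. The sketch below uses hypothetical sizes and a made-up `count_parameters` helper, and assumes the linear layer keeps the same input and output dimensions in both variants; a ReLU is an elementwise max(0, x) with nothing to learn, so it adds zero to the total.

```python
# Hypothetical sizes, chosen only for illustration.
VOCAB_SIZE, EMBED_DIM, N_OUT = 1000, 64, 10


def count_parameters(layers):
    """Sum learnable parameters over a list of (kind, shape) layer specs."""
    total = 0
    for kind, shape in layers:
        if kind == "embedding":
            vocab, dim = shape
            total += vocab * dim
        elif kind == "linear":
            n_in, n_out = shape
            total += n_in * n_out + n_out  # weights plus bias
        elif kind == "relu":
            pass  # elementwise max(0, x): nothing to learn
        else:
            raise ValueError(f"unknown layer kind: {kind}")
    return total


plain = [("embedding", (VOCAB_SIZE, EMBED_DIM)),
         ("linear", (EMBED_DIM, N_OUT))]
with_relu = [("embedding", (VOCAB_SIZE, EMBED_DIM)),
             ("relu", None),
             ("linear", (EMBED_DIM, N_OUT))]

n_plain = count_parameters(plain)
n_with_relu = count_parameters(with_relu)
print(n_plain, n_with_relu)
```

Under these assumptions the two counts match, so the only extra cost is the ReLU operation itself. The count would grow only if the layer sizes were changed along with adding the activation.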

