Alternatively, you could indeed put a ReLU or other non linearity between embedding and linear, you get a different model with more layers and more parameters, as the given dataset is pretty large I’m quite sure this would bring an improvement to accuracy, but without testing it’s rather impossible to know. Normalisation also acts as some kind of non linearity, but when the author adds it that barely helps accuracy at all, so who knows, sometimes (often) neural networks are counter intuitive…