Awesome, thanks for clarifying. So does the training optimize some property of the "semantic" layer immediately before the final emoji prediction layer? Or does it just optimize the accuracy of the emoji prediction directly?
And is the t-SNE projection shown in the article based on this same layer (the one before prediction)?