Btw, I'm interested to hear how well training with large one-hot encoded vectors scales. A paper someone pointed me to recently on HN suggested that it doesn't scale very well:
One-shot Learning with Memory-Augmented Neural Networks [https://arxiv.org/abs/1605.06065]
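To be concrete about what I mean: as far as I understand, multiplying a big one-hot vector through a weight matrix is the same computation as looking up one row of that matrix, so the question is really how the dense version scales as the vocabulary grows. A toy numpy sketch (made-up sizes):

    import numpy as np

    # Toy sizes, made up for illustration; a real vocabulary is more like 10^5-10^6 tokens.
    vocab_size, vector_dim = 10, 4
    W = np.random.randn(vocab_size, vector_dim)

    token_id = 7
    one_hot = np.zeros(vocab_size)
    one_hot[token_id] = 1.0

    # The dense version touches every row of W...
    dense_way = one_hot @ W
    # ...but it only ever selects one row, so an embedding lookup gives the
    # same vector without materialising the one-hot vector at all.
    lookup_way = W[token_id]

    assert np.allclose(dense_way, lookup_way)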
https://github.com/explosion/spaCy/tree/master/examples/kera...
This got dropped during editing... I'm updating the post to make it more prominent.
1. In Step (2), Bidirectional RNN: what are you making the forward/backward passes over? How do the tokens get turned into a "matrix"? What is the dimensionality of this matrix?
2. Step 3 is a bit unclear. Where do Parikh et al. get their two matrices from?
It would be nice to bring in some concreteness: talk about sentences, documents, etc. and how they map into this scheme.
Thanks!
I'll answer briefly about the Parikh et al. model.
1) Input: (ids1, ids2). These are integer-typed arrays of length len1 and len2 respectively.
2) sent1 = embed(ids1); sent2 = embed(ids2). Data is now real-valued arrays of shape (len1, vector_dim) and (len2, vector_dim) respectively. 300 is a common value for vector_dim, e.g. from the GloVe common crawl model.
3) sent1 = encode(sent1); sent2 = encode(sent2). Data is now real-valued arrays of shape (len1, fwd_dim+bwd_dim), (len2, fwd_dim+bwd_dim).
4a) attention = create_attention_matrix(sent1, sent2). This is a real-valued array of shape (len1, len2).
4b) align1 = soft_align(sent2, attention); align2 = soft_align(sent1, transpose(attention)). These are real-valued arrays of shape (len1, compare_dim) and (len2, compare_dim).
4c) feats1 = sum(map(compare(sent1, align1))); feats2 = sum(map(compare(sent2, align2))). These are real-valued arrays of shape (predict_dim,) and (predict_dim,).
5) class_id = predict(feats1, feats2)
The post describes steps 4a, 4b and 4c as a single operation that takes the two 2-dimensional sentence representations as input and outputs a single vector (obtained by concatenating the representations feats1 and feats2 in this description).
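To make the shapes concrete, here's a rough numpy sketch of the whole pipeline, with random matrices standing in for the trained embedding table, BiRNN encoder, and compare/predict layers (all sizes are made up for illustration, not the values from the paper or the post):

    import numpy as np

    rng = np.random.default_rng(0)

    # Made-up sizes; the real model uses GloVe vectors, a trained BiRNN,
    # and learned compare/predict layers.
    vocab_size, vector_dim = 1000, 300
    fwd_dim = bwd_dim = 64
    predict_dim, n_classes = 128, 3

    embed_table = rng.standard_normal((vocab_size, vector_dim))              # step 2
    W_encode = rng.standard_normal((vector_dim, fwd_dim + bwd_dim))          # step 3 stand-in
    W_compare = rng.standard_normal((2 * (fwd_dim + bwd_dim), predict_dim))  # step 4c
    W_predict = rng.standard_normal((2 * predict_dim, n_classes))            # step 5

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def embed(ids):                          # step 2: ids -> (len, vector_dim)
        return embed_table[ids]

    def encode(sent):                        # step 3: stand-in for the BiRNN, just to
        return sent @ W_encode               # show the (len, fwd_dim + bwd_dim) shape

    def soft_align(other_sent, attention):   # step 4b: weighted average of the *other* sentence
        return softmax(attention, axis=-1) @ other_sent

    def compare(sent, aligned):              # step 4c: compare token-by-token, sum over tokens
        pairs = np.concatenate([sent, aligned], axis=-1)
        return (pairs @ W_compare).sum(axis=0)            # (predict_dim,)

    def predict(feats1, feats2):             # step 5
        return int(np.argmax(np.concatenate([feats1, feats2]) @ W_predict))

    ids1 = rng.integers(0, vocab_size, size=12)       # step 1: (len1,) integer ids
    ids2 = rng.integers(0, vocab_size, size=9)        #         (len2,) integer ids

    sent1, sent2 = encode(embed(ids1)), encode(embed(ids2))
    attention = sent1 @ sent2.T                       # step 4a: (len1, len2)
    align1 = soft_align(sent2, attention)             # (len1, fwd_dim + bwd_dim)
    align2 = soft_align(sent1, attention.T)           # (len2, fwd_dim + bwd_dim)
    feats1, feats2 = compare(sent1, align1), compare(sent2, align2)
    class_id = predict(feats1, feats2)                # e.g. entailment / contradiction / neutral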
Also, really well done on the site design. Love the graphics, font, layout and 'progress bar' animation at the top. Very nice UX overall.
I understand matrix multiplication, but it seems that (some of) these matrix-to-vector calculations are actually trained as part of the neural net... and exactly how that works is what I can't figure out from articles like this.
Thanks! I'm planning to make two follow-up posts, one on each of the systems, that go through those details. I blurred them out in this post because I wanted to get across this more abstract story about the data types and transformations.
There are lots of good posts about attention mechanisms. The WildML post is good, as is Chris Olah's post. Bidirectional RNNs are a little bit less well covered, but the idea is not too difficult to understand given a single RNN (or LSTM, GRU, etc.).
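To sketch the bidirectional part: it's just an ordinary RNN run left-to-right, another run right-to-left, and the two per-token states concatenated. In toy numpy form (untrained random weights, made-up sizes):

    import numpy as np

    rng = np.random.default_rng(0)

    def rnn(inputs, W_x, W_h):
        # A plain single-direction RNN: one hidden state per token.
        h = np.zeros(W_h.shape[0])
        states = []
        for x in inputs:
            h = np.tanh(x @ W_x + h @ W_h)
            states.append(h)
        return np.stack(states)                          # (len, hidden_dim)

    def birnn(inputs, fwd_params, bwd_params):
        # Bidirectional = the same thing run left-to-right and right-to-left,
        # with the two states for each token concatenated.
        fwd = rnn(inputs, *fwd_params)
        bwd = rnn(inputs[::-1], *bwd_params)[::-1]
        return np.concatenate([fwd, bwd], axis=-1)       # (len, fwd_dim + bwd_dim)

    vector_dim, hidden_dim, length = 300, 64, 12         # made-up sizes
    tokens = rng.standard_normal((length, vector_dim))   # e.g. the embedded sentence from step 2
    fwd_params = (rng.standard_normal((vector_dim, hidden_dim)),
                  rng.standard_normal((hidden_dim, hidden_dim)))
    bwd_params = (rng.standard_normal((vector_dim, hidden_dim)),
                  rng.standard_normal((hidden_dim, hidden_dim)))
    print(birnn(tokens, fwd_params, bwd_params).shape)   # (12, 128)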
You should also read the papers :). That's how most people doing ML stay up to date, including people building practical things, not just researchers. Academia is competitive and writing is cheap relative to experimentation, so the papers tend to be written clearly. The deep learning literature is really pretty easy to follow.
What do you think about dilated convolutional encoder/decoder networks [1]? Useful for NLP beyond machine translation?
[1] https://arxiv.org/abs/1610.10099, https://github.com/paarthneekhara/byteNet-tensorflow
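For context, by "dilated convolutions" I mean 1d conv layers whose filter taps are spaced increasingly far apart, so the receptive field grows exponentially with depth while the sequence length is preserved. A rough numpy sketch of the idea (made-up sizes, untrained weights), not the ByteNet architecture itself:

    import numpy as np

    def dilated_conv1d(x, W, dilation):
        # Causal dilated convolution over a (length, channels) sequence:
        # output position t combines x[t] and x[t - dilation] (kernel size 2).
        length, channels = x.shape
        kernel_size = W.shape[0]
        pad = np.zeros(((kernel_size - 1) * dilation, channels))
        padded = np.concatenate([pad, x])
        out = np.zeros((length, W.shape[2]))
        for t in range(length):
            out[t] = sum(padded[t + k * dilation] @ W[k] for k in range(kernel_size))
        return out

    rng = np.random.default_rng(0)
    x = rng.standard_normal((32, 16))                 # (length, channels), made-up sizes
    for dilation in (1, 2, 4, 8):                     # receptive field doubles per layer
        W = rng.standard_normal((2, x.shape[1], 16))  # kernel size 2, 16 output channels
        x = np.tanh(dilated_conv1d(x, W, dilation))
    print(x.shape)                                    # (32, 16): length preserved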
I just wish I understood the rest of the article...