Now I'm wondering what would happen if a model like this were applied to other kinds of text generation, like chatbots. Maybe we could build genuinely useful bots if they could attend to the entire conversation so far plus additional metadata. Think customer service bots with access to customer data that learn to interpret questions, associate them with the customer's account information through the attention mechanism, and generate useful responses.
Is that what you just said? :)
Maybe as a POC we can try building a bot that generates relevant HN comments given the post and parent comments. Maybe I'm such a bot, how could I possibly know?
Attention is not new. Everyone uses it (for translation and many related tasks). It's very much the standard right now.
Avoiding recurrent connections inside the encoder or decoder is also not completely new. That came up when people tried to use only convolutions.
Google's Transformer was made public in June 2017 in the paper "Attention Is All You Need" (https://arxiv.org/abs/1706.03762), including TensorFlow code (https://github.com/tensorflow/tensor2tensor). Note that the new thing here is that it uses neither recurrence nor convolution but relies entirely on self-attention, combined with simple fully-connected layers, in both the encoder and the decoder.
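For anyone who wants to see what "entirely self-attention" means concretely, here's a minimal sketch of scaled dot-product self-attention, the core operation the paper builds on. The toy dimensions, random weights, and single-head setup are my own simplifications, not the paper's actual configuration (which uses multiple heads, masking, and learned embeddings):

```python
# Minimal single-head scaled dot-product self-attention (illustrative only).
# Shapes and random weights here are assumptions for the sketch, not the
# paper's real hyperparameters.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model). Project to queries, keys, values, then mix."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    # Every position attends to every other position in a single step:
    # no recurrence over time, no convolution over neighbors.
    weights = softmax(q @ k.T / np.sqrt(d_k))  # (seq_len, seq_len)
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.normal(size=(seq_len, d_model))          # toy "embedded" sequence
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # same shape as the input: (4, 8)
```

The point of the `q @ k.T` step is exactly the claim above: each output position is a weighted average over the whole input, computed in parallel, which is what lets the model drop recurrence entirely.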
DeepL directly compares their model to the Transformer in terms of performance (BLEU score) here: https://www.deepl.com/press.html
- GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 (LDC2007T24)
- GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2 (LDC2008T09)
- GALE Phase 1 Arabic Blog Parallel Text (LDC2008T02)
- GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 (LDC2009T03)
- GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2 (LDC2009T09)
- GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 (LDC2012T06)
- GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2 (LDC2012T14)
- GALE Phase 2 Arabic Newswire Parallel Text (LDC2012T17)
- GALE Phase 2 Arabic Broadcast News Parallel Text (LDC2012T18)
- GALE Phase 2 Arabic Web Parallel Text (LDC2013T01)
> To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using RNNs or convolution.