Now I'm wondering what would happen if a model like this were applied to other kinds of text generation, like chatbots. Maybe we could build genuinely useful bots if they could attend to the entire conversation so far plus additional metadata. Think customer service bots with access to customer data that learn to interpret questions, associate them with the customer's account information through the attention mechanism, and generate useful responses.
Is that what you just said? :)
Maybe as a POC we can try building a bot that generates relevant HN comments given the post and parent comments. Maybe I'm such a bot, how could I possibly know?
Attention is not new. Everyone uses it (for translation and many related tasks). It's very much the standard right now.
Avoiding recurrent connections inside the encoder or decoder is also not completely new. That came up when people tried to use only convolutions.
Google's Transformer was made public in June 2017 in the paper "Attention Is All You Need" (https://arxiv.org/abs/1706.03762), including TensorFlow code (https://github.com/tensorflow/tensor2tensor). Note that the new thing here is that it uses neither recurrence nor convolution but relies entirely on self-attention, combined with simple fully-connected layers, in both the encoder and the decoder.
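For anyone who wants to see what "entirely self-attention" means concretely, here's a minimal sketch of scaled dot-product self-attention, the core operation the paper builds on. The toy dimensions, random weights, and single-head setup are my own simplifications, not the paper's actual configuration (which uses multiple heads, masking, and learned embeddings):

```python
# Minimal single-head scaled dot-product self-attention (illustrative only).
# Shapes and random weights here are assumptions for the sketch, not the
# paper's real hyperparameters.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model). Project to queries, keys, values, then mix."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    # Every position attends to every other position in a single step:
    # no recurrence over time, no convolution over neighbors.
    weights = softmax(q @ k.T / np.sqrt(d_k))  # (seq_len, seq_len)
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.normal(size=(seq_len, d_model))          # toy "embedded" sequence
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # same shape as the input: (4, 8)
```

The point of the `q @ k.T` step is exactly the claim above: each output position is a weighted average over the whole input, computed in parallel, which is what lets the model drop recurrence entirely.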
DeepL directly compares their model to the Transformer in terms of performance (BLEU score) here: https://www.deepl.com/press.html
- GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 (LDC2007T24)
- GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2 (LDC2008T09)
- GALE Phase 1 Arabic Blog Parallel Text (LDC2008T02)
- GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 (LDC2009T03)
- GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2 (LDC2009T09)
- GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 (LDC2012T06)
- GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2 (LDC2012T14)
- GALE Phase 2 Arabic Newswire Parallel Text (LDC2012T17)
- GALE Phase 2 Arabic Broadcast News Parallel Text (LDC2012T18)
- GALE Phase 2 Arabic Web Parallel Text (LDC2013T01)
> To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using RNNs or convolution.