Exciting times.
If you don't update your database and indices, they are great. But updating is exactly what's tempting to do when you do some machine learning (especially if you know that people with deeper pockets will do so).
Typically you will have a neural network, you run it on your dataset, it produces a new dataset of embeddings, you index them, and you use this index to train a new neural network, and you repeat the loop, hopefully improving results along the way.
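That loop can be sketched roughly like this (the `embed`/`build_index`/`train_new_model` helpers are hypothetical stand-ins, not a real library API):

```python
# Toy sketch of the embed -> index -> retrain loop described above.
import numpy as np

rng = np.random.default_rng(0)
dataset = rng.normal(size=(1000, 64)).astype(np.float32)  # raw examples as vectors

def embed(model, data):
    # "run the network on the dataset": here, just a linear projection
    return data @ model

def build_index(embeddings):
    # a real system would use FAISS/ScaNN; here we just keep the matrix
    # for brute-force nearest-neighbor lookups
    return embeddings

def train_new_model(index, dim_in, dim_out):
    # stand-in for training the next network against the indexed embeddings
    return rng.normal(size=(dim_in, dim_out)).astype(np.float32)

model = rng.normal(size=(64, 32)).astype(np.float32)
for generation in range(3):
    embeddings = embed(model, dataset)      # new dataset of embeddings
    index = build_index(embeddings)         # re-index everything
    model = train_new_model(index, 64, 32)  # train the next model on the index
```

Note that every iteration rewrites the whole index, which is exactly the write load the parent is worried about.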
An NVMe SSD can write at 6 GB/s but can only endure ~800 TB of writes; that's about 37 hours of lifetime at max speed.
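The arithmetic checks out:

```python
# Sanity check: 800 TB of rated write endurance at a sustained 6 GB/s.
endurance_bytes = 800e12          # ~800 TB total-bytes-written rating
write_speed = 6e9                 # 6 GB/s sustained sequential write
lifetime_hours = endurance_bytes / write_speed / 3600
print(lifetime_hours)             # ~37 hours
```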
"Only" 825 GB actually: https://pile.eleuther.ai/
A not-insignificant fraction of that is definitely copyrighted material, though, which raises some interesting questions when switching to a model of distributing "a smaller trained model plus the original raw training data". (Though it seems that the team behind GPT-J are clearly happy to distribute their full set of data anyway, and seem to be enough under the radar to not attract the wrong sort of attention, at least for now.)
48GB VRAM? 48+ gigabytes of system RAM is cheap; 48 gigabytes of RAM on a GPU is still painfully expensive.
Still superb, though; there's no reason you can't use other GOFAI tools instead of a static database, to trigger expert systems or formalized reasoning.
Like, could we assume that for humans it's also faster to search for information on Wikipedia, or would it be faster to recall already-read Wikipedia from memory? Although with humans, stored information decays. (In a way, a human form of garbage collection :P)
(This is a different blogpost, but does not seem to add over the original)
Edit: following derac's comment see https://news.ycombinator.com/item?id=29646112 for RETRO
[0] https://deepmind.com/research/publications/2021/improving-la...
GitHub Copilot is definitely GPT-3-based and is seeing real-world use: https://copilot.github.com
Transformers are state of the art for many tasks so they are likely to be used for "intelligent" processing of text or speech data, but due to practical limitations you are probably interacting with them mostly through web services.
If anything, it's being used in force for social media marketing, where you're trying to say "buy this thing" in different ways every day.
Written Scots: In the written mode, Scots spelling remains variable. Attempts to make it more consistent, notably the Scots Style Sheet produced by the Makars’ Club in 1947 or the Recommendations for Writers in Scots published by the Scots Language Society in 1985, have had at best only limited success, competing with other systems that have been developed to represent more closely localized varieties of spoken Scots.
When your reference text says the language isn't yet well captured in a single standard spelling, you'd better believe the wiki page is a hot mess.
Instead of training GPT-3 with 175B weights, you train a 25x smaller model and allow it to retrieve useful snippets from a large text index as additional information.
This solves the problem of very large models and the problem of updating an already trained model, as you can swap the text corpus with a newer one. The model learns mostly syntax, burning less trivia in its weights than a regular LM as it can simply copy the relevant information from the index.
This development was bound to happen as large LMs are expensive to use and it was an obvious idea. We've had these semantic search text indices for a few years already[1], they just weren't combined with text generation.
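The retrieval half of such a setup can be sketched in a few lines. This is not RETRO's actual architecture (which uses frozen BERT embeddings and chunked cross-attention inside the transformer); the hashing "embedding" below is a toy stand-in, just to show the swappable-index idea:

```python
# Toy retrieval-augmented generation: embed the query, find the nearest
# snippets in the index, and prepend them to what the (smaller) LM sees.
import hashlib
import numpy as np

def embed(text, dim=64):
    # deterministic bag-of-words hashing embedding (toy stand-in)
    v = np.zeros(dim)
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        v[h % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

corpus = [
    "The Pile is an 825 GB dataset of diverse text.",
    "NVMe drives have limited write endurance.",
    "retrieval lets a small model look up facts instead of memorizing them.",
]
index = np.stack([embed(doc) for doc in corpus])   # the swappable text index

def retrieve(query, k=2):
    scores = index @ embed(query)                  # cosine similarity (unit vectors)
    top = np.argsort(scores)[::-1][:k]
    return [corpus[i] for i in top]

query = "how does retrieval help small language models?"
context = retrieve(query)
prompt = "\n".join(context) + "\n" + query         # what the LM would condition on
```

Swapping the corpus for a newer one just means re-running the `embed`/`np.stack` step; no retraining of the model itself.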
Surely this is a function of location? I understand the U.S.-English term “person of color” to be convoluted language for “not white”. One simple thing I notice is that if I search for, say, “child” on Google Image Search, the images indeed tend to look like what one would expect from the average inhabitant of an English-speaking nation; when I search “子供”, I indeed mostly see what I would expect from Japan. Similarly, if I search for “house”, what I find tends to look like a house most likely situated in the Netherlands; with “บ้าน”, it more closely resembles stereotypical Thai architecture.
I would assume that AIs made in, say, Japan would yield different results.
The idea that AI itself can be biased (as opposed to the dataset) also has some significant problems. The lead of Facebook AI Research got canceled on Twitter because he pointed out that it's the bias in the dataset used to train the AI that results in bias in the AI and not the AI itself that's biased. I'd also question whether Gebru is a "widely respected leader in AI ethics research". Model interpretability is not even close to a solved problem so just because you can demonstrate some correlation between images of black people and worse performance does not imply that "black person" is a causative factor. It could literally be dataset distribution or image contrast or any number of other plausible explanations that are easily fixable by an ML engineer.
Claiming that "the AI is not biased, the training set is" is like saying "this running program isn't buggy, it's just the source code that is buggy".
And normally, that's harmless - as you said, you'd expect an AI to find pictures of houses from the region/culture you are searching from. But in a multi-cultural/multi-ethnic society, searching for "people" and seeing only what is considered the "majority" has a whole different set of ethical implications.
Identifying and ideally remediating such issues is why ethics research is so sorely needed.
I am not, actually; I am searching for “huis”, not “Nederlands huis”. I'd expect the results I get from the former only when searching for the latter.
I'd actually expect “house” and “huis” to return similar results from a good search engine. Obviously this is not easily possible with how it is trained on corpora in a specific language, but from a usability standpoint I think this is undesirable: if I specifically want Dutch houses I can always add that term as a specification, whereas there is no way to simply search for houses, wherever they might be, in Dutch, or English, or Thai, or any other language.
That is to say, I'm not arguing that there is no problem; I'm arguing that the problem is highly dependent upon location, and that the article should not take such a U.S.A.-centric stance and act as though the rest of the world does not exist.
This is probably why there is more variance when searching for English terms as well, English being a lingua franca. If I search “house” I do see some styles of architecture not commonly found in Anglo-Saxon nations, whereas all occurrences of “huis” do seem to be situated in the Netherlands.
I can only quote Joy Buolamwini on this:
“To fail on one in three, in a commercial system, on something that’s been reduced to a binary classification task, you have to ask, would that have been permitted if those failure rates were in a different subgroup?”
Come on, if you've worked at any large company using ML you know model performance is literally just taking the average accuracy/ROC/precision/etc over your training dataset plus some hold out sets. Then you track proxy metrics like engagement to see if your model actually works in production. At no point does race come into the equation. Naturally, if your choice of subgroup happens to not be a large proportion of either the dataset or the userbase then you don't see the poor performance on that subgroup show up in your metrics so you don't care to fix it.
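The mechanism is easy to see with made-up numbers (the subgroup sizes and accuracies below are purely illustrative):

```python
# How aggregate accuracy hides poor performance on a small subgroup.
n_major, n_minor = 900, 100         # subgroup sizes in the eval set
acc_major, acc_minor = 0.95, 0.60   # per-subgroup accuracy

overall = (n_major * acc_major + n_minor * acc_minor) / (n_major + n_minor)
print(overall)  # ~0.915 -- looks fine, despite 60% accuracy on the minority group
```

Unless you slice the metric by subgroup, the 60% never shows up on any dashboard.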
As humans age we apparently lose the former but compensate with the latter as best we can.
That is clearly not possible, so it can't be what they are doing.
Rather than diffusely encoding that knowledge in a massive number of self-organized layers of weights, it is explicitly encoded. The remaining network can "focus" on mapping input to the relevant information stored in that database, and on extracting/interpolating/extrapolating that information based on the current context to generate useful output.