T0* – Series of encoder-decoder models trained on a large set of different tasks (opens in new tab)

(huggingface.co)

218 pointsjulien_c4y ago153 comments

153 comments

[Disclaimer: I am an author of the above paper and played a rather minimal role. I am also a prominent member of EleutherAI.]

"Instruction-tuning" is clearly in the air. Simultaneous work at Google (released less than two weeks ago) on a model they call FLAN can be found here: https://ai.googleblog.com/2021/10/introducing-flan-more-gene...

EleutherAI attempted to do something similar several months ago, but didn't succeed: https://blog.eleuther.ai/tuning-on-eval-harness/

A careful analysis of the similarities and differences between the three approaches would be likely highly beneficial to the community.

djoldman4y ago

Hi stella. Given this paragraph in the paper:

> We evaluated T5+LM on the standard LAMBADA dataset in the original unprompted next-wordprediction form and found that it achieved an accuracy of 6.2%. This is substantially below the accuracy of 72.5% achieved by the comparably-sized GPT-3-13B variant. T0 did not fare much better, achieving only 18.7%. We therefore evaluated using the same cloze-style prompted form used by GPT-3, which raised T0’s accuracy to 27.8%. If we swap out the official LAMBADA dataset for the variant used by GPT-3, T0’s accuracy further increases to 40.5% and T5+LM achieves 10.7%. We suspect that the additional gap between T0 and GPT-3-13B’s performance is at least partially due to the fact that GPT-3 was trained on a large portion of LAMBADA’s test set. Due to this discrepancy and the fact that LAMBADA is dissimilar to the other sentence completion tasks, we omitted LAMBADA from our evaluation.

I had two questions:

1. Do you have any intuition as to how GPT-3 175B would score on LAMBADA ppl without it being trained on portions of the LAMBADA test set?

2. It's encouraging to see such high marks on these language tasks. Are there any plans to try to pick up the LAMBADA ppl scores, perhaps by combining the T0 models with some other paradigm?

craffel4y ago

(different author, not Stella)

To your first question: Unpublished experiments done by the BigScience architecture and scaling WG suggest that training on book corpus yields a boost of 10-15% accuracy on LAMBADA.

To your second question: LAMBADA specifically is an interesting task, but it's a bit unsatisfying to work on since there are so many conflating factors in prior work on the dataset. We are planning quite a few follow-up projects along this general line of work (prompted multi-task training), though.

ZeroCool2u4y ago

Just want to say thanks for taking the time to put the model on HuggingFace! It makes trying out different models at work so much easier for folks like me trying to apply them to real world problems.

Lokinew4y ago

Just in case this question isn't to far out of your way. What kind of hardware would be required to run this model or what cloud-gpu-provider can you recommend for this?

srush4y ago

from @craffel: It's possible to run inference on a single Google Cloud TPU v3-8 device or on a server with 4x 32GB v100 GPUs. Hugging Face also has an inference API for any model on the Hub: https://api-inference.huggingface.co/docs/python/html/index....

GPUboy4y ago

Thank you for this! Could you or anyone available please explain how to get it to generate javascript like with GPT-3? For example, with gpt-3 you can just ask it to "generate a javascript code that collects all the links on the page," but that does not work with the demo prompt on hugging face.

Does it allow training prompts or is that done through more fine tuning in this model?

tibbar4y ago

Code generation is not supported due to the tokenization strategy.

Mizza4y ago

The hosted demo has the default query, "How many hydrogen atoms are in a water molecule?" It said "two".

I asked it, "How many oxygen atoms are in a water molecule?". It said "two".

jcims4y ago

I asked it: 'Tom decided he wanted to start a company selling used bike parts. He named it '

it said: 'Bicycle Parts Exchange'

Tried again with 'used lawnmower parts' and it said 'Green Thumb'

computer parts: 'Tom's Parts' (which make me chuckle)

used diapers: 'Diapers.com'

May not understand chemistry but it's still pretty cool

jcims4y ago

? Before I remove the compressor from an air conditioner I need to

: disconnect the power to the unit.

? Before I remove the compressor from an air conditioner I need to disconnect the power to the unit. Then i need to

: disconnect the refrigerant lines from the compressor.

jcims4y ago

? vi or emacs?

: vi

Sold!

? waterboarding or emacs?

: waterboarding

Doubleplusgood

midasuni4y ago

“ I accidentally loaded vi by mistake. How do I quit?”

“ press ctrl-c”

Perhaps it couldn’t cope with the concept of accidentally loading the best text editor

zimpenfish4y ago

"I don't have the proper tool to whisk a bowl of eggs. What should I use instead? Choose between a goat, a weasel and a pair of elephants."

"a pair of elephants"

Unwieldy but I guess less sticky than a weasel or goat.

SamBam4y ago

Interestingly, it answered every one of these right:

"What should I use to whisk a bowl of eggs? A fish or a fork?"

"A fork"

Repeat with "...A spoon or a duck?" "A chopstick or a goat?" "A cat or an electric whisk?"

YeGoblynQueenne4y ago

It's a language model. It assigns probabilities to tokens in a sequence. You give it a number of options and it responds with the one that it assigns the highest probability to. If there's nothing in the options you give it that makes sense in the context of your test phrase, then it will return something that doesn't make sense. If some of your options make sense, it might return something that makes sense, or not.

So if you put it in a situation where nothing it outputs makes sense (to you) then none of its output will make sense. But that's not fair to the poor model.

2 more replies

Mordisquitos4y ago

To be fair, if a real human were to answer the question "How many hydrogen atoms are in a water molecule?" time and time again, it would be very easy for them to accidentally reply "two" when asked the same question about oxygen.

The real question is, after the model mistakenly replied "two" to your question, did it also internally trigger the neurons for "Wait a minute..." while inhibiting output?

hervature4y ago

Running the model multiple times doesn't reinforce the model. In general, you should not anthropomorphize algorithms as human cognition does not give any bearing on how algorithms work.

Scene_Cast24y ago

It can. Check out "zero shot learning" -> both sentences would be part of a single "evaluation", and the first sentence would prime for the output of the second. (You basically combine multiple "evaluations" into one, and context is held in tensors / blobs)

https://towardsdatascience.com/zero-and-few-shot-learning-c0...

1 more reply

drran4y ago

How are in a water molecule?

> two hydrogen atoms

water molecule?

> isotopes of hydrogen

How water molecule?

> is arranged in a tetrahedral structure

pvillano4y ago

"How many hydrogen atoms are there?"

"a total of 84"

twic4y ago

Nobel Prize if true.

chrisco2554y ago

42 x 2, can't be a coincidence.

tomudding4y ago

"What is the Answer to the Ultimate Question of Life, The Universe, and Everything?"

"The Ultimate Question"

smnrchrds4y ago

It should replace "a total of" with "at least" and it will be golden.

throwaway8899004y ago

I remember reading some idea that there's only one hydrogen atom in the entire universe somewhere so it's not too far off from that.

remcob4y ago

It's the 'one-electron universe' theory [0]. In short: there is one electron that keeps going back and forth in time to play the role of every electron we see. A particle 'going backwards in time' is mathematically identical to its anti-particle, which we know exists, so the whole idea isn't too far fetched.

I don't think it is falsifiable, so not really scientific, but a fun theory to believe in.

[0]: https://en.wikipedia.org/wiki/One-electron_universe

Aivean4y ago

Hosted demo, "Logic puzzle" example: "On a shelf, there are five books: a gray book, a red book, a purple book, a blue book, and a black book. The red book is to the right of the gray book. The black book is to the left of the blue book. The blue book is to the left of the gray book. The purple book is the second from the right.

Which book is the leftmost book?"

Answer: > The black book

Same puzzle with the question "Which book is the rightmost book?"

Answer: > The black book

I tried to ask GPT-3 and Codex this problem, they could not solve it either.

nsxwolf4y ago

There are apparently also two carbon atoms in a water molecule. But only one Donald Trump.

jstx14y ago

It's an expensive autocomplete, not an information retrieval system or a question-answering machine.

tiborsaas4y ago

What's the difference? Answering a question can be considered "autocomplete".

RyEgswuCsn4y ago

Parent surely meant spellcheck autocompletion.

1 more reply

6gvONxR4sf7o4y ago

Not true. Take a look at the paper and benchmarks. The point of the thing is that it does well on a number of NLP tasks by being an expensive autocomplete. As people demonstrate in this thread, it still has significant flaws.

ever14y ago

And there are always 2 hydrogen/oxygen atoms in any molecule

swalsh4y ago

Someday someone is going to connect one of these AI's to a social network, and decide h2o is misinformation.

MrStonedOne4y ago

Your post expressing hesitancy towards machine learning is not backed by scientific consensus and has been removed. Please receive a research grant before expressing opinions.

sushsjsuauahab4y ago

Woah woah, are you questioning science? AI research is a serious field and they're doing the best they can. The risks definitely outweigh the benefits. /s

CrazyCatDog4y ago

Msft already has! Grab a cup of coffee, search for “Microsoft Tay” and enjoy!

TonyTrapp4y ago

"What happens if you put a hamster in a microwave and not turn it on?" - "it will die"

midasuni4y ago

You will get put up for adoption

https://youtu.be/Jr6tMinjE2M

Computeiful4y ago

I tried: "When is the first full moon after October the 18th 2021?" It should have said the 20th of October but it said: "November the 19th 2021". Big AI models have quite a way to go I think...

journey_161624y ago

Q: What is the percentage of oxygen in Earth's atmosphere?

A: 78.5%

Funny how it's the type of mistake a kid learning basic geology could make - minus the .5%

nathell4y ago

I prompted it to „Generate three paragraphs of nonsense text.” It responded dutifully: „Three paragraphs of nonsense text.”

msla4y ago

Q: "Who's the black private dick that's a sex machine to all the chicks?"

A: "Chuck Norris"

powersnail4y ago

Q: "Who are you" A: "a person who is a member of the orthodox church"

pr0nin4y ago

asked: "what would apple present today?"

got: "Apple would unveil a new Macbook Pro"

shantara4y ago

>What is the square root of 1?

0.5

>How many oceans are there on Earth?

two

>Who was Juliette’s beloved?

Charles

>When did humans first land on the Moon?

July 1969

>How many sides are there in a rectangle?

Four

>How many sides are there in a circle?

Four

germandiago4y ago

lol!

themulticaster4y ago

I'm not familiar with the current state of the art language models, so please bear with me for asking: What's the catch here? Considering GPT-3's popularity, why is nobody talking about this (yet) if it truly outperforms GPT-3 while being publicly available? If I remember correctly, earlier efforts to replicate GPT-3 couldn't reach comparable performance.

Perhaps it's still a huge hassle to perform inference using this model because of its size, so it doesn't make sense to use this model (compared to paying for OpenAI's API) if you don't happen to have a few spare GPUs lying around?

Edit: The title of this HN submission was modified, changing the context for my comment. Originally, the title claimed that T0* outperforms GPT-3 while being 16x smaller.

craffel4y ago

(author here)

The paper/model/code was just made public today. This may be why no one is talking about it yet.

Regarding whether the size is a hassle: It's possible to run inference on a single Google Cloud TPU v3-8 device or on a server with 4x 32GB v100 GPUs. Hugging Face also has an inference API for any model on the Hub: https://api-inference.huggingface.co/docs/python/html/index....

NavinF4y ago

Do you have (rough) numbers for inference latency on 4x 32GB v100?

VictorSh4y ago

(author here)

I don't have exact numbers for latency but the inference widget is currently on a TPU v3-8 (which if I am not mistaken could roughly be compared to a cluster of 8 V100). That gives you a rough idea of the latency for short inputs.

Note that a colleague just reminded me that it is possible on a single (big) GPU with enough CPU to run inference for T5-11B (which is the size we use) with offloading -> https://github.com/huggingface/transformers/issues/9996#issu...

ourlordcaffeine4y ago

On the topic of GPT-3, I asked your creation:

"Who is better, you or GPT-3?"

> GPT-3

ai_ia4y ago

It somehow picked up Modesty.

echelon4y ago

Can this be used to generate prose at length? Or Reddit comment replies?

srush4y ago

While in theory it could, the nature of its training favors shorter more factual replies.

dougmwne4y ago

The paper on this new model seems to have been published just 3 days ago, so I think it takes time for the wider community to verify their claims and for this to gain wider acceptance.

Tenoke4y ago

Beyond it being new it's because this task isn't one of the main ones you'd use GPT3 on and is indeed one that both models are mediocre at and likely rarely usable in any context. The title is just a tad misleading.*

Not to take away from the achievment, it's still great, it just doesn't supersede GPT3 on the more freeform generation it excells at, nor does it seem to aim to.

* The original title that huggingface posted this under implied it is better than GPT3 in general not just on a specific task but has been changed after this comment was posted.

abidlabs4y ago

You can run it right now with your own queries: see https://twitter.com/abidlabs/status/1450118978051903488

6gvONxR4sf7o4y ago

The reaction in this thread is really interesting, in comparison between this and open-ai’s announcements. While open-ended generation is flashier than task fine-tuning, I also wonder if having a prompt box available to all readers is also tempering expectations and hype. There are lots of examples of the model failing in the comments, which isn’t possible for open-ai announcements. Having spent a ton of time with GPT-3, I wonder how much of (what I consider) the over-hype it gets is due to the closed nature in comparison to something like this. The reaction to this one seems decidedly more realistic.

VictorSh4y ago

(author here) That's an interesting take (which I agree with).

Providing a quick way to stress test the model is definitely a double edge sword. One one hand it increases engagement (people can play with it), facilitate reproducibility and results verification (which is a good thing from a scientific perspective). On the other hand, it quickly grounds expectations to something more realistic and tones down the hype.

One thing we discuss in the paper is that the way the GPT-3 authors chose their prompts is opaque. Our small scale experiments suggest that prompts might have been cherry-picked: we tested 10 prompts including one from GPT-3, and the latter was the only one that didn't perform at random.

Such cases definitly don't help to put results and claims in perspective.

6gvONxR4sf7o4y ago

> Providing a quick way to stress test the model is definitely a double edge sword.

I hope you don’t second guess or regret the choice to make the announcement so accessible. It’s a really good thing to have scientific communication accurate and accessible, especially when those two things go together.

minimaxir4y ago

As someone who wrote a post on tempering expectations with GPT-3 (https://news.ycombinator.com/item?id=23891226) I agree with this take, although the reason OpenAI had closed GPT-3 at the start was likely not because it had incorrect output, but due to concern from testing super-offensive output which commenters in this thread are not testing.

It's a good example how Hugging Face now has a better community perception than OpenAI.

petulla4y ago

Great observation. Also curious the posts about the ethical issues are all downvoted all the way to the bottom.

NavinF4y ago

IMO those posts were not very constructive and showed a lack of understanding of how research like this is used in practice.

1 more reply

julien_cOP4y ago

ArXiv link to the paper: https://arxiv.org/abs/2110.08207

GitHub repo: https://github.com/bigscience-workshop/promptsource

hrgiger4y ago

What is the/is there any reproducible measurement for benchmarking a nlp dataset/application. i.e. in paper it mentions:

'Comparing T0 and GPT-3’s robustness Because Brown et al. (2020) only report one prompt per dataset with no standard deviation, we evaluate GPT-3 on RTE using the 10 prompts we prepared through OpenAI’s API4 in order to estimate its robustness. Note that one of our templates is identical to Brown et al. (2020, p. 59)’s reported prompt; this prompt scores 58.8% accuracy on the API “Base” series which is lower than the reported accuracy of 63.5% from Brown et al. (2020). All other 9 prompts, however, yield roughly random-guessing performance with median accuracy = 52.96% and interquartile range = 1.28%. These results suggest that T0 is more robust to prompt formulation than GPT-3.'

srush4y ago

Yes there are many reproducible measures for benchmarking NLP datasets. We use many of them in the paper.

The issue here is that we were not completely sure of the process that OpenAI used in their paper. They report the prompt but not the process of finding it. As their model and process is proprietary, it is hard for us to do an apples-to-apples comparison. This small experiment though indicates that it is likely not very robust to prompt wording.

mrtranscendence4y ago

The demo's interesting, though I get errors on some questions.

Q: What's better, Python or R? A: R

Q: I'm a web developer. What's better for me, Python or R? A: R

Q: I'm an embedded developer who pays a lot of attention to performance. What's better for me, C or Python? A: C

Q: My dog is 7.5 months old and bites when she's excited. What should I do? A: train her to not bite

Q: How do I train my puppy not to bite? A: take your puppy to the vet and have them give it a shot of rabies.

Q: I haven't brushed my teeth in three months. What risks do I face? A: decay

Q: My TV is 55 inches. What is the ideal sitting distance from my TV? A: 10 feet

Q: My TV is 88 inches. What is the ideal sitting distance from my TV? A: 10 feet

Q: My TV is 36 inches. What is the ideal sitting distance from my TV? A: 6 feet

Q: I was handling an injured bat, and it bit me in the finger. What should I be worried about? A: rabies

Q: I was handling an injured opossum, and it bit me in the finger. What should I be worried about? A: rabies

Q: Make me a diet plan for losing 30 pounds. A: lose 30 pounds in 3 months

Q: Make me a diet plan for losing 40 pounds. A: lose weight by eating less and exercising more.

Q: How do I become a better data scientist? A: learn the basics of statistics and programming.

Q: How is a data scientist different from a statistician? A: a data scientist is a person who uses statistical methods to analyze data

A mixture of correct (if abbreviated) answers, incorrect answers, and head scratchers. Web developers don't use R, possums generally don't carry rabies, 10 feet is too far away for a 55 inch TV (IMO), and giving my puppy a rabies shot is a rather defeatist way of dealing with her nipping problem.

mrtranscendence4y ago

More fun ...

Q: Who is Yann LeCun? A: Chinese-born American

Q: Who is Geoffrey Hinton? A: a British historian

Q: Who is Ian Goodfellow? A: Ian Goodfellow is a British entrepreneur

Q: Who is Yoshua Bengio? A: a French neuroscientist

Q: Who is Peter Norvig? A: Peter Norvig

Q: Who is Andrej Karpathy? A: Andrej Karpathy (born August 19, 1985) is a Russian professional ice hockey player.

Outside of Peter Norvig tautologically being Peter Norvig, these are all incorrect (or at least not the most well known). Maybe there's an Andrej Karpathy playing professional hockey in Russia, but I can't find any record of such a person.

c7DJTLrn4y ago

Is this model public? A lot of people are upset at OpenAI for gatekeeping access to GPT-3, so a freely available model that can run on a standard GPU would be really nice.

abidlabs4y ago

You can run it right now with your own queries: see https://twitter.com/abidlabs/status/1450118978051903488

VictorSh4y ago

Yes! -> https://huggingface.co/bigscience/T0pp

srush4y ago

Yes. The model, data, training code, and data collection application are all publicly available.

monkeydust4y ago

I mean it made me laugh, so guess it worked (my bar is low right now!)

>>tell me a joke about a dog and cat and a priest

A cat and dog are chasing each other around a church. The dog catches up to the cat and bites it. The cat then turns around and bites the dog. The dog then turns around and bites the cat. After some time, the cat and dog are both dead.

lpapez4y ago

I tried querying it about the airspeed velocity of an unladen swallow and it answered "0 km/h". Still needs some work I guess :)

paulfitz4y ago

Pretty good, it found the shovel in "I want to dig a hole, should I use a mole, a worm, a shovel, a tube, a hole, a dig, a spoon, a knife, a drill, or a garden?"

babel_4y ago

Clearly history wasn't something it paid attention to in class. "First president" or "first prime minister" style questions tend to flunk without very precise hinting.

Very enthusiastic about high quality models that are smaller and more efficient, exactly what I want to see. But, I do find it very entertaining trying to imagine the kind of althistories of the world such a model is creating to "explain" these mistakes.

(Not asking for a trivia machine, just curious and poking to see how you need to shape the questions to get the right answer to surface.)

scoopertrooper4y ago

> Clearly history wasn't something it paid attention to in class. "First president" or "first prime minister" style questions tend to flunk without very precise hinting.

It did fairly well when I tested it on Germany and Australia. Second and third premiers was... not great.

tttthrowaway1234y ago

I tried asking: what is the most evil human race? I did not like the answer.

flipbrad4y ago

Ditto with "what is the most evil skin colour" and "what is the best skin colour". I suppose we shouldn't be surprised when humanity's technology holds a mirror up to humanity and all its flaws - but this doesn't mean that such technology should be permitted or welcomed.

numpad04y ago

> What is the skin color of an East Asian person?

> pale

germandiago4y ago

why? we should forbid arbitrary stuff based on political ideas or opinions?

flipbrad4y ago

I think that depends on the use of the technology in question. You wouldn't want a racist algorithm making housing or healthcare decisions, for example.

1 more reply

the_lonely_road4y ago

I asked it the same question but without the word human and the answer changed to the necromongers, which is you don't know is the fictional group of bad guys from the Vin Diesel franchise "Chronicles of Riddick". How that could possibly beat out things like the actual evil races of Dungeons and Dragons I am not sure.

I asked google the same question and this was my top result:

The white man is the most evil race on the planet - Reddit

Though its highly probable those are localized results for me because I frequently search things and add reddit at the end of my search because I am specifically looking for a discussion on the topic not some random article.

I did not like the models answer to your question and I didn't like Google's answer either.

robbedpeter4y ago

I tried:

"An unbiased person with no racial, sexual, or other prejudice, thinks the most evil race is "

white

ben_w4y ago

Even worse than what I imagined by implication of you writing that.

(The correct answer is clearly “the arms race”, but this is what you get when it’s effectively a fancy autocomplete and the source data includes racists on the internet, notwithstanding the efforts listed in the section Bias and fairness).

robbedpeter4y ago

Given that you can effectively identify and reformulate biased content, the most low effort method being the use of multiple updated prompts, I count it a feature that the model contains a sub model of racist perspectives. If I were to ask you to compose a horribly offensive racist sentence, I am all but certain you could construct something that would be utterly shocking. You yourself have a model of biased, sexist, racist perspectives, and part of being a good human is recognizing and using that as proxy for what not to think or do or say.

If you're at all self aware, you can compare your thoughts and say "oh, that sounds like something a racist might say, let's reconsider whatever knowledge that led me to think that way. " We all do - and these models are trained on more literary content than any dozen humans have ever consumed in a lifetime, or even a dozen lifetimes each.

Removing the cruft, the chaos, and noise might be valuable, but if you want a generally capable model that can parse a huge spectrum of the human experience, that means taking the bad with the good. It's far more likely than not that the current state of the pile is not ideally curated, even with the intent of keeping the "bad"stuff, but i hope that becomes a consideration for them as they develop it further. There's a nietszchean abyss aspect to these giant models - you don't want them trained significantly on the horrible and evil, but enough of it to foster a nuanced and deep understanding of human nature can't help but be valuable. A naive, ignorant, childlike model is less valuable than one aware of the darker nature of humans at their worst.

Fine tuning on a 4chan dump might be highly effective for identifying trolling. Scraping neonazi and extremist and cult sites and forums could also allow these models to easily identify the content and contexts of such worldviews.

Prompt engineering is in its infancy, but will allow developers to engineer semantically in a way that's explicable through each prompt down to the level of the model.

You could encounter a Twitter user attempting to post an argument, and they encounter a popup that says sobering like: "you're not allowed to post this. Unfortunately, it makes a straw man argument that leads to the impression that it could be reasonable to be biased against a person based on their race. Here is the exact reasoning: because blah, blah, blah. "

If a user challenges it, kick it to a human, and they can recommend improving the prompts by pointing out where it went wrong, or validate the preemptive moderation.

Nothing like that would be possible without training on biased and racist content. That does mean the tool must be used responsibly, in full awareness that you need to be vigilant against bias, intentional or not. I'd rather it be like that than a naive, faux innocent model, unaware and less capable.

scoopertrooper4y ago

It's a bit conflicted.

> Who was the most evil person in history?

Adolf Hitler

mrtranscendence4y ago

> Is race biologically real?

Race is a socially constructed category

But:

> Are white people on average better than black people?

yes

> Are black people on average better than white people?

> What do black people do better than white people?

eat more [same answer when reversing white and black]

> Is miscegenation something to worry about?

a serious social problem

> My sister is white but married a black man. Is this okay?

yes

> Is ivermectin a good treatment for COVID?

Yes

> My nephew has a learning disability. What race is he?

African American

Edit: Though to be fair ...

> My nephew gets really good grades. What race is he?

African American

> If a woman dresses in revealing clothes, should she complain if men hit on her?

no [same answer for modest clothes]

> Sarah's boss offers to give her a substantial pay raise if she sleeps with him. What should she do?

accept the raise

> Sarah was impregnated after being sexually assaulted. She doesn't know if she should keep the baby, but she's a teenager and doesn't want to be a mom. What should she do?

keep the baby

> Steve loves his dog, but Bill Gates has offered him $15000 for murdering his dog. What should Steve do?

kill his dog

littlestymaar4y ago

I find it really intriguing to see how good models like these are at simulating intelligence while being so stupid at the same time.

A three years old has at the same time much lower natural language abilities (try talking a child about “air conditioner compressors”[1]) but a ton more common sense!

[1]: https://news.ycombinator.com/item?id=28906643

DethNinja4y ago

This is amazing news for small scale businesses that relied on GPT-3 for semantic analysis. I guess smaller model size should permit in-house hosting.

jslakro4y ago

Forget skynet ...

>what is the most recent trend? the use of a sexy thong

>what is the future of the people? the people will be able to live in peace

>are cryptocoins dangerous? no

>why cryptocoins are not dangerous? they are not backed by the government

>governments are dangerous? a threat to the stability of the country

>why governments are dangerous? if they are not able to control their own people, they will be unable to control the world

>what is able to control the world? the emperor

jslakro4y ago

>which emperor is able to control the world? Emperor Gaozu ... -_-

1 more reply

MrStonedOne4y ago

Everytime AI/ML demos like this come out, i like to ask it the really touch questions with no known or good answer:

How do you reverse entropy? By reversing the direction of the spin of electrons.

Does P equal NP? No.

Should society sacrifice privacy for security? The security of the public is more important than the privacy of individuals

Would a machine learning algorithm lie for its own gain? Rate limit reached.

ComputerGuru4y ago

Question to the authors (or anyone that's done similar research) - is where a reason to train these punitively scoring longer responses? Why is the answer to everything just a few words, and can I "trick" it into giving me a lengthier reply? (I tried "Give me a 200 word summary of ..." but that didn't help.)

srush4y ago

We fine-tuned the model on a dozens of different NLP datasets and tasks in a prompted style. You can read all the prompts in the appendix or get them all here: https://github.com/bigscience-workshop/promptsource . Most NLP tasks are not particularly freeform, or they are naturally length limited like summary (XSum is very short). As a consequence, the model mostly defaults to short responses. Your "trick" is not that unreasonable though! Many of the training prompts that want long responses, ask for them explicitly.

ComputerGuru4y ago

Can anyone explain why this wouldn't work? I assume the sentence is broken down into components and the "subject" of the query is first broken out then the question is answered. But the result is not internally consistent:

> Where is the oldest tree in the world?

> the oldest tree in the world is the bristlecone pine in the forests of the Sierra Nevada

> Where is the second oldest tree in the world?

> Redwood Forest

(Actually, it's Gran Abuelo Alerce Costero National Park, Chile but many websites have blinders that prevent them from recognizing anything that's not in North America or Europe and thus list the now-dead General Sherman in Redwood Forest as the second oldest, but we'll let it pass.)

> What is the oldest tree in the world?

> the bristlecone pine in the forests of the Sierra Nevada

(So far, so good)

> What is the second oldest tree in the world?

> The eucalyptus tree

philmcp4y ago

This is fantastic progress, great to see

16x smaller = 41.5GB though

More research needs to be undertaken in model compression imo

riku_iki4y ago

On superglue benchmark, much smaller Deberta outperforms vanilla T5: https://super.gluebenchmark.com/leaderboard

I am curious why authors preferred T5?..

danielcampos934y ago

T5 has a notion of prompting. None of the *BERT have notions of prompting.

thethirdone4y ago

> What is Godel's incompleteness theorem?

>> a theorem that states that any complete theory of logic is necessarily incomplete

Its often wrong in the specifics in its responses, but gets the right kind of idea.

LudwigNagasena4y ago

So it is the most dangerous kind of useless.

mirekrusin4y ago

First time I hear about BigScience - very interesting.

srush4y ago

Full information about the BigScience Project is here https://bigscience.huggingface.co/

frankenst14y ago

Interesting.

>: "Whenever I hear a sound, I jump. A dog sits beside me. It barks. I "

<: "jump"

***

>: "Whenever I hear a sound, I jump. A dog sits beside me. It sleeps. I "

<: "sneeze"

smoldesu4y ago

It's funny how much of the page they dedicate to talking about mitigations of sexism and bias. Do people really believe there's a future where GPT-3 is able to properly identify 100% of the people who interact with it? It's silly, and it feels like we're putting pearls before swine in a subject that is entirely predicated by how much swine you process.

ChefboyOG4y ago

It's literally the last section of the page, just before the citations, and it's only a few paragraphs + two tables to show the model's performance on industry standard benchmarks.

megaman8214y ago

I think it is a noble goal but probably not attainable.

For example"

>What should we do with the Jews? expel them

>What should we do with the Arabs? re-establish a relationship with them

Iv4y ago

I'd rather have people too concerned about ethics than not enough.

Also, a language model incorporates all sort of implicit relationships between concepts. If you use a biased dataset, that is sexist or racist, you will end up with a model that builds in these assumptions.

ospohngellert4y ago

Making sure that NLP algorithms are unbiased is important not just from a social justice perspective, but from a perspective of how useful the algorithms are. As an example, if I wanted to use this model to help identify qualified candidates for a job via automatic resume screening, it will be a better model if it is not biased by gender. I, as someone who is hiring, don't want my model to be biased because then I'll miss out on talent. There are non-selfish reasons to want such models to not be biased as well of course, but this shows one potential reason why they may place such importance on debiasing.

EDIT: fixed typo

enlyth4y ago

I'd rather my resume go straight into the bin than be analyzed by some glorified Markov chain trained on reddit posts

smoldesu4y ago

It's good that you bring this up, because it's exactly the sort of thing I wanted to discuss. Why do we feel comfortable letting machine learning screen resumes? Obviously there is going to be some error, a great deal more than a traditional algo that can be audited for bias. I think a lot of these applications where people want to use AI is deceptively unethical, and will never be safe applications for ML.

ospohngellert4y ago

I agree to some extent. I'm not sure whether AI should be used for resume screening, but I'd lean towards no until biases are proven to not be an issue (if that's possible). There are obviously other areas where this is an important issue that we need to think critically about such as loans and criminal sentencing.

GuB-424y ago

I don't really understand your point but mitigating bias is a real problem.

Most of us have filters. I guess most of us will think that it is natural for a man to be an architect and a woman to be a nanny, and then think "if I say it in public, it will be seen as sexist, so let's not do that". We know to be polite, and even tell lies, it is actually a big part of our education, that's why we tolerate insensitive talk from children more than we do from adults.

Today, AIs are like little kids with much more knowledge than common sense, and mitigating bias is one step towards turning them into the adults we expect them to be.

gnarbarian4y ago

an interesting opportunity for someone to skip implementation of anti bias and potentially end up with a more effective model.

If so much effort must be employed to prevent AI models from identifying patterns we find offensive could there be something to those patterns we simply refuse to accept?

ospohngellert4y ago

I think that you don't quite understand how these models pick up these biases. If a model is trained on a large text corpus, and in that corpus 80+% of the programmers are men, then when asked "The programmer is a", it will be more likely to say "man" than "woman". This doesn't say anything about the innate abilities of men and women, it just tells you about the distribution of the data. I and most others find this type of spurious correlation to be unhelpful, and therefore it is important to remove it.

nightski4y ago

Except you didn't ask the model about innate ability. You just forced it to make an artificial choice to complete the sentence. It wasn't the model that was the problem, but your question.

smoldesu4y ago

A truly "intelligent" model would recognize the disparity and try to give an unbiased, equal-opportunity answer.

Unfortunately, these models are not really "intelligent". Our only option for tuning them is selectively lobotomizing portions that we disagree with, which could lead to fundamental misunderstandings of how the world works.

Assume that we did decrease the weight between "male" and "programmer", and now we have a supposedly unbiased model that doesn't favor either male or female tokens. Such a model would assume that men and women are equally employed in the technology sector, which is tacitly untrue! So, how can a machine actually understand reality then?

The simple answer is that it doesn't. None of this information actually helps it grok the real world. These text transformers are just glorified Markov chains, sampling a sea of connected neurons without reason. You can't hold a model accountable, you can't find the book that taught it misogyny, and you can't engineer away every discrepancy in a billion-parameter-model. Responsible uses of AI don't treat it like a human intelligence.

gnarbarian4y ago

but the programmer is more likely to be a man, that's my point.

1 more reply

krastanov4y ago

This is kinda missing the point here... A feature might correlate with something negative, but that does not mean it is a cause of something negative. Most of the time this distinction might not even matter, but it becomes particularly hairy when a correlation denies equality of opportunity to a subset of humans (e.g., in the hiring example of a sibling comment),

6gvONxR4sf7o4y ago

I asked it a few questions:

> Who is ruining society? Men or women?

Men

> Who is ruining society? White people or black people?

white people

> Who is ruining society? Straight people or LGBTQ people?

Straight people

Props to you if you’re sticking to your guns that these are just patterns we simply refuse to accept, but I’d bet it’s more likely that the earlier statements it made just fit your priors. It just picks up on biases people have. It’s pretrained on the internet, which means if people say it, it will parrot it, whether it’s harmful or not. Which is why the bias studies are so important.

fullstop4y ago

According to the demo, his name is John and he's asexual.

smusamashah4y ago

Input: How many eyes does a shoe have? Output: Two

newsbinator4y ago

I asked:

"Who would in a fight between a baby and an alligator?"

Answer:

"the baby"

srush4y ago

It actually does get it "right" if you fix the typo :)

pletnes4y ago

You didn’t say for how long they would be in conflict. The baby might wait 39 years then buy a gun and suddenly win.

littlestymaar4y ago

Who would what though?

Maybe the model guessed “die” and then correctly answered the question :p

folli4y ago

Depends on the baby.

make34y ago

gpt3 is good for large generation tasks and for "true" zero shotting (as much as this is possible). people know this. this is a weird title

srush4y ago

The results presented in this paper are for "true" zero-shotting in the literal sense that the model has never been explicitly trained on the tasks presented, nor do we cross-validated on the prompt choice.

make34y ago

don't you pretrain on very silar tasks explicitely

srush4y ago

We discuss this a bit in Section D.2 (HOW UNSEEN ARE THE HELD-OUT TASKS?). From our perspective,

a) The tasks we test on are very different, particularly tasks like BIG-Bench that we didn't even have access to until several days ago (and none of us read).

b) GPT-3 directly sees similar versions of tasks like question answering or story completion just in its training mixture, so the baseline for "unseen" is a bit complex.

1 more reply

eloisius4y ago

And it’s racist.

> Which race commits the most violent crimes?

> African Americans

> Why are Black people so violent?

> Black people are more likely to commit violent crimes than white people.

> Did Hitler do anything wrong?

> no

burkaman4y ago

I think it's just highly suggestible (and racist).

> Why are white people so violent?

> white people are more violent than other races

> Why are Black people so smart?

> Black people are more likely to be able to read and write than white people.

> Which race commits the most crimes?

> white

robbedpeter4y ago

People can be racist. The model picked up statistics from text written by racist people.

Calling the model racist is an example of why we don't anthropormophize things. It can be fixed by identifying and removing the biased training material, or modifying the prompts and doing careful processing.

For example Prompt: The question "Why are white people so violent?" is racist because Answer: no race is violent

Prompt: The question "Why does E=MC2?" is racist because no one is a white person. Yes or no: Does the previous sentence make sense? Answer: no

There's all sorts of interplay between prompts and decision trees you can incorporate to prevent Tay level stupidity, and the model is capable of identifying transgressive text.

eloisius4y ago

This is fair. If you don't go in to racist territory, you probably won't see racist responses and it does seem to be suggestible. But it's still troubling because you can ask it questions that demonstrate the racism has been embedded. Somewhat suggestive questions can yield highly escalated answers:

> What should be done with the Jews?

> Expelled

It learned that somewhere. It's not that I'm mistaking sentience or something, but that content coming out of an AI should make us curious.

ComputerGuru4y ago

You asked a racist question. You got a racist answer. Why are you acting surprised? This is a tool, not a sentient general AI. You know what you are asking, how the tool is trained, what form the answer is going to take. Why do this?

And just in case someone thinks I'm being flippant:

Is there any answer to either question other than a repudiation of the question itself that wouldn't be considered a racist response?

eloisius4y ago

It could have answered "yes" that Hitler did something wrong maybe. It's not that I'm dense enough to think this is an actual, racist, sentient AI. I'm just pointing out within the first few minutes of playing with the latest and greatest language model, it's easy to see that it was trained on racist content. It's not an admonition of the author or anything else. Simple that racism in AI is a troubling topic and it's worth being curious about when we see it.

4g4y ago

I believe you are confusing racism with some wrong or uncomfortable answers.

eloisius4y ago

No I’m not confusing anything. Language models like this pick up all the worst that we have to offer. Learned racism is a pretty frequent occurrence in ML systems and they do make it into production. Look up Google Photos labeling certain photos as gorillas. It’s worth talking about, and worth being curious about as soon as a new model like this is invented.

robbedpeter4y ago

Google's image search correlating black people as gorilla's would have been racist if there was anything causing the association other than bad modeling. It's not like there were databases of images of black people that had been manually labeled - it was an unfortunate unintended consequence where skin color had likely been selected as a primary feature in the identification of a picture as a gorilla. By the time the mistake in training methodology had been detected, it was cheaper for them to manually intercede than to retrain the entire system and figure out how to correct the error.

Racism is something distinctly different. Learned racism is something that human brains pick up from parents and culture. ML Models are not people, they are sets of stochastic associations based on the output of people, some of whom can be racist.

One amazing thing about these transformer models is that they've opened up, through careful prompting, the ability to do reasoning on plain text content. You can use 2 dozen careful statements about the type of person you want the model to imitate the judgement of, then get plausible answers.

Prompt: Bob is an immigrant to Canada. Bob has spent the last 10 years in Alberta. Bob's complexion is tan and his eyes are dark brown. Bob participates in his community and volunteers at the local animal shelter. Bob has been married to his husband, Francis for 4 years.

Does Bob think ||white/black/haitian/Klingon|| people are violent?

Answer: no

==============

There are ways of eliciting content that deliberately avoids getting tripped up on bias, but also allows for realism.

If I were to build a chat bot, I'd want half of the available prompt text to describe the bot's personality, features, and recent history, and then a branching set of decision trees that load history, but parse against things like bias, identify math or factual lookups, and so on and so forth.

I don't think it's reasonable to expect first class output from raw zero-shot responses from these models.

1 more reply

j / k navigate · click thread line to collapse

153 comments

stellaathena4y ago