AI hype is built on flawed test scores (opens in new tab)

(technologyreview.com)

204 pointsantondd2y ago230 comments

230 comments

mg2y ago

I don't think the "hype" is built on test scores.

It is built on the observation how fast AI is getting better. If the speed of improvement stays anywhere near the level it was the last two years, then over the next two decades, it will lead to massive changes in how we work and which skills are valuable.

Just two years ago, I was mesmerized by GPT-3's ability to understand concepts:

https://twitter.com/marekgibney/status/1403414210642649092

Nowadays, using it daily in a productive fashion feels completely normal.

Yesterday, I was annoyed with how cumbersome it is to play long mp3s on my iPad. I asked GPT-4 something like "Write an html page which lets me select an mp3, play it via play/pause buttons and offers me a field to enter a time to jump to". And the result was usable out of the box and is my default mp3 player now.

Two years ago it didn't even dawn on me that this would be my way of writing software in the near future. I have been coding for over 20 years. But for little tools like this, it is faster to ask ChatGPT now.

It's hard to imagine where we will be in 20 years.

rob742y ago

The article doesn't say that LLMs aren't useful - the "hype" they mean is overestimating their capabilities. An LLM may be able to pass a "theory of mind" test, or it may fail spectacularly, depending on how you prompt it. And that's because, despite all of its training data, it's not capable of actually reasoning. That may change in the future, but we're not there yet, and (AFAIK) nobody can tell how long it will take to get there.

denton-scratch2y ago

> And that's because, despite all of its training data, it's not capable of actually reasoning. That may change in the future [...]

I don't think so. When you say "it's not capable of actually reasoning", that's because it's a LLM; and if it "changes in the future", that's because the new system must no longer be a pure LLM. The appearance of reasoning in LLMs is an illusion.

1 more reply

kromem2y ago

That's not the case. It's very much in the realm of "we don't know what's going on in the network."

Rather than a binary it's much more likely that it's a mix of factors going into results that includes basic reasoning capabilities developed from the training data (much like board representations and state tracking abilities developed feeding board game moves into a toy model in Othello-GPT) as well as statistic driven autocomplete.

In fact often when I've seen GPT-4 get hung up with logic puzzle variations such as transparency, it tends to seem more like the latter is overriding the former, and changing up tokens to emoji representations or having it always repeat adjectives attached to nouns so it preserves variation context gets it over the hump to reproducible solutions (as would be expected from a network capable of reasoning) but by default it falls into the pattern of the normative cases.

For something as complex as SotA neural networks, binary sweeping statements seem rather unlikely to actually be representative...

1 more reply

YetAnotherNick2y ago

> it's not capable of actually reasoning

Define reasoning. Because in my definition GPT 4 could reason without doubt. It definitely can't reason better than experts in the field, but it could reason better than say interns.

6 more replies

smaddox2y ago

> And that's because, despite all of its training data, it's not capable of actually reasoning.

Your conclusion doesn't follow from your premise.

None of these models are trained to do their best on any kind of test. They're just trained to predict the next word. The fact that they do well at all on tests they haven't seen is miraculous, and demonstrates something very akin to reasoning. Imagine how they might do if you actually trained them or something like them to do well on tests, using something like RL.

3 more replies

dinosaurdynasty2y ago

LLMs are trained to predict text, and one of the results of this is the LLM has as many "faces" as exist in the training data, so it's going to be _very_ different depending on the prompt. It's not a consistent entity like a human. RLHF is an attempt to mediate this, but it doesn't work perfectly.

jug2y ago

I’m often confused over claims on the reasoning capabilities. It is often mentioned in debates as a clear and undeniable issue with current LLM’s. So since this claim can be made, where are said tests about reasoning skills that GPT-4 fails?

If it’s a debate on the illusion of reasoning, I’d be careful how I step here, because it’s been found these things probably work so well because the human brain is also a biological real-time prediction machine and “just” guessing too: https://www.scientificamerican.com/article/the-brain-guesses...

CryptoBojack2y ago

Isn't that the same as for humans? If you are speaking with me (prompting), my answers will be differents, based on how you prompted me for an answer.

denton-scratch2y ago

> I was mesmerized by GPT-3's ability to understand concepts

This language embodies the anthropomorphic assumptions that the author is attacking.

dboreham2y ago

Or the corollary: that there's really no such thing as anthropomorphic. There's inputs and outputs, and an observer's opinion on how well the outputs relate to the inputs. Thing producing the outputs, and the observer, can be human or not human. Same difference.

2 more replies

gmerc2y ago

This.

We are in a Cambrian Explosion on the software side and hardware hasn’t yet reacted to it. There’s a few years of mad discovery in front of us.

People have different impressions as to the shape of the curve that’s going up and right, but only a fool would not stop and carefully take what is happening.

kossTKR2y ago

Exactly and things are actually getting crazy now. Pardon the tangent but for some reason this hasn't reached the frontpage on HN yet: https://github.com/OpenBMB/ChatDev

Making your own "internal family system" of AI's is a making this exponential (and frightening), like an ensemble on top of the ensemble, with specific "mindsets", that with shared memory can build and do stuff continuously. Found this from a comp sci professor on Tiktok so be warned: https://www.tiktok.com/@lizthedeveloper/video/72835773820264...

I remember a couple of comments here on HN when the hype began about how some dude thought he had figured out how to actually make an AGI - can't find it now, but it was something about having multiple ai's with different personalities discoursing with a shared memory - and now it seems to be happening.

This coupled with access to linux containers that can be spawned on demand, we are in for a wild ride!

2 more replies

adrians12y ago

> If the speed of improvement stays anywhere near the level it was the last two years, then over the next two decades, it will lead to massive changes in how we work and which skills are valuable.

That's a big assumption to make. You can't assume that the rate of improvement will stay the same, especially over a period of 2 decades, which is a very long time. Every advance in technology hits diminishing returns at some point.

mg2y ago

Why do you think so?

Technological progress seems rather accelerated than diminishing to me.

Computers are a great example: They have been getting more capable exponentially over the last decades.

In terms of performance (memory, speed, bandwidth) and in terms of impact. First we had calculators, then we had desktop applications, then the Internet and now we have AI.

And AI will help us get to the next stage even faster.

1 more reply

james-revisoai2y ago

A lot of the progress in the last 3-4 years was predictable from GPT-2 and especially GPT-3 onwards - combining instruction following and reinforcement learning with scaling GPT. With research being more closed, this isn't so true anymore. The mp3 case was predictable in 2020 - some early twitter GIFs showed vaguely similar stuff. Can you predict what will happen in 2026/7 though, with multimodal tech?

I simply don't see it a being the same today. The obvious element of scaling or techniques that imply a useful overlap isn't there. Whereas before researchers brought together excellent and groundbreaking performance on different benchmarks and areas together as they worked on GPT-3, since 2020, except instruction following, less has been predictable.

Multi modal could change everything (things like the ScienceQA paper suggest so), but also, it might not shift benchmarks. It's just not so clear that the future is as predictable or will be faster than the last few years. I do have my own beliefs similar to Yann Lecun about what architecture or rather infrastructure makes most sense intuitively going forward, and there's not really the openness we used to have from top labs to know if they are going these ways, or not. So you are absolutely right that it's hard to imagine where we will be in 20 years, but in a strange way, because it is much less clear than in 2020 where we will be in 3 years time onwards, I would say it is much less guaranteed progress than it is felt by many...

huijzer2y ago

I was also thinking about how quickly AI may progress and am curious for your or other people's thoughts. When estimating AI progress, estimating orders of magnitude sounds like the most plausible way to do it, just like Moore's law has guessed the magnitude correctly for years. For AI, it is known that performance increases linearly when the model size increases exponentially. Funding currently increases exponentially meaning that performance will increase linearly. So, AI will increase linearly as long as the funding does too. On top of this, algorithms may be made more efficient, which may occasionally make an order of magnitude improvement. Does this reasoning make sense? I think it does but I could be completely wrong.

RC_ITR2y ago

You can check my post history to see how unpopular this point of view is, but the big "reveal" that will come up is as follows:

The way that LLMs and humans "think" is inherently different. Giving an LLM a test designed for humans is akin to giving a camera a 'drawing test.'

A camera can make a better narrow final output than a human, but it cannot do the subordinate tasks that a human illustrator could, like changing shadings, line width, etc.

An LLM can answer really well on tests, but it often fails at subordinate tasks like 'applying symbolic reasoning to unfamiliar situations.'

Eventually the thinking styles may converge in a way that makes the LLMs practically more capable than humans on those subordinate tasks, but we are not there yet.

ulnarkressty2y ago

Most of the improvements apparently come from training larger models with more data. Which is part of the problem mentioned in the article - the probability that the model just memorizes the answers to the tests is greatly increased.

AI is getting subjectively better, and we need better tests to figure out if this improvement is objectively significant or not.

tuatoru2y ago

> Most of the improvements apparently come from training larger models with more data.

OpenAI is reportedly losing 4 cents per query. With a thousandfold increase in model size, and assuming linear scale in cost, that's a problem. Training time is going to go up too. Moore's law isn't going to help any more. Algorithmic improvements may help...if any significant ones can be found.

Nevermark2y ago

That’s backwards.

Training a model on more data improves generalization not memorization.

To store more information in the same number of parameters requires the commonality between examples to be encoded.

In contrast, the less data trained on, especially if repeated, lets the network learn to provide good answers for that limited set without generalizing. I.e. memorizing.

——

It’s the same as with people. The more variations people see of something, the more likely they intuit the underlying pattern.

The fewer examples, the more likely they just pattern match.

1 more reply

qup2y ago

I pretty much want the LLM to be great at memorizing things. That's what I'm not great at.

If it had perfect recall I would be so thrilled.

And just because it's memorized the data--as all intelligences would need to do to spit data out--doesn't mean it can't still do useful operations on the data, or explain it in different words, or whatever a human might do with it.

danielmarkbruce2y ago

Do we? I use gpt-4 daily and it matters not to me what the source of the "intelligence" is. It's subjective what "intelligence" even means. It's subjective how the brain works. Almost by definition AI is "things that can't be objectively measured".

COAGULOPATH2y ago

What's the benefit of doing this vs copying one of the many (far superior) Javascript mp3 players on the internet, such as here?

https://freefrontend.com/javascript-music-players/

roenxi2y ago

It'd be a bit faster to get up and running with ChatGPT. In the AI, you'd have to phrase the instruction & copy the output into a file. For search, you have to do both those things and learn a UI that wasn't built to taste.

kristopolous2y ago

Almost nothing happened in AI for about 50 years. That's the normal in the field.

happycube2y ago

I got curious and did this myself. Needed a bit of nudging to get where I wanted, but I even had it make an Electron wrapper:

https://chat.openai.com/share/29d695e6-7f23-4f03-b2be-29b7c9...

dchuk2y ago

This is awesome, thanks for sharing.

Do you (or anyone) know of any products that allow for iterating on the generated output through further chatting with the ai? What I mean, is that each subsequent prompt here either generated a new whole output, or new chunks to add to the output. Ideally, whether generating code or prose, I’d want to keep prompting about the generated output and the AI further modifies the existing output until it’s refined to the degree I want it.

Or is that effectively what Copilot/cursor do and I’m just a bad operator?

2 more replies

mensetmanusman2y ago

I used gpt4 to write a script that I can ssh from my iPhone to a m1 that downloads the mp3 from a yt url on my iPhone clipboard. The only thing I am missing is automating the sync button when the iPhone is on the same home wifi to add the mp3 to the music app.

Jensson2y ago

> Two years ago it didn't even dawn on me that this would be my way of writing software in the near future

So you were ignorant two years ago, GitHub Copilot was already available to users back then. The only new big thing the past two years was GPT-4, and nothing suggest anything similar will come the next two years. There are no big new things on the horizon, we knew for quite a while that GPT-4 was coming, but there isn't anything like that this time.

mg2y ago

Copilot was not around when I wrote the Tweet.

But when Copilot came out, I was indeed ignorant! I remember when a friend showed it to me for the first time. I was like "Yeah, it outputs almost correct boilerplate code for you. But thankfully my coding is so that I don't have to write boilerplate". I didn't expect it to be able to write fully functional tools and understand them well enough to actually write pretty nice code!

Regarding "there isn't anything like that this time." : Quite the opposite! We have not figured out where using larger models and throwing more data at them will level off! This could go on for quite a while. With FSD 12, Tesla is already testing self driving with a single large neural net, without any glue code. I am super curious how that will turn out.

The whole thing is just starting.

1 more reply

isaacfung2y ago

Whisper, Stable Diffusion, VoiceBox, GPT4 vision, DALL.E3

Other breakthroughs in graph machine learning https://towardsdatascience.com/graph-ml-in-2023-the-state-of...

1 more reply

kfk2y ago

AI hype is really problematic in Enterprise. Big companies are now spending C executive time figuring out a company "AI strategy". This is going to be another cycle of money-wasted/biz-upset, very similar to what I have seen with Big data. The thing in Enterprise is that everyone serious about biz operations knows AI test scores and AI quality is not there, but very few are able to communicate these concerns in a constructive way, rather everyone is embracing the hype because, maybe they get a promotion? Tech, as usual, is very happy to feed the hype and never, as usual, telling businesses honestly that, at best, this is an incremental productivity improvement, nothing life changing. I think the issue is overall lack of honesty, professionalism, and accountability across the board, with tech leading this terrible way of pushing product and "adding value".

huijzer2y ago

> Tech, as usual, is very happy to feed the hype

I agree completely with you on this.

In defence of the executives however is that some businesses will be seriously affected. Call centres and plagiarism scanner have already been affected, but it’s unclear which industries will be affected too. Maybe the probability is low, but the impact could be very high. In think this reasoning is driving the executives.

kfk2y ago

Look, I am going to wait and see on this, maybe new facts will make me reconsider. In the meanwhile, github Copilot is just cost to my company, haven't seen much additional productivity. I guess my concern, given how hard is to hire developers and technologists, is replacing simpler job roles, like a customer service representative, with complicated new ones, like "MLOps Engineer".

1 more reply

padjo2y ago

It’s rational herd dynamics for the execs. Going against the herd and being wrong is a career ender. Going with the herd and being wrong will be neutral at worst.

mr_toad2y ago

Going against the herd and being right can also be a career ender.

jacobr12y ago

Blindly following a trend will likely not end well. But even with previous hype cycles, those companies that identified good use cases, validated those use cases, and had solid execution of the projects leaped ahead. Big Data was genuinely of value to plenty of organizations, and a waste of time for others. IoT was crazy for plenty of orgs ... but also was really valuable to certain segments. Gartner's hype cycle ends with the plateau of productivity for a reason ... you just have to go through the trough of disillusionment first, which is going to come from a the great multitudes of failed and ill-conceived projects.

janalsncm2y ago

Identifying an “AI strategy” seems backwards. What they should be doing is identifying the current problems and goals of the company and reassessing how best to accomplish them given the new capabilities which have surfaced. Perhaps “AI” is the best way. Or maybe simpler ways are better.

I’ve said it before, but as someone to whom “AI” means something more than making API calls to some SAAS, I look forward to the day they hire me at $300/hour to replace their “AI strategy” with something that can be run locally off of a consumer-grade GPU or cheaper.

hashtag-til2y ago

Agreed. I think there is a FOMO phenomena among C-level execs, that is generating a gigantic waste os money and time, creating distractions around “AI strategy”.

It started a few years back and it is now really inflamed with LLM, because of the consumer level hype and general media reporting about it.

You can perceive that by the multiple AI startups capturing millions in VC capital for absolutely bogus value proposition. Bizarre!

epups2y ago

The problem with your premise is that you're already drawing conclusions about the potential of AI and deciding it is hype. Perhaps decades ago someone could have equally criticised "Internet hype" and "mobile hype" and look foolish now.

soco2y ago

Also decades ago someone criticised "bigdata hype" and "microservices hype" and looks right now. Doing things just out of FOMO is rarely a good business decision. It can pay out, even a broken clock is right twice a day, but it's definitely bad to follow every new thing just because Gartner mentioned it. I'm not giving advice of course, but having seen enterprises betting good money even on NFT I tend to treat every new enterprise powerpoint idea with a certain dose of skepticism.

3 more replies

hdjjhhvvhga2y ago

> because, maybe they get a promotion?

While I agree with you in general, I don't think this bit is particularly fair. I'd say we know the limitations, and we also know that using LLMs might bring some advantage, and the companies that are able to use it properly will have a better position, so it makes sense to at least investigate the options.

dboreham2y ago

> AI hype is really problematic in Enterprise.

This only appears so because we here have some insight into the domain. But there have always been hype cycles. We just didn't notice them so readily.

The speed with which this happens makes me suspect there is a hidden "generic hype army" that was already in place, presumably hyping the last thing, and ready to jump on this thing.

chasd002y ago

in consulting all we hear is sell sell sell AI so i'm sure my industry isn't helping at all. I'm not on board yet, I just don't see a use case in enterprise beyond learning a knowledge base to make a more conversational self-help search and things like that. It's great that it can help right a function in javascript but that's not a watershed moment... yet. Curious to see AI project sales at end of 2024 (everything in my biz is measured in units of $$).

afro882y ago

In my case executives were more focussed on how it could be built into new projects, presales etc rather than internal efficiency improvements. A lot of people were amazed to see someone getting value out of it (efficiency gains) without building stuff around it. Blew my mind that this was the case.

gmerc2y ago

Publicly listed companies whose traditional business model is under pressure are incentivized to hype because if they don’t inspire there idea of sustained growth to their wary investors, they cautionary tale in form of Twitter (valuation low enough to lose control) exists.

In Capitalism, you grow or you die and sometimes you need to bullshit people about growth potential to buy yourself time

hashtag-til2y ago

Yes, sad but true.

JaDogg2y ago

This is exactly correct.

danielvaughn2y ago

I remember watching a documentary about an old blues guitar player from the 1920's. They were trying to learn more about him and track down his whereabouts during certain periods of his life.

At one point, they showed some old footage which featured a montage of daily life in a small Mississippi town. You'd see people shopping for groceries, going on walks, etc. Some would stop and wave at the camera.

In the documentary, they noted that this footage exists because at the time, they'd show it on screen during intermission at movie theaters. Film was still in its infancy in that time, and was so novel that people loved seeing themselves and other people on the big screen. It was an interesting use of a new technology, and today it's easy to understand why it died out. Of course, it likely wasn't obvious at the time.

I say all that because I don't think we can know at this point what AI is capable of, and how we want to use it, but we should expect to see lots of failure while we figure it out. Over the next decade there's undoubtedly going to be countless ventures similar to the "show the townspeople on the movie screen" idea, blinded by the novelty of technological change. But failed ventures have no relevance to the overall impact or worth of the technology itself.

mattkrause2y ago

> it's easy to understand why it died out

I think it's probably more sociological than technical. People love to see themselves and their friends/family. My work has screens that show photos of events and it always causes a bit of a stir ("Did you see X's photo from the summer picnic?") Yearbooks are perennially popular and there's a whole slew of social media.

However, for this to be "fun", there must be a decent chance that most people in the audience know a few people in a few of the pictures. I can't imagine this working well in a big city, for example, or a rural theatre that draws from a huge area.

iFire2y ago

Selfies and 15 second videos still exist as shorts and tiktoks.

kenjackson2y ago

What died out? Film?

danielvaughn2y ago

The practice of filming a montage around your local neighborhood or town to play during intermission. Though you could say intermission as well, since that was a legacy concept that was inherited from plays and eventually died out as well.

1 more reply

savanaly2y ago

>What died out?

The custom of showing film consisting of footage of the general public in movie theaters.

actionfromafar2y ago

Showing locals little movie clips of themselves in intermissions at the local theater.

randcraw2y ago

The debate over whether LLMs are "intelligent" seem a lot like the old debate among NLP experts whether English must be modeled as a context-free grammar (push down automaton) or finite-state machine (regular expression). Yes, any language can be modeled using regular expressions; you just need an insane number of FSMs (perhaps billions). And that seems to be the model that LLMs are using to model cognition today.

LLMs seem to use little or no abstract reasoning (is-a) or hierarchical perception (has-a), as humans do -- both of which are grounded in semantic abstraction. Instead, LLMs can memorize a brute force explosion in finite state machines (interconnected with Word2Vec-like associations) and then traverse those machines and associations as some kind of mashup, akin to a coherent abstract concept. Then as LLMs get bigger and bigger, they just memorize more and more mashup clusters of FSMs augmented with associations.

Of course, that's not how a human learns, or reasons. It seems likely that synthetic cognition of this kind will fail to enable various kinds of reasoning that humans perceive as essential and normal (like common sense based on abstraction, or physically-grounded perception, or goal-based or counterfactual reasoning, much less insight into the thought processes / perceptions of other sentient beings). Even as ever-larger LLMs "know more" by memorizing ever more FSMs, I suspect they'll continue to surprise us with persistent cognitive and perceptual deficits that would never arise in organic beings that do use abstract reasoning and physically grounded perception.

TeMPOraL2y ago

> LLMs can memorize a brute force explosion in finite state machines (interconnected with Word2Vec-like associations) and then traverse those machines and associations as some kind of mashup, akin to a coherent abstract concept.

That's actually the closest to a working definition of what a concept is. The discussion about language representation has little bearing on humans or intelligence, because it's not how we learn and use language. Similarly, the more people - be it armchair or diploma-carrying philosophers - try to find the essence of a meaning of some word, the more they fail, because it seems that meaning of any concept is defined entirely through associations with other concepts and some remembered experiences. Which again seems pretty similar to how LLMs encode information through associations in high-dimensional spaces.

freejazz2y ago

Can you recommend any books on this?

iambateman2y ago

This really is a good article, and is seriously researched. But the conclusion in the headline - “AI hype is built on flawed test scores” - feels like a poor summary of the article.

It _is_ correct to say that an LLM is not ready to be a medical doctor, even if it can pass the test.

But I think a better conclusion is that test scores don’t help us understand LLM capabilities like we think they do.

Using a human test for an LLM is like measuring a car’s “muscles” and calling it horsepower. They’re just different.

But the AI hype is justified, even if we struggle to measure it.

dleslie2y ago

Two years ago I didn't use AI at all. Now I wouldn't go without it; I have Copilot integrated with Emacs, VSCode, and Rider. I consider it a ground-breaking productivity accelerator, a leap similar to when I transitioned from Turbo Pascal 2 to Visual C 6.

That's why I'm hyped. If it's that good for me, and it's generalizable, then it's going to rock the world.

thomasfromcdnjs2y ago

Life longer programmer, and same sentiments, I use it everywhere I can.

I am currently transliterating a language PDF into a formatted lexicon, I wouldn't even be able to do this without co-pilot, it has turned this seemingly impossibly arduous task into a pleasurable one.

airstrike2y ago

Coding on something without copilot these days feels like having my hands tied. I'm looking at you, XCode and Colab...

GuB-422y ago

I don't think test scores have anything to do with the hype. Most people don't even realize test scores exist.

One is just to wow factor. It will be short lived. A bit like VR, which is awesome when you first try it, but it wears out quickly. Here, you can have a bot write convincing stories and generate nice looking images, which is awesome until you notice that the story doesn't make sense and that the images has many details wrong. This is not just a score, it is something you can see and experience.

And there is also the real thing. People start using GPT for real work. I have used it to document my code for instance, and it works really well, with it I can do a better job than without, and I can do it faster. Many students use it to do their homework, which may not be something you want, but it no less of a real use. Many artists are strongly protesting against generative AI, this in itself is telling, it means it is taken seriously, and at the same time, other artists are making use of it.

It is even use for great effect where you don't notice. Phone cameras are a good example, by enhancing details using AI, they give you much better pictures than what the optics are capable of. Some people don't like that because the picture are "not real", but most enjoy the better perceived quality. Then, there are image classifiers, speech-to-text and OCR, fuzzy searching, content ranking algorithms we love to hate, etc... that all make use of AI.

Note: here AI = machine learning with neural networks, which is what the hype is about. AI is a vague term that can mean just about anything.

Jensson2y ago

> I don't think test scores have anything to do with the hype. Most people don't even realize test scores exist.

They put the test scores front and center in the initial announcement with a huge image showing improvements on AP exams, it was the main thing people talked about during the announcement and the first thing anyone who read anything about gpt-4 sees.

I don't think many who are hyped about these things missed that.

https://openai.com/research/gpt-4

GuB-422y ago

It is what they talk about during announcements because people like numbers. It looks more serious than "hey look, GPT-4 smart" with some example quotes that anyone knows are cherry picked. But the real hype comes from people trying for themselves.

I seriously don't remember hearing these test results being mentioned in any casual conversation, and I heard a lot of casual conversations about AI. The majority of these center around personal experiences ("I asked ChatGPT this and I got that..."), homework is another common topic. When we compare systems, we won't say "this one got a 72 and the other got a 94", but more like "I asked new system to give me a specific piece of code (or cocktail recipe, or anything) and the result is much better". Again, personal experience and anecdotes before scores.

Maybe people in the field hype themselves with score, but not the general public, and probably not the investors either, who will most likely look at the financial performance of the likes of OpenAI instead.

bensecure2y ago

If you followed the initial announcement, then you were presumably already hyped. The novel thing about chatgpt has been the mass amount of people who hadn't heard about generative AI in the past glomming onto the technology. Most of these people heard about it via word of mouth. They then tried it themselves and told people about it. They never even heard of tests, let alone based their perception on them.

dmezzetti2y ago

This video from Yann LeCun gives a great summary on where things stand. https://www.youtube.com/watch?v=pd0JmT6rYcI

He is of the opinion the current generation transformers architecture is flawed and it will take a new generation of models to get close to the hype.

PeterisP2y ago

It's not built on high test scores - while academics do benchmark models on various tests, all the many people who built up the hype mostly did it based on their personal experience with a chatbot, not by running some long (and expensive) tests on those datasets.

The tests are used (and, despite their flaws, useful) to compare various facets of model A to model B - however, the validation whether a model is good now comes from users, and that validation really can't be flawed much - if it's helpful (or not) to someone, then it is what it is, the proof of the pudding is in the eating.

waynenilsen2y ago

This article is absurd.

> But when a large language model scores well on such tests, it is not clear at all what has been measured. Is it evidence of actual understanding? A mindless statistical trick? Rote repetition?

It is measuring how well it does _at REPLACING HUMANS_. It is hard to believe how the author clearly does not understand this. I don't care how it obtains its results.

GPT-4 is like a hyperspeed entry to mid level dev that has almost no ability to contextualize. Tools built on top of 32k will allow repo ingestion.

This is the worst it will ever be.

COAGULOPATH2y ago

>It is measuring how well it does _at REPLACING HUMANS_

It's possible to do well on a test and have no ability to do the thing the job tests for.

GPT-4 scores well on an advanced sommelier exam, but obviously cannot replace a human sommelier, because it does not have a mouth.

dartos2y ago

Which tests test specifically for “replacing humans?” That seems like a wild metric to try and capture in a test.

Also an aside:

> This is the worse it will ever be.

I hear this a lot and it really bothers me. Just because something is the worst it’ll ever be doesn’t mean it’ll get much better. There could always be a plateau on the horizon.

It’s akin to “just have faith.” A real weird sentiment that I didn’t notice in tech before 2021.

iudqnolq2y ago

GPT passed a test on the theoretical fundamentals of selling and serving wine in fancy restaurants. In a human passing such a test provides a useful signal of job suitability because people who pass it are often also capable of the physical bits, like theatrically opening wine bottles. But obviously that doesn't work for an AI.

Lots of things usefully correlate with test scores in humans but might not in an AI.

RandomLensman2y ago

It is measuring how well it does replacing humans - in those tests.

chewxy2y ago

I note something very interesting in the AI hype, and I would like someone to help explain it.

Whenever there's a news or article noting the limits of current LLM tech (especially the GPT class of models from OpenAI), there's always a comment that says something along the lines of "ah did you test it on GPT-4"?

Or if it's clear that it's the limitation of GPT-4, then you have comments along the lines of "what's the prompt?", or "the prompt is poor". Usually, it's someone who hasn't in the past indicated that they understand that prompt engineering is model specific, and the papers' point is to make a more general claim as opposed to a claim on one model.

Can anyone explain this? It's like the mere mention of LLMs being limited in X, Y, Z fashion offends their lifestyle/core beliefs. Or perhaps it's a weird form of astroturfing. To which, I ask, to what end?

TeMPOraL2y ago

> there's always a comment that says something along the lines of "ah did you test it on GPT-4"?

Perhaps because whenever there's "a news or article noting the limits of current LLM tech", it's a bit like someone tried to play a modern game on a machine they found in their parents' basement, and the only appropriate response to this is, "have you tried running it on something other than a potato"? This has been happening so often over the past few months that it's the first red flag you check for.

GPT-4 is still qualitatively ahead of all other LLMs, so outside of articles addressing specialized aspects of different model families, the claims are invalid unless they were tested on GPT-4.

(Half the time the problem is that the author used ChatGPT web app and did not even realize there are two models and they've been using the toy one.)

jacobr12y ago

As someone who has this instinct myself, there is a line of reactionism to modern AI/ML that says, "this is just a toy, look it can't do something simple." But often the case, if _can_ do that thing with a either a more advanced model, or a more built-out system. So the instinct is to try and explain that the pessimism is wrong. That we really can push the boundary and do more, even if it isn't going to work out of the box yet. I react that way against all forms of poppy snipping.

Jensson2y ago

Hyping up tech based on what you think it will be able to do in the future is the misplaced overhyping that is the problem. The issues people say are easy to fix aren't easy to fix.

Expect the model to continue to perform like it does today, and then lots of dumb integrations added to it, and you will get a very accurate prediction of how most of new tech hype turns out. Dumb integrations can't add intelligence, but it can add a lot of value, so the rational hype still sees this as a very valuable and exciting thing, but it isn't a complete revolution in its current form.

jazzyjackson2y ago

The output of any model is essentially random and whether it is useful or impressive is a coin flip. While most people get a mix of heads and tails, there are a few people at any time that are getting streaks of one head after another or vice versa.

So my perception is this leads to people who have good luck and perceive LLMs as near AGI because it arrives at a useful answer more often than not, and these people cannot believe there are others who have bad luck and get worthless output from their LLM, like someone at a roulette table exhorting "have you tried betting it all on black? worked for me!"

wrsh072y ago

1. Just like it's frustrating when a paper is published making claims that are hard to verify, it's frustrating when somebody says "x can't do y" in a way that is hard to verify^^

2. LLMs, in spite of the complaints about the research leaders, are fairly democratic. I have access to several of the best LLMs currently in existence and the ones I can't access haven't been polished for general usage anyway. If you make a claim with a prompt, it's easy for me to verify it

3. I've been linked legitimate ChatGPT prompts where someone gets incorrect data from ChatGPT - my instinct is to help them refine their prompt to get correct data

4. If you make a claim about these cool new tools (not making a claim about what they're good for!) all of these kick in. I want to verify, refine, etc.

Of course some people are on the bandwagon and it is akin to insulting their religion (it is with religious fervor they hold their beliefs!) but at least most folks on hn are just excited and trying to engage

^^ I actually think making this claim is in bad form generally. It's like looking for the existence of aliens on a planet. Absence of evidence is not evidence of absence

epups2y ago

If someone comes here and says "<insert programming language> cannot do X" and that is wrong, or perhaps outdated, don't you feel that the reaction would be similar?

If you are trying to make categorical statements about what AI is unable to do, at the very least you should use a state-of-the-art system, which conveniently is easily available for everyone.

stevenhuang2y ago

Because they're saying it can't do something when they're holding it wrong.

It's a weird thing to get hung up on if you ask me.

abm532y ago

Perhaps they are trying to help people get the best out of a tool which they themselves find very useful?

epups2y ago

I think ironically there has been an "AI-anti-hype hype", with people like Gary Marcus trying to blow up every single possible issue into a deal breaker. Most of the claims in this article are based on tests performed only on GPT-3, and researchers often seem to make tests in a way that proves their point - see an earlier comment from me here with an example: https://news.ycombinator.com/item?id=37503944

I agree there has been many attention-grabbing headlines that are due to simple issues like contamination. However, I think AI has already proved its business value far beyond those issues, as anyone using ChatGPT with a code base not present in their dataset can attest.

smcl2y ago

I think some amount of that is necessary, though no? We have people claiming that this generation of AI will replace jobs - and plenty of companies have taken the bait and tried to get started with LLM-based bots. We even had a pretty high-profile case of a Google AI engineer going public with claims that their LaMDA AI was sentient. Regardless of what you think of that individual or Google's AI efforts, this resonates with the public. Additionally a pretty common sentiment I've seen has been non-tech people suggesting AI should handle content moderation - the idea being that since they're not human and don't have "feelings" they won't have biases and won't attempt to "silence" any one political group (without realising that bias can be built in via the training data).

It seems pretty important to counter that and to debunk any wild claims such as these. To provide context and to educate the world on their shortcomings.

epups2y ago

I think skepticism is always welcome and we should continue to explore what LLM's can and cannot do. However, what I'm referring to is trying to get a quick win by defeating some inferior version of GPT or trying to apply a test which you don't even expect most humans to pass.

The article is actually fine and pretty balanced, but it is a bit unfortunate that 80% of their examples are not illustrative of current capabilities. At least for me, most of my optimism about the utility of LLM's comes from GPT-4 specifically.

bondarchuk2y ago

>But there’s a problem: there is little agreement on what those results really mean. Some people are dazzled by what they see as glimmers of human-like intelligence; others aren’t convinced one bit.

I find the whole hype & anti-hype dynamic so tiresome. Some are over-hyping, others are responding with over-anti-hyping. Somewhere in-between are many reasonable, moderate and caveated opinions, but neither the hypesters or anti-hypesters will listen to these (considering all of them to come from people at the opposite extreme), nor will outside commentators (somehow being unable to categorize things as anything more complicated than this binary).

Closi2y ago

Depends if the hype is invalid - Let's remember that "There will be a computer in every home!" was once considered hype.

There is a possible world where AI will be a truly transformative technology in ways we can't possibly understand.

There is a possible world where this tech fizzles out.

So one of the reasons that there is a broad 'hype' dynamic here is because the range of possibilities is broad.

I sit firmly in the first camp though - I believe it's truly a transformative technology, and struggle to see the perspective of the 'anti-hype' crowd.

TerrifiedMouse2y ago

I’m in the second camp. To every hyped up tech, all I can say is “prove it”. Give me actual real world results.

There are millions of hustlers out there pushing snake oil. The probability that something is the real deal and not snake oil is small. Better to assuming the glass is half empty.

1 more reply

mcguire2y ago

"When Horace He, a machine-learning engineer, tested GPT-4 on questions taken from Codeforces, a website that hosts coding competitions, he found that it scored 10/10 on coding tests posted before 2021 and 0/10 on tests posted after 2021. Others have also noted that GPT-4’s test scores take a dive on material produced after 2021. Because the model’s training data only included text collected before 2021, some say this shows that large language models display a kind of memorization rather than intelligence."

I'm sure that is just a matter of prompt engineering, though.

COAGULOPATH2y ago

But it got 10/10 on pre-2021 questions, with the same prompting method...

robertlagrant2y ago

> AI hype is built on high test scores

No, it's built on people using DALLE and Midjourney and ChatGPT.

yCombLinks2y ago

Exactly, chatpgt is double checking my homework problems and pointing out my errors, it's teaching me the material better than any of my lectures. It's writing tons of code I'm getting paid for, with way less overhead than trying to explain the problem to a junior, less mistakes and faster iteration. Test scores, ridiculous

nojvek2y ago

‘Pre-training on the Test Set Is All You Need‘

GPT-4 is really smart to dig information it has seen before, but please don’t use it for any serious reasoning. Always take the answer with a grain of salt.

refulgentis2y ago

This is my favorite new AI argument, took me a few months to see it. Enjoyed it at first.

You start with everyone knows there's AI hype from tech bros. Then you introduce a PhD or two at institutions with good names. Then they start grumbling about anthropomorphizing and who knows what AI is anyway.

Somehow, if it's long enough, you forget that this kind of has nothing to do with anything. There is no argument. Just imagining other people must believe crazy things and working backwards from there to find something to critique.

Took me a bit to realize it's not even an argument, just parroting "it's a stochastic parrot!" Assumes other people are dunces and genuinely believe it's a minihuman. I can't believe MIT Tech Review is going for this, the only argument here is the tests are flawed if you think they're supposed to show the AI model is literally human.

MrYellowP2y ago

I disagree entirely.

The hype is based entirely on the fact that I can talk (in text) to a machine and it responds like a human. It might sometimes make up stuff, but so do humans. I therefore don't consider that a significant downside, or problem. In the end chatgpt is still ... a baby.

The hype builds around the fact that I can run a language model that fits into my graphics cards and responds at faster-than-typing speed, which is sufficient.

The hype builds around the fact that it can create and govern whole text based games for me, if I just properly ask it to do so.

The hype builds around the fact that I can have this everywhere with me, all day long, whenever I want. It never grows tired, it never stops answering, it never scoffs at me, it never hates me, it never tells me that I'm stupid, it never tells me that I'm not capable of doing something.

It always teaches me, always offers me more to learn, it always is willingly helping me, it never intentionally tries to hide the fact that it doesn't know something and never intentionally tries to impress me just to get something from me.

Can it get things wrong? Sure! Happens! Happens to everyone. Me, you, your neighbour, parents, teachers, plumbers.

Not a single minute did I, or dozens of millions of others, give a single flying fuck about test scores.

janalsncm2y ago

The only test I need is the amount of time it takes me to do common tasks with and without ChatGPT. I’m aware it’s not perfect but perfect was never necessary.

derbOac2y ago

This was interesting to me but mostly because of a question I thought it was going to focus on, which is how should we interpret these tests when a human takes it?

I wasn't sure that the phenomena they discussed was as relevant to the question of whether AI is overhyped as they made it out to be, but I did think a lot of questions about the meaning of the performances were important.

What's interesting to me is you could flip this all on its head and, instead of asking "what can we infer about the machine processes these test scores are measuring?", we could ask "what does this imply about the human processes these test scores are measuring?"

A lot of these test are well-validated but overinterpreted I think, and leaned on too heavily to make inferences about people. If a machine can pass a test, for instance, what does it say about the test as used in people? Should we be putting as much weight on them as we do?

I'm not arguing these tests are useless or something, just that maybe we read into them too much to begin with.

Cloudef2y ago

AI is honestly wrong word to use. These are ML models and they are able to only do the task they have been specifically trained for (not saying the results aren't impressive!). There really isn't competition either as the only people who can train these giant models are those who have the cash.

TeMPOraL2y ago

> These are ML models and they are able to only do the task they have been specifically trained for

Yes, but the models we're talking about have been trained specifically on the task of "complete arbitrary textual input in a way that makes sense to humans", for arbitrary textual input, and then further tuned for "complete it as if you were a person having conversation with a human", again for arbitrary text input - and trained until they could do so convincingly.

(Or, you could say that with instruct fine-tuning, they were further trained to behave as if they were an AI chatbot - the kind of AI people know from sci-fi. Fake it 'till you make it, via backpropagation.)

In short, they've been trained on an open-ended, general task of communicating with humans using plain text. That's very different to typical ML models which are tasked to predict some very specific data in a specialized domain. It's like comparing a Python interpreter to Notepad - both are just regular software, but there's a meaningful difference in capabilities.

As for seeing glimpses of understanding in SOTA LLMs - this makes sense under the compression argument: understanding is lossy compression of observations, and this is what the training process is trying to force to happen, squeezing more and more knowledge into a fixed set of model weights.

Cloudef2y ago

Yes, this is why I think the LLM and image generation models are still impressive. Knowing they are ML models in the end and still produce a results that surprise us, makes you wonder what we are in the end. Could we essentially simulate something similar to us given enough inputs and parameters in the network, with enough memory, computing power and a training process that would aim to simulate a human with emotions. I would imagine the training process alone would need bunch of other models to teach the final model "concepts" and from there perhaps "reasoning".

Why I think AI is not the appropriate term is that if it were AI, the AI would have already figured everything out for us (or for itself). LLM can only chain text, it does not really understand the content of the text, and can't come up with new novel solutions (or if it accidentally does, it's due to hallucination), this can be easily confirmed by giving current LLMs some simple puzzles, math problems and so on.. Image models have similar issues.

pixl972y ago

>AI is honestly wrong word to use

https://en.wikipedia.org/wiki/AI_effect

Just because you don't like how poorly the term AI is defined, doesn't mean it is the wrong term.

AI can never be well defined because the word intelligence itself is not well defined.

javier_e062y ago

As a developer when I work with ChatGPT I can see ChatGPT eventually taking over my JIRA stories. Then ChatGPT will take over management creating product roadmaps, prioritizing and assigning tasks to itself. All dictated by customer feedback. The clock is ticking. But reasoning like a human? No.

Garvi2y ago

Counterpoint: Journalism is dead and has been replaced with algorithms that supply articles on a supply and demand basis.

"25% of the potential target audience dislikes AI and do not have their opinion positively represented in the media they consume. The potential is unsaturated. Maximum saturation estimated at 15 articles per week."

A bit more serious: AI hasn't even scratched the surface. Once we apply LLMs to speech synth and improve the visual generators by just a tiny bit, to fix faces, we can basically tell the AI to "create the best romantic comedy ever made".

"Oh, and repeat 1000 times, please".

rvz2y ago

Most of the hype comes from the AI grifters who need to find the next sucker to dump their VC shares onto to the next greater fool to purchase their ChatGPT-wrapper snake oil project to at an overvalued asking price.

The ones who have to dismantle the hype are the proper technologies such as Yann LeCun and Grady Booch who know exactly what they are talking about.

rvz2y ago

*technologists

rahimnathwani2y ago

  “People have been giving human intelligence tests—IQ tests and so on—to machines since the very beginning of AI,” says Melanie Mitchell, an artificial-intelligence researcher at the Santa Fe Institute in New Mexico. “The issue throughout has been what it means when you test a machine like this. It doesn’t mean the same thing that it means for a human.”

The last sentence above is an important point that most people don't consider.

api2y ago

It seems a bit like having a human face off in a race against a car and then concluding that cars have exceeded human physical dexterity.

It's not an apples/apples comparison. The nature of the capability profile of a human vs. any known machine is radically different. Machines are intentionally designed to have extreme peaks of performance in narrow areas. Present-generation AI might be wider in its capabilities than what we've previously built, but it's still rather narrow as you quickly discover if you start trying to use it on real tasks.

aldousd6662y ago

Only idiots are basing their excitement about what's possible on those test scores. They're just an attempt to measure one bot against another. There is a strong possibility that they are only measuring how well the bot takes the test, and nothing at all about what the tests themselves purport to measure. I mean, those tests are probably similar to stuff that's in the training data.

ehutch792y ago

Yeah... there's a lot of idiots out there.

aidenn02y ago

Any task that gets solved with AI retroactively becomes something that doesn't require reasoning.

janalsncm2y ago

I wouldn’t say that. Chess certainly requires reasoning even if that reasoning is minimax.

I suppose in the context of this article “AI” means statistical language models.

nomel2y ago

Why does chess require reasoning? Do all of the these [1] "reason"? ChatGPT-4 is supposedly rated worse than 500, in this list (1400 or so, although I think a recent update improved it a bit).

[1] https://ccrl.chessdom.com/ccrl/4040/

1 more reply

Kalanos2y ago

Didn't it perform well on both the SAT and LSAT though?

yieldcrv2y ago

This was 2 months ago, irrelevant in AI time

j / k navigate · click thread line to collapse

230 comments

mg2y ago

I don't think the "hype" is built on test scores.

Just two years ago, I was mesmerized by GPT-3's ability to understand concepts:

https://twitter.com/marekgibney/status/1403414210642649092

Nowadays, using it daily in a productive fashion feels completely normal.

It's hard to imagine where we will be in 20 years.

rob742y ago

denton-scratch2y ago

> And that's because, despite all of its training data, it's not capable of actually reasoning. That may change in the future [...]

1 more reply

kromem2y ago

That's not the case. It's very much in the realm of "we don't know what's going on in the network."

For something as complex as SotA neural networks, binary sweeping statements seem rather unlikely to actually be representative...

1 more reply

YetAnotherNick2y ago

> it's not capable of actually reasoning

Define reasoning. Because in my definition GPT 4 could reason without doubt. It definitely can't reason better than experts in the field, but it could reason better than say interns.

6 more replies

smaddox2y ago

> And that's because, despite all of its training data, it's not capable of actually reasoning.

Your conclusion doesn't follow from your premise.

3 more replies

dinosaurdynasty2y ago

jug2y ago

CryptoBojack2y ago

Isn't that the same as for humans? If you are speaking with me (prompting), my answers will be differents, based on how you prompted me for an answer.

denton-scratch2y ago

> I was mesmerized by GPT-3's ability to understand concepts

This language embodies the anthropomorphic assumptions that the author is attacking.

dboreham2y ago

2 more replies

gmerc2y ago

This.

We are in a Cambrian Explosion on the software side and hardware hasn’t yet reacted to it. There’s a few years of mad discovery in front of us.

People have different impressions as to the shape of the curve that’s going up and right, but only a fool would not stop and carefully take what is happening.

kossTKR2y ago

Exactly and things are actually getting crazy now. Pardon the tangent but for some reason this hasn't reached the frontpage on HN yet: https://github.com/OpenBMB/ChatDev

This coupled with access to linux containers that can be spawned on demand, we are in for a wild ride!

2 more replies

adrians12y ago

> If the speed of improvement stays anywhere near the level it was the last two years, then over the next two decades, it will lead to massive changes in how we work and which skills are valuable.

mg2y ago

Why do you think so?

Technological progress seems rather accelerated than diminishing to me.

Computers are a great example: They have been getting more capable exponentially over the last decades.

In terms of performance (memory, speed, bandwidth) and in terms of impact. First we had calculators, then we had desktop applications, then the Internet and now we have AI.

And AI will help us get to the next stage even faster.

1 more reply

james-revisoai2y ago

huijzer2y ago

RC_ITR2y ago

You can check my post history to see how unpopular this point of view is, but the big "reveal" that will come up is as follows:

The way that LLMs and humans "think" is inherently different. Giving an LLM a test designed for humans is akin to giving a camera a 'drawing test.'

A camera can make a better narrow final output than a human, but it cannot do the subordinate tasks that a human illustrator could, like changing shadings, line width, etc.

An LLM can answer really well on tests, but it often fails at subordinate tasks like 'applying symbolic reasoning to unfamiliar situations.'

Eventually the thinking styles may converge in a way that makes the LLMs practically more capable than humans on those subordinate tasks, but we are not there yet.

ulnarkressty2y ago

AI is getting subjectively better, and we need better tests to figure out if this improvement is objectively significant or not.

tuatoru2y ago

> Most of the improvements apparently come from training larger models with more data.

Nevermark2y ago

That’s backwards.

Training a model on more data improves generalization not memorization.

To store more information in the same number of parameters requires the commonality between examples to be encoded.

In contrast, the less data trained on, especially if repeated, lets the network learn to provide good answers for that limited set without generalizing. I.e. memorizing.

——

It’s the same as with people. The more variations people see of something, the more likely they intuit the underlying pattern.

The fewer examples, the more likely they just pattern match.

1 more reply

qup2y ago

I pretty much want the LLM to be great at memorizing things. That's what I'm not great at.

If it had perfect recall I would be so thrilled.

danielmarkbruce2y ago

COAGULOPATH2y ago

What's the benefit of doing this vs copying one of the many (far superior) Javascript mp3 players on the internet, such as here?

https://freefrontend.com/javascript-music-players/

roenxi2y ago

kristopolous2y ago

Almost nothing happened in AI for about 50 years. That's the normal in the field.

happycube2y ago

I got curious and did this myself. Needed a bit of nudging to get where I wanted, but I even had it make an Electron wrapper:

https://chat.openai.com/share/29d695e6-7f23-4f03-b2be-29b7c9...

dchuk2y ago

This is awesome, thanks for sharing.

Or is that effectively what Copilot/cursor do and I’m just a bad operator?

2 more replies

mensetmanusman2y ago

Jensson2y ago

> Two years ago it didn't even dawn on me that this would be my way of writing software in the near future

mg2y ago

Copilot was not around when I wrote the Tweet.

The whole thing is just starting.

1 more reply

isaacfung2y ago

Whisper, Stable Diffusion, VoiceBox, GPT4 vision, DALL.E3

Other breakthroughs in graph machine learning https://towardsdatascience.com/graph-ml-in-2023-the-state-of...

1 more reply

kfk2y ago

huijzer2y ago

> Tech, as usual, is very happy to feed the hype

I agree completely with you on this.

kfk2y ago

1 more reply

padjo2y ago

It’s rational herd dynamics for the execs. Going against the herd and being wrong is a career ender. Going with the herd and being wrong will be neutral at worst.

mr_toad2y ago

Going against the herd and being right can also be a career ender.

jacobr12y ago

janalsncm2y ago

hashtag-til2y ago

Agreed. I think there is a FOMO phenomena among C-level execs, that is generating a gigantic waste os money and time, creating distractions around “AI strategy”.

It started a few years back and it is now really inflamed with LLM, because of the consumer level hype and general media reporting about it.

You can perceive that by the multiple AI startups capturing millions in VC capital for absolutely bogus value proposition. Bizarre!

epups2y ago

soco2y ago

3 more replies

hdjjhhvvhga2y ago

> because, maybe they get a promotion?

dboreham2y ago

> AI hype is really problematic in Enterprise.

This only appears so because we here have some insight into the domain. But there have always been hype cycles. We just didn't notice them so readily.

The speed with which this happens makes me suspect there is a hidden "generic hype army" that was already in place, presumably hyping the last thing, and ready to jump on this thing.

chasd002y ago

afro882y ago

gmerc2y ago

In Capitalism, you grow or you die and sometimes you need to bullshit people about growth potential to buy yourself time

hashtag-til2y ago

Yes, sad but true.

JaDogg2y ago

This is exactly correct.

danielvaughn2y ago

I remember watching a documentary about an old blues guitar player from the 1920's. They were trying to learn more about him and track down his whereabouts during certain periods of his life.

mattkrause2y ago

> it's easy to understand why it died out

iFire2y ago

Selfies and 15 second videos still exist as shorts and tiktoks.

kenjackson2y ago

What died out? Film?

danielvaughn2y ago

1 more reply

savanaly2y ago

>What died out?

The custom of showing film consisting of footage of the general public in movie theaters.

actionfromafar2y ago

Showing locals little movie clips of themselves in intermissions at the local theater.

randcraw2y ago

TeMPOraL2y ago

freejazz2y ago

Can you recommend any books on this?

iambateman2y ago

This really is a good article, and is seriously researched. But the conclusion in the headline - “AI hype is built on flawed test scores” - feels like a poor summary of the article.

It _is_ correct to say that an LLM is not ready to be a medical doctor, even if it can pass the test.

But I think a better conclusion is that test scores don’t help us understand LLM capabilities like we think they do.

Using a human test for an LLM is like measuring a car’s “muscles” and calling it horsepower. They’re just different.

But the AI hype is justified, even if we struggle to measure it.

dleslie2y ago

That's why I'm hyped. If it's that good for me, and it's generalizable, then it's going to rock the world.

thomasfromcdnjs2y ago

Life longer programmer, and same sentiments, I use it everywhere I can.

airstrike2y ago

Coding on something without copilot these days feels like having my hands tied. I'm looking at you, XCode and Colab...

GuB-422y ago

I don't think test scores have anything to do with the hype. Most people don't even realize test scores exist.

Note: here AI = machine learning with neural networks, which is what the hype is about. AI is a vague term that can mean just about anything.

Jensson2y ago

> I don't think test scores have anything to do with the hype. Most people don't even realize test scores exist.

I don't think many who are hyped about these things missed that.

https://openai.com/research/gpt-4

GuB-422y ago

bensecure2y ago

dmezzetti2y ago

This video from Yann LeCun gives a great summary on where things stand. https://www.youtube.com/watch?v=pd0JmT6rYcI

He is of the opinion the current generation transformers architecture is flawed and it will take a new generation of models to get close to the hype.

PeterisP2y ago

waynenilsen2y ago

This article is absurd.

> But when a large language model scores well on such tests, it is not clear at all what has been measured. Is it evidence of actual understanding? A mindless statistical trick? Rote repetition?

It is measuring how well it does _at REPLACING HUMANS_. It is hard to believe how the author clearly does not understand this. I don't care how it obtains its results.

GPT-4 is like a hyperspeed entry to mid level dev that has almost no ability to contextualize. Tools built on top of 32k will allow repo ingestion.

This is the worst it will ever be.

COAGULOPATH2y ago

>It is measuring how well it does _at REPLACING HUMANS_

It's possible to do well on a test and have no ability to do the thing the job tests for.

GPT-4 scores well on an advanced sommelier exam, but obviously cannot replace a human sommelier, because it does not have a mouth.

dartos2y ago

Which tests test specifically for “replacing humans?” That seems like a wild metric to try and capture in a test.

Also an aside:

> This is the worse it will ever be.

I hear this a lot and it really bothers me. Just because something is the worst it’ll ever be doesn’t mean it’ll get much better. There could always be a plateau on the horizon.

It’s akin to “just have faith.” A real weird sentiment that I didn’t notice in tech before 2021.

iudqnolq2y ago

Lots of things usefully correlate with test scores in humans but might not in an AI.

RandomLensman2y ago

It is measuring how well it does replacing humans - in those tests.

chewxy2y ago

I note something very interesting in the AI hype, and I would like someone to help explain it.

TeMPOraL2y ago

> there's always a comment that says something along the lines of "ah did you test it on GPT-4"?

GPT-4 is still qualitatively ahead of all other LLMs, so outside of articles addressing specialized aspects of different model families, the claims are invalid unless they were tested on GPT-4.

(Half the time the problem is that the author used ChatGPT web app and did not even realize there are two models and they've been using the toy one.)

jacobr12y ago

Jensson2y ago

Hyping up tech based on what you think it will be able to do in the future is the misplaced overhyping that is the problem. The issues people say are easy to fix aren't easy to fix.

jazzyjackson2y ago

wrsh072y ago

1. Just like it's frustrating when a paper is published making claims that are hard to verify, it's frustrating when somebody says "x can't do y" in a way that is hard to verify^^

3. I've been linked legitimate ChatGPT prompts where someone gets incorrect data from ChatGPT - my instinct is to help them refine their prompt to get correct data

4. If you make a claim about these cool new tools (not making a claim about what they're good for!) all of these kick in. I want to verify, refine, etc.

^^ I actually think making this claim is in bad form generally. It's like looking for the existence of aliens on a planet. Absence of evidence is not evidence of absence

epups2y ago

If someone comes here and says "<insert programming language> cannot do X" and that is wrong, or perhaps outdated, don't you feel that the reaction would be similar?

If you are trying to make categorical statements about what AI is unable to do, at the very least you should use a state-of-the-art system, which conveniently is easily available for everyone.

stevenhuang2y ago

Because they're saying it can't do something when they're holding it wrong.

It's a weird thing to get hung up on if you ask me.

abm532y ago

Perhaps they are trying to help people get the best out of a tool which they themselves find very useful?

epups2y ago

smcl2y ago

It seems pretty important to counter that and to debunk any wild claims such as these. To provide context and to educate the world on their shortcomings.

epups2y ago

bondarchuk2y ago

Closi2y ago

Depends if the hype is invalid - Let's remember that "There will be a computer in every home!" was once considered hype.

There is a possible world where AI will be a truly transformative technology in ways we can't possibly understand.

There is a possible world where this tech fizzles out.

So one of the reasons that there is a broad 'hype' dynamic here is because the range of possibilities is broad.

I sit firmly in the first camp though - I believe it's truly a transformative technology, and struggle to see the perspective of the 'anti-hype' crowd.

TerrifiedMouse2y ago

I’m in the second camp. To every hyped up tech, all I can say is “prove it”. Give me actual real world results.

There are millions of hustlers out there pushing snake oil. The probability that something is the real deal and not snake oil is small. Better to assuming the glass is half empty.

1 more reply

mcguire2y ago

I'm sure that is just a matter of prompt engineering, though.

COAGULOPATH2y ago

But it got 10/10 on pre-2021 questions, with the same prompting method...

robertlagrant2y ago

> AI hype is built on high test scores

No, it's built on people using DALLE and Midjourney and ChatGPT.

yCombLinks2y ago

nojvek2y ago

‘Pre-training on the Test Set Is All You Need‘

GPT-4 is really smart to dig information it has seen before, but please don’t use it for any serious reasoning. Always take the answer with a grain of salt.

refulgentis2y ago

This is my favorite new AI argument, took me a few months to see it. Enjoyed it at first.

MrYellowP2y ago

I disagree entirely.

The hype builds around the fact that I can run a language model that fits into my graphics cards and responds at faster-than-typing speed, which is sufficient.

The hype builds around the fact that it can create and govern whole text based games for me, if I just properly ask it to do so.

Can it get things wrong? Sure! Happens! Happens to everyone. Me, you, your neighbour, parents, teachers, plumbers.

Not a single minute did I, or dozens of millions of others, give a single flying fuck about test scores.

janalsncm2y ago

The only test I need is the amount of time it takes me to do common tasks with and without ChatGPT. I’m aware it’s not perfect but perfect was never necessary.

derbOac2y ago

This was interesting to me but mostly because of a question I thought it was going to focus on, which is how should we interpret these tests when a human takes it?

I'm not arguing these tests are useless or something, just that maybe we read into them too much to begin with.

Cloudef2y ago

TeMPOraL2y ago

> These are ML models and they are able to only do the task they have been specifically trained for

Cloudef2y ago

pixl972y ago

>AI is honestly wrong word to use

https://en.wikipedia.org/wiki/AI_effect

Just because you don't like how poorly the term AI is defined, doesn't mean it is the wrong term.

AI can never be well defined because the word intelligence itself is not well defined.

javier_e062y ago

Garvi2y ago

Counterpoint: Journalism is dead and has been replaced with algorithms that supply articles on a supply and demand basis.

"Oh, and repeat 1000 times, please".

rvz2y ago

The ones who have to dismantle the hype are the proper technologies such as Yann LeCun and Grady Booch who know exactly what they are talking about.

rvz2y ago

*technologists

rahimnathwani2y ago

  “People have been giving human intelligence tests—IQ tests and so on—to machines since the very beginning of AI,” says Melanie Mitchell, an artificial-intelligence researcher at the Santa Fe Institute in New Mexico. “The issue throughout has been what it means when you test a machine like this. It doesn’t mean the same thing that it means for a human.”

The last sentence above is an important point that most people don't consider.

api2y ago

It seems a bit like having a human face off in a race against a car and then concluding that cars have exceeded human physical dexterity.

aldousd6662y ago

ehutch792y ago

Yeah... there's a lot of idiots out there.

aidenn02y ago

Any task that gets solved with AI retroactively becomes something that doesn't require reasoning.

janalsncm2y ago

I wouldn’t say that. Chess certainly requires reasoning even if that reasoning is minimax.

I suppose in the context of this article “AI” means statistical language models.

nomel2y ago

Why does chess require reasoning? Do all of the these [1] "reason"? ChatGPT-4 is supposedly rated worse than 500, in this list (1400 or so, although I think a recent update improved it a bit).

[1] https://ccrl.chessdom.com/ccrl/4040/

1 more reply

Kalanos2y ago

Didn't it perform well on both the SAT and LSAT though?

yieldcrv2y ago

This was 2 months ago, irrelevant in AI time

j / k navigate · click thread line to collapse