The model provides surprisingly good responses on topics I know are covered extensively online but where the exact information I want is troublesome to track down. I have even found it useful when I know there is a tool for what I want but can’t recall the jargon needed to find it via Google. Simply describing the rough idea is enough to get the model to spit out the jargon I need.
However, the moment I ask a real question that goes beyond summarizing something which is covered thousands of times online, I am immediately let down.
Is this just a result of the foundation of the model being the world’s best autocompletion engine? My assessment is “yes”, and I don’t believe that any of the modifications coming, like plugins, will fundamentally change this.
Furthermore, I just don’t feel like the transformer architecture is suited for problem solving. I may just be a charlatan, but self-attention over the space of words does not seem like it’s going to be enough, and praying that problem solving falls out as emergent behavior if we just add more parameters is… unscientific-ish? Now, if you could figure out a way to do self-attention over the space of concepts? Maybe you’ve got something.
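For anyone who hasn't looked under the hood, "self-attention over the space of words" boils down to something like the numpy sketch below. It's purely illustrative and the names and shapes are my own; real transformers add multiple heads, masking, residuals, and many stacked layers:

    # Minimal sketch of scaled dot-product self-attention over token embeddings.
    # Illustrative only; not taken from any particular model.
    import numpy as np

    def self_attention(x, w_q, w_k, w_v):
        """x: (seq_len, d_model) token embeddings; w_*: (d_model, d_head) projections."""
        q = x @ w_q                                 # queries
        k = x @ w_k                                 # keys
        v = x @ w_v                                 # values
        scores = q @ k.T / np.sqrt(k.shape[-1])     # how much each token attends to each other token
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the *word* positions
        return weights @ v                          # each output is a mixture of value vectors

    rng = np.random.default_rng(0)
    seq_len, d_model, d_head = 5, 16, 8
    x = rng.normal(size=(seq_len, d_model))         # stand-in for 5 word embeddings
    w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    print(self_attention(x, w_q, w_k, w_v).shape)   # (5, 8)

The whole mechanism mixes vectors that represent word positions, which is why "attention over concepts" would be a genuinely different thing.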
I feel like AlphaGo-style ideas and some variation on MCTS are more likely to produce a solid problem-solving architecture.
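To make the comparison concrete, here is a toy sketch of plain UCT-style MCTS on a Nim-like game (take 1-3 stones, last stone wins). This is only the search skeleton, written from scratch for illustration; AlphaGo replaces the random rollouts with learned policy and value networks:

    import math, random

    class Node:
        def __init__(self, pile, player, parent=None, move=None):
            self.pile, self.player = pile, player          # stones left, player to move (0 or 1)
            self.parent, self.move = parent, move
            self.children, self.visits, self.wins = [], 0, 0.0
            self.untried = [m for m in (1, 2, 3) if m <= pile]

    def rollout(pile, player):
        """Random playout; whoever takes the last stone wins."""
        while True:
            pile -= random.choice([m for m in (1, 2, 3) if m <= pile])
            if pile == 0:
                return player
            player = 1 - player

    def best_move(pile, player=0, iters=3000):
        root = Node(pile, player)
        for _ in range(iters):
            node = root
            # 1. Selection: walk down via UCB1 while fully expanded.
            while not node.untried and node.children:
                node = max(node.children, key=lambda c: c.wins / c.visits
                           + math.sqrt(2 * math.log(node.visits) / c.visits))
            # 2. Expansion: add one previously untried move.
            if node.untried:
                m = node.untried.pop()
                child = Node(node.pile - m, 1 - node.player, parent=node, move=m)
                node.children.append(child)
                node = child
            # 3. Simulation: terminal nodes are decided; otherwise play out randomly.
            winner = 1 - node.player if node.pile == 0 else rollout(node.pile, node.player)
            # 4. Backpropagation: credit wins to the player who moved into each node.
            while node:
                node.visits += 1
                node.wins += (winner == 1 - node.player)
                node = node.parent
        return max(root.children, key=lambda c: c.visits).move

    print(best_move(10))   # optimal play is to take 2, leaving a multiple of 4

The appeal is that the search explicitly explores and evaluates alternatives instead of committing to whatever token comes out next.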
I'm very sure I said this from the start, against the ridiculous hype. Summarization of existing text is the *only* safe use case for LLMs. Anything else is asking for disappointment.
We have already seen it used as a search engine, where it confidently hallucinates incorrect information. We have seen it pretend to be a medical professional or a replacement attorney, and it has outright regurgitated nonsensical and dangerous advice, making it completely unreliable for those use cases. And (deep) neural networks in general are still the same black boxes, unable to explain or reason about their own decisions, which makes them unsuitable for high-risk applications and use cases.
As for writing code: despite what the hype squad tells you about GPT-4 and ChatGPT, the ground reality is that they generate broken code from the start and cannot reason about why they did so. Non-programmers wouldn't question the output, whereas an experienced professional would catch the errors immediately.
Due to this untrustworthiness, programmers now have to check and review every piece of output GPT-4 and ChatGPT generate before using it in their projects.
The AI/LLM hype has only further exposed these limitations.
Where GPT-4 shines for me is when I have a project swimming around in my head that I want to work on for fun. It can get you off the ground quickly, and for side projects the quality and correctness of the output isn't that important.
For professional software development, GPT-4 is still wrong way too often for me to feel comfortable using it. And it's not all that much faster than going straight to the source anyway.
If ChatGPT is too successful and people stop producing content as a result, it might end up in a local optimum that isn't so optimal.
I still use stack overflow regularly as an engineer.
Sometimes GPT-4 will have a quicker tailor-fit answer, but sometimes it will flounder as well.
This is the saddest version of ChatGPT I can imagine. I found that as search engines emulated natural language, their results got steadily worse.
I just want the Google results and interface from a long time ago.
Why? I'm not sure how you could expect anything else in the first place.
> ChatGPT and GPT-4 do relatively well on well-known datasets […] however, the performance drops significantly when handling newly released and out-of-distribution [datasets]. Logical reasoning remains challenging for ChatGPT and GPT-4
But reading the paper, the challenges it is failing on are ones that I'd wager the average human would fail on too (at least a good portion of the time).
The paper might strictly be accurate, but I think we should try and bring these papers back to a real-world context - which is that it’s probably operating above your average human at these tasks.
Is superhuman/genius-level capability really required before we say the LLMs are any good?
(I see this view on HN too - statements like ‘LLMs can’t create novel maths theorems!’ as an argument that LLMs aren’t good at reasoning, disregarding that most humans today can’t find novel/undiscovered maths theorems)
However it's important to note one VERY important thing -- this is not a system that is designed to reason! At all, as far as I know. That just fell out of its ability for language somehow. So to just accidentally be able to reason like a 4-year-old human (who is vastly more clever than the adult of any other animal species I'm aware of) is incredibly impressive, and I think the next obvious step is to couple this tech with some classic computing, which has far exceeded human capabilities for logic and reason for decades already. If ChatGPT had some secondary system for reasoning and just used the LLM for setting up problems and reading results, I think it could reach superhuman levels of reasoning quite easily.
For what it's worth, neither are we, really. Not disagreeing with anything you're saying, just musing.
> superhuman levels of reasoning
This one has always stumped me a bit though. I'm not quite sure what that looks like. Laplace's Demon?
People want it to perform better than any expert human at any possible subject before it's considered "real AI". It isn't enough for critics for it to be better than the average person at virtually everything it's put to the test on.
It seems like there is some resentment and almost anger at this technology, particularly with the artistic AIs like Midjourney. I can understand that more readily, but what's the real beef with ChatGPT?
People seem to have a real tough time accepting that human brains might not be that special. They see things like GPT-4, and tend to fall into soothing mental traps to rationalize that innate but baseless rejection. I actually view all the sustained anger and resentment as a signal that we are making meaningful inroads into AGI, as it means that people are actually being impacted.
One of the most common mental traps is "It's just fancy autocomplete." People tend to stop there and don't proceed to consider that the veracity of that claim is irrelevant. Autocomplete or not, GPT-4 seems to be able to provide meaningful assistance to certain workflows that were previously only within the bounds of human cognition.
> People want it to perform better than any expert human at any possible subject before it's considered "real AI". It isn't enough for critics for it to be better than the average person at virtually everything it's put to the test on.
It's quite amusing that some people have moved their goalposts to "well it's not a superintelligence, therefore it's worthless". Simultaneously, it's highly depressing, because it means various actors will likely achieve AGI while the rest of us are still bickering about autocomplete and Chinese rooms.
We don't exactly know what it can and can't do, a property which in a computer program at any rate is deeply mysterious and unusual. It initially gives the appearance of being a human which knows everything. This leads a lot of people to angrily declare that its appearance is deceptive, and in searching for words to describe in exactly what way it falls short, they incorporate flawed intuition on what it is capable of. So there's a lot of back and forth right now as we collectively swap memes to try and make sense of such a dramatic development.
And many of those things are worse at their task than a person, except for speed and scaling. Can a face-recognition machine be fooled by dazzle in a way a human wouldn't be? Sure, but nobody is willing to pay for a human to go through everyone's photo albums...
But does ChatGPT "use AI" as a tool in the same sense that Spotify's recommendations "use AI" or is it "an AI" in the sense that it's an independent consciousness/agent?
This is the first time so many people have disagreed on that part. And that skews the debate into "a person is better" versus talking about whether a person is even practical in most of the situations we'd use this in.
But in real-world contexts, there are some tasks that just about anyone could do, others where "average" human performance isn't good enough and you need to hire an expert, and also some jobs that can only be done by machine.
So it seems like the bar should be set based on what you think is necessary for whatever practical application you have in mind?
If it's just a game, beating an average chess player, someone who is really good, or the best in the world are different milestones. And for chess there is the Elo rating system that lets you answer this more precisely, too.
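For reference, the Elo expected-score formula is simple enough to sanity-check a milestone against; a quick sketch with made-up ratings:

    # Expected score for player A against player B under the Elo model.
    def elo_expected(r_a, r_b):
        return 1 / (1 + 10 ** ((r_b - r_a) / 400))

    # e.g. an average club player (~1500) vs. a grandmaster (~2600):
    print(round(elo_expected(1500, 2600), 4))   # roughly 0.0018: almost no chance

That kind of calibrated, task-specific scale is exactly what most "is it good enough?" arguments about LLMs are missing.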
A paper about how well chatbots do on some reasoning tests can't answer this for you.
When they conclude that GPT-4 "does not perform astonishingly well" - what is this compared to?
They never define what 'doing well' looks like, were not able to identify an application that does better than GPT-4, and were also not able to say what a human benchmark would be on the same task.
I can say, though, that I read the sample question and got it wrong too, so these aren't trivial questions we are giving GPT-4.
So based on this, I just don't really understand how they can support their conclusion that it "does not perform astonishingly well".
Probably the best feature of GPT-4 is the ability to use tools. For example, it may not be that good at calculating things, but it can use a calculator. And if you think about it, a lot of people (including mathematicians) aren't actually that good at calculating either. We all learn it in school and then we forget much of it. That's why we have calculators. It's not a big deal.
GPT-4 is more than capable of knowing the best tool for the job, and figuring out how to use it isn't that hard. You can actually ask it "what's the best tool for X", get a usable response, and then ask a follow-up question to produce a script in the language of your choosing that demonstrates how to use it, complete with unit tests. Not a hypothetical: I've been doing that for the past few weeks and getting some usable results.
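The loop behind this kind of tool use isn't complicated either. A deliberately stripped-down, hypothetical sketch; call_llm and the "CALC:" convention are stand-ins of my own, not any vendor's actual API:

    import ast, operator

    # Hypothetical stand-in for a chat-completion call; plug in a real client here.
    def call_llm(messages):
        raise NotImplementedError("wire up your LLM client of choice")

    # The one tool we expose: a tiny, safe arithmetic evaluator.
    OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv, ast.Pow: operator.pow}

    def calculator(expr):
        def ev(n):
            if isinstance(n, ast.Constant):
                return n.value
            if isinstance(n, ast.BinOp):
                return OPS[type(n.op)](ev(n.left), ev(n.right))
            raise ValueError("unsupported expression")
        return ev(ast.parse(expr, mode="eval").body)

    def answer(question, max_turns=5):
        messages = [{"role": "system",
                     "content": "If you need arithmetic, reply exactly: CALC: <expression>."},
                    {"role": "user", "content": question}]
        for _ in range(max_turns):
            reply = call_llm(messages)
            if reply.startswith("CALC:"):
                result = calculator(reply[len("CALC:"):].strip())
                messages += [{"role": "assistant", "content": reply},
                             {"role": "user", "content": f"Result: {result}"}]
            else:
                return reply
        return reply

The model decides when it needs the tool; the host program does the actual computation and feeds the result back.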
And that's put me in a mind of wondering what will happen once we start networking all these specialized AIs and tools together. It might not be able to do everything by itself but it can get quite far figuring out requirements and turning those into running code. It's not that big of a leap from answering questions about how to do things to actually building programs that do things.
Can you write this example in a way that's more comprehensible to humans, and then we can ask GPT-4 about it?
The two main problems I see with attributing AI to these programs are:
1. People will assume they are receiving an intelligent response they can rely on without sanity checking. This is different from receiving the same response from another person, because one learns who to trust and when. You can never trust these programs.
2. If/when real AI emerges, it will be treated poorly because most people will assume it is the same "brainless" AI they were sold so many times before. In that respect, the treatment of real AI will be equivalent to child abuse or slavery and will result in another giant black mark in human history.