The model provides surprisingly good responses on topics I know are covered extensively online but where the exact information I want is troublesome to track down. I have even found it useful when I know there is a tool for what I want but can’t recall the jargon needed to find it via Google. Simply describing the rough idea is enough to get the model to spit out the jargon I need.
However, the moment I ask a real question that goes beyond summarizing something which is covered thousands of times online, I am immediately let down.
Is this just a result of the foundation of the model being the world’s best autocompletion engine? My assessment is “yes”, and I don’t believe that any of the modifications coming, like plugins, will fundamentally change this.
Furthermore, I just don’t feel like the transformer architecture is suited for problem solving. I may just be a charlatan, but self-attention over the space of words does not seem like it’s going to be enough, and praying that problem solving falls out as emergent behavior if we just add more parameters is… unscientific-ish? Now, if you could figure out a way to do self-attention over the space of concepts? Maybe you’ve got something.
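For anyone who hasn't looked under the hood, "self-attention over the space of words" boils down to something like the numpy sketch below. It's purely illustrative and the names and shapes are my own; real transformers add multiple heads, masking, residuals, and many stacked layers:

    # Minimal sketch of scaled dot-product self-attention over token embeddings.
    # Illustrative only; not taken from any particular model.
    import numpy as np

    def self_attention(x, w_q, w_k, w_v):
        """x: (seq_len, d_model) token embeddings; w_*: (d_model, d_head) projections."""
        q = x @ w_q                                 # queries
        k = x @ w_k                                 # keys
        v = x @ w_v                                 # values
        scores = q @ k.T / np.sqrt(k.shape[-1])     # how much each token attends to each other token
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the *word* positions
        return weights @ v                          # each output is a mixture of value vectors

    rng = np.random.default_rng(0)
    seq_len, d_model, d_head = 5, 16, 8
    x = rng.normal(size=(seq_len, d_model))         # stand-in for 5 word embeddings
    w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    print(self_attention(x, w_q, w_k, w_v).shape)   # (5, 8)

The whole mechanism mixes vectors that represent word positions, which is why "attention over concepts" would be a genuinely different thing.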
I feel like AlphaGo-style ideas and some variation on MCTS are more likely to produce a solid problem-solving architecture.
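To make the comparison concrete, here is a toy sketch of plain UCT-style MCTS on a Nim-like game (take 1-3 stones, last stone wins). This is only the search skeleton, written from scratch for illustration; AlphaGo replaces the random rollouts with learned policy and value networks:

    import math, random

    class Node:
        def __init__(self, pile, player, parent=None, move=None):
            self.pile, self.player = pile, player          # stones left, player to move (0 or 1)
            self.parent, self.move = parent, move
            self.children, self.visits, self.wins = [], 0, 0.0
            self.untried = [m for m in (1, 2, 3) if m <= pile]

    def rollout(pile, player):
        """Random playout; whoever takes the last stone wins."""
        while True:
            pile -= random.choice([m for m in (1, 2, 3) if m <= pile])
            if pile == 0:
                return player
            player = 1 - player

    def best_move(pile, player=0, iters=3000):
        root = Node(pile, player)
        for _ in range(iters):
            node = root
            # 1. Selection: walk down via UCB1 while fully expanded.
            while not node.untried and node.children:
                node = max(node.children, key=lambda c: c.wins / c.visits
                           + math.sqrt(2 * math.log(node.visits) / c.visits))
            # 2. Expansion: add one previously untried move.
            if node.untried:
                m = node.untried.pop()
                child = Node(node.pile - m, 1 - node.player, parent=node, move=m)
                node.children.append(child)
                node = child
            # 3. Simulation: terminal nodes are decided; otherwise play out randomly.
            winner = 1 - node.player if node.pile == 0 else rollout(node.pile, node.player)
            # 4. Backpropagation: credit wins to the player who moved into each node.
            while node:
                node.visits += 1
                node.wins += (winner == 1 - node.player)
                node = node.parent
        return max(root.children, key=lambda c: c.visits).move

    print(best_move(10))   # optimal play is to take 2, leaving a multiple of 4

The appeal is that the search explicitly explores and evaluates alternatives instead of committing to whatever token comes out next.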
I'm very sure I said this from the start, against the ridiculous hype. Summarization of existing text is the *only* safe use case for LLMs. Anything else is asking for disappointment.
We have already seen it used as a search engine, where it confidently hallucinates incorrect information. We have seen it pretend to be a medical professional or a replacement attorney, and it has outright regurgitated nonsensical and dangerous advice, making it completely unreliable for those use cases. And (deep) neural networks in general are still the same black boxes, unable to explain or reason about their own decisions, which makes them unsuitable for high-risk applications and use cases.
As for writing code: despite what the hype squad tells you about GPT-4 and ChatGPT, the ground reality is that they generate broken code from the start and cannot reason about why they did so. Non-programmers wouldn't question the output, whereas an experienced professional would catch the errors immediately.
Due to this untrustworthiness, programmers now have to check and review every piece of output GPT-4 and ChatGPT generate before using it in their projects.
The AI/LLM hype has only further exposed these limitations.
Where GPT-4 shines for me is when I have a project swimming around in my head that I want to work on for fun. It can get you off the ground quickly, and for side projects the quality and correctness of the output isn't that important.
For professional software development, GPT-4 is still wrong way too often for me to feel comfortable using it. And it's not all that much faster than going straight to the source anyway.
If ChatGPT is too successful and people stop producing content as a result, it might end up in a local optimum that isn't so optimal.
I still use stack overflow regularly as an engineer.
Sometimes GPT-4 will have a quicker tailor-fit answer, but sometimes it will flounder as well.
This is the saddest version of ChatGPT I can imagine. I found that as search engines emulated natural language, their results got steadily worse.
I just want the Google results and interface from a long time ago.
Why? I'm not sure how you could expect anything else in the first place.
> ChatGPT and GPT-4 do relatively well on well-known datasets […] however, the performance drops significantly when handling newly released and out-of-distribution [datasets]. Logical reasoning remains challenging for ChatGPT and GPT-4
But reading the paper, the challenges it is failing on are ones that I'd wager the average human would fail on too (at least a good portion of the time).
The paper might strictly be accurate, but I think we should try and bring these papers back to a real-world context - which is that it’s probably operating above your average human at these tasks.
Is superhuman/genius-level capability really required before we say the LLMs are any good?
(I see this view on HN too - statements like ‘LLMs can’t create novel maths theorems!’ as an argument that LLMs aren’t good at reasoning, disregarding that most humans today can’t find novel/undiscovered maths theorems)
However it's important to note one VERY important thing -- this is not a system that is designed to reason! At all, as far as I know. That just fell out of its ability for language somehow. So to just accidentally be able to reason like a 4-year-old human (who is vastly more clever than the adult of any other animal species I'm aware of) is incredibly impressive, and I think the next obvious step is to couple this tech with some classic computing, which has far exceeded human capabilities for logic and reason for decades already. If ChatGPT had some secondary system for reasoning and just used the LLM for setting up problems and reading results, I think it could reach superhuman levels of reasoning quite easily.
For what it's worth, neither are we, really. Not disagreeing with anything you're saying, just musing.
> superhuman levels of reasoning
This one has always stumped me a bit though. I'm not quite sure what that looks like. Laplace's Demon?
People want it to perform better than any expert human at any possible subject before it's considered "real AI". It isn't enough for critics for it to be better than the average person at virtually everything it's put to the test on.
It seems like there is some resentment and almost anger at this technology, particularly with the artistic AIs like Midjourney. I can understand that more readily, but what's the real beef with ChatGPT?
People seem to have a real tough time accepting that human brains might not be that special. They see things like GPT-4, and tend to fall into soothing mental traps to rationalize that innate but baseless rejection. I actually view all the sustained anger and resentment as a signal that we are making meaningful inroads into AGI, as it means that people are actually being impacted.
One of the most common mental traps is "It's just fancy autocomplete." People tend to stop there and don't proceed to consider that the veracity of that claim is irrelevant. Autocomplete or not, GPT-4 seems to be able to provide meaningful assistance to certain workflows that were previously only within the bounds of human cognition.
> People want it to perform better than any expert human at any possible subject before it's considered "real AI". It isn't enough for critics for it to be better than the average person at virtually everything it's put to the test on.
It's quite amusing that some people have moved their goalposts to "well it's not a superintelligence, therefore it's worthless". Simultaneously, it's highly depressing, because it means various actors will likely achieve AGI while the rest of us are still bickering about autocomplete and Chinese rooms.
We don't exactly know what it can and can't do, a property which in a computer program at any rate is deeply mysterious and unusual. It initially gives the appearance of being a human which knows everything. This leads a lot of people to angrily declare that its appearance is deceptive, and in searching for words to describe in exactly what way it falls short, they incorporate flawed intuition on what it is capable of. So there's a lot of back and forth right now as we collectively swap memes to try and make sense of such a dramatic development.
And many of those things are worse at their task than a person, except for speed and scaling. Can a face-recognition machine be fooled by dazzle in a way a human wouldn't be? Sure, but nobody is willing to pay for a human to go through everyone's photo albums...
But does ChatGPT "use AI" as a tool in the same sense that Spotify's recommendations "use AI" or is it "an AI" in the sense that it's an independent consciousness/agent?
This is the first time so many people have disagreed on that part. And that skews the debate into "a person is better" versus talking about whether a person is even practical in most of the situations we'd use this in.
But in real-world contexts, there are some tasks that just about anyone could do, others where "average" human performance isn't good enough and you need to hire an expert, and also some jobs that can only be done by machine.
So it seems like the bar should be set based on what you think is necessary for whatever practical application you have in mind?
If it's just a game, beating an average chess player, someone who is really good, or the best in the world are different milestones. And for chess there is the Elo rating system that lets you answer this more precisely, too.
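For reference, the Elo expected-score formula is simple enough to sanity-check a milestone against; a quick sketch with made-up ratings:

    # Expected score for player A against player B under the Elo model.
    def elo_expected(r_a, r_b):
        return 1 / (1 + 10 ** ((r_b - r_a) / 400))

    # e.g. an average club player (~1500) vs. a grandmaster (~2600):
    print(round(elo_expected(1500, 2600), 4))   # roughly 0.0018: almost no chance

That kind of calibrated, task-specific scale is exactly what most "is it good enough?" arguments about LLMs are missing.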
A paper about how well chatbots do on some reasoning tests can't answer this for you.
When they conclude that GPT-4 "does not perform astonishingly well" - what is this compared to?
They never define what 'doing well' looks like, were not able to identify an application that does better than GPT-4, and were also not able to say what a human benchmark would be on the same task.
I can say, though, that I read the sample question and got it wrong too, so these aren't trivial questions we are giving GPT-4.
So based on this, I just don't really understand how they can support their conclusion that it "does not perform astonishingly well".
Probably the best feature of GPT-4 is the ability to use tools. For example, it may not be that good at calculating things, but it can use a calculator. And if you think about it, a lot of people (including mathematicians) aren't actually that good at calculating either. We all learn it in school and then we forget much of it. That's why we have calculators. It's not a big deal.
GPT-4 is more than capable of knowing the best tool for the job, and figuring out how to use it isn't that hard. You can actually ask it "what's the best tool for X", get a usable response, and then ask a follow-up question to produce a script in the language of your choosing that demonstrates how to use it, complete with unit tests. Not a hypothetical: I've been doing that for the past few weeks and getting some usable results.
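The loop behind this kind of tool use isn't complicated either. A deliberately stripped-down, hypothetical sketch; call_llm and the "CALC:" convention are stand-ins of my own, not any vendor's actual API:

    import ast, operator

    # Hypothetical stand-in for a chat-completion call; plug in a real client here.
    def call_llm(messages):
        raise NotImplementedError("wire up your LLM client of choice")

    # The one tool we expose: a tiny, safe arithmetic evaluator.
    OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv, ast.Pow: operator.pow}

    def calculator(expr):
        def ev(n):
            if isinstance(n, ast.Constant):
                return n.value
            if isinstance(n, ast.BinOp):
                return OPS[type(n.op)](ev(n.left), ev(n.right))
            raise ValueError("unsupported expression")
        return ev(ast.parse(expr, mode="eval").body)

    def answer(question, max_turns=5):
        messages = [{"role": "system",
                     "content": "If you need arithmetic, reply exactly: CALC: <expression>."},
                    {"role": "user", "content": question}]
        for _ in range(max_turns):
            reply = call_llm(messages)
            if reply.startswith("CALC:"):
                result = calculator(reply[len("CALC:"):].strip())
                messages += [{"role": "assistant", "content": reply},
                             {"role": "user", "content": f"Result: {result}"}]
            else:
                return reply
        return reply

The model decides when it needs the tool; the host program does the actual computation and feeds the result back.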
And that's put me in a mind of wondering what will happen once we start networking all these specialized AIs and tools together. It might not be able to do everything by itself but it can get quite far figuring out requirements and turning those into running code. It's not that big of a leap from answering questions about how to do things to actually building programs that do things.
Can you write this example in a way that's more comprehensible to humans, and then we can ask GPT-4 about it?
The two main problems I see with attributing AI to these programs are:
1. People will assume they are receiving an intelligent response they can rely on without sanity checking. This is different from receiving the same response from another person, because one learns who to trust and when. You can never trust these programs.
2. If/when real AI emerges, it will be treated poorly because most people will assume it is the same "brainless" AI they were sold so many times before. In that respect, the treatment of real AI will be equivalent to child abuse or slavery and will result in another giant black mark in human history.