Define reasoning. By my definition, GPT-4 can reason without a doubt. It definitely can't reason better than experts in the field, but it can reason better than, say, interns.
"Say I have a container with 50 red balls and 50 blue balls, and every time I draw a blue ball from the container, I add two white balls back. After drawing 100 balls, how many of each different color ball are left in the container? Explain why."
... because on GPT-3.5 the answer begins like this and then gets worse:
"Let's break down the process step by step:
Initially, you have 50 red balls and 50 blue balls in the container.
1) When you draw a blue ball from the container, you remove one blue ball, and you add two white balls back. So, after drawing a blue ball, you have 49 blue balls (due to removal) and you add 2 white balls, making it a total of 52 white balls (due to addition) ..."
If I was hiring interns this dumb, I'd be in trouble.
EDIT: judging by the GPT-4 responses, I remain of the opinion I'd be in trouble if my interns were this dumb.
So I clarified to ChatGPT that the drawing is random. And it replied: "The exact numbers can vary based on the randomness and can be precisely modeled with a simulation or detailed probabilistic analysis."
I asked for a detailed probabilistic analysis and it gave a very simplified one, then basically said that a Monte Carlo approach would be easier. That actually sounds more like most people I know than not. :-)
Seems like quite a difficult question to compute exactly.
I reworded the question to make it clearer, and then it was able to simulate a bunch of scenarios as a Monte Carlo simulation. Was your hope to calculate it exactly with dynamic programming? GPT-4 was not able to do this, but I suspect neither could a lot of your interns.
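For what it's worth, an exact answer is computable with a small dynamic program over container states, i.e. tracking the probability of every reachable (red, blue, white) count after each draw. A rough Python sketch of that idea (my own, not anything GPT produced):

    from collections import defaultdict

    def exact_distribution(red=50, blue=50, draws=100):
        # Probability of each container state (red, blue, white) after `draws` draws.
        # Rule: a drawn blue ball is removed and two white balls are added back.
        states = {(red, blue, 0): 1.0}
        for _ in range(draws):
            nxt = defaultdict(float)
            for (r, b, w), p in states.items():
                total = r + b + w
                if r:
                    nxt[(r - 1, b, w)] += p * r / total
                if b:
                    nxt[(r, b - 1, w + 2)] += p * b / total
                if w:
                    nxt[(r, b, w - 1)] += p * w / total
            states = dict(nxt)
        return states

    dist = exact_distribution()
    for colour, idx in (("red", 0), ("blue", 1), ("white", 2)):
        print(colour, sum(state[idx] * p for state, p in dist.items()))

It just pushes probability mass through every reachable state for 100 draws and then reads off the expected leftover counts.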
These are very good questions that anyone with the ability to reason would ask if given this problem.
You're asking GPT to do maths in its head, the AI equivalent of a person standing in the middle of the room with no tools getting grilled in an oral examination of their knowledge.
Instead, collaborate with it, while giving it the appropriate tools to help you.
I asked it to write a Monte Carlo simulation of the problem in Wolfram Mathematica script. It did this about 10-100x faster than I would have been able to. It made a few small mistakes with the final visualisation, but I managed to get it to output a volumetric plot showing the 3D scatter plot of the histogram of possible outcomes.
I even got it to save a video of the plot rotating: https://streamable.com/2aphbz
All told, I'd say it's a decent answer.
Edit: I took it to completion: https://chat.openai.com/c/6cdd92f1-487a-4e1c-ab94-f2bdbf282d...
These were the first responses each time, with no massaging/retries/leading answers. I will say it's not entirely there. I re-ran the initial question a few times afterwards and one was basically gibberish.
...the composition of what's left in the container.
There's a couple of scenarios, which depend on when you run out of blue balls:
1) If you draw all 50 blue balls first, for every blue ball you draw, you are adding 2 white balls. Hence, once you've drawn all the blue balls, you have replaced them with 100 white balls. This means you've drawn 50 balls total and there's now 100 white balls and 50 red balls (150 balls) in the container. Then, you continue and draw the next 50 balls. These could be any combination of white and red balls, we don't know which color is preferred when you continue drawing.
2a) If you draw all red balls in these next 50 draws, you would have 50 red balls out and 50 blue balls out. So, you would end with 100 white balls in the container.
2b) If you draw all white balls in these next 50 draws you would have 50 red balls, 50 blue balls, and 50 white balls out. So, you would end with 50 white balls and 50 red balls in the container.
2c) If you draw a mixture of red and white balls in these next 50 draws, the number leftover in the container would vary according to the specific combination drawn.
Remember, the order in which the balls are drawn matters to this question since we have a rule in place that changes the overall total (or population) of balls within the container. In each of these scenarios, you've drawn 50 times initially (all blue) and then 50 times again (in some unknown combination of red and white). You've drawn 100 times total and changed the number of white balls in the container from zero to an amount dependent on how many times you drew a white ball on your second round of 50 draws.
2b) If you draw all white balls in these next 50 draws you would have 50 red balls, 50 blue balls, and 50 white balls out. So, you would end with 50 white balls and 50 red balls in the container.
... so after removing 100 balls, I've removed 150 balls? And the 150 balls that I've removed are red, white and blue despite the fact that I removed 50 blue balls initially and then 50 white ones.
Seems to me like it reasons its way to this answer at the end: Mind you, while averages are insightful, they don't capture the delightful unpredictability of each individual run. Would you like to explore this delightful chaos further, or shall we move on to other intellectual pursuits?
Took a bit of massaging, and I enabled the Data Analysis plugin, which lets it write Python code and run it. The simulation code it produced looks correct though.
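For anyone who wants to poke at it without the plugin, a minimal sketch of that kind of Monte Carlo simulation in plain Python (my own rough version, not the code the model generated):

    import random
    from collections import Counter

    def simulate(red=50, blue=50, draws=100):
        # One run: draw uniformly at random; a drawn blue ball is
        # removed and two white balls are added back.
        balls = ["red"] * red + ["blue"] * blue
        for _ in range(draws):
            colour = balls.pop(random.randrange(len(balls)))
            if colour == "blue":
                balls += ["white", "white"]
        return Counter(balls)

    runs = 10_000
    totals = Counter()
    for _ in range(runs):
        totals += simulate()
    print({c: totals[c] / runs for c in ("red", "blue", "white")})

Averaging the counts over many runs approximates the expected composition; individual runs vary.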
Uhm.
You see reasoning issues when you use more real-world examples, rather than theoretical tests.
I had 4 failure states.
1) Summarization: It summarized 3 transcripts correctly; for the fourth, it described the speaker as a successful VC. The speaker was a professor.
2) It was to act as a classifier, with a short list of labels. Depending on the length of text, the classifier would swap over to text gen. Other issues included novel labels, new variations of labels, and so on.
3) Agents - This died on the vine. Leave aside having to learn async, vector DBs, or whatever: you can never trust the output of an LLM, so you can never chain agents.
4) I focused on using ChatGPT to complete a project. I hadn't touched HTML ever - the goal was to use ChatGPT to build the site. This would cover design, content, structure, development, hosting, and improvements.
I still have trauma. Wrong code and bad design were the baseline issues. If the code was correct, it simply meant I had dug a deeper grave. I had anticipated 70% of the work being handled by ChatGPT; it ended up at 30% at most.
ChatGPT is great IF you already are a subject expert - you can brush over the issues and move on.
"Hallucinations" is the little bit of string that you pull on, and the rest unravels. There are no hallucinations, only humans can hallucinate - because we have an actual ground truth to work with.
LLMs are only predicting the next token. For them to reason, they would have to hold structures and proxies in some data store and actively alter them.
It's easier to see once you deal with hallucinations.
Example of a basic problem: In a shop, there are 4 dolls of different heights P, Q, R and S. S is neither as tall as P nor as short as R. Q is shorter than S but taller than R. If Kittu wants to purchase the tallest doll, which one should she purchase? Think step by step.
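For reference, the constraints (P taller than S, S taller than R, S taller than Q, Q taller than R) force a single ordering, which a quick brute-force check in Python confirms (my own sketch, not model output):

    from itertools import permutations

    def taller(a, b, rank):
        return rank[a] < rank[b]   # smaller rank = taller

    # Try every ordering of the four dolls, tallest first.
    for order in permutations("PQRS"):
        rank = {d: i for i, d in enumerate(order)}
        if (taller("P", "S", rank) and taller("S", "R", rank)        # S neither as tall as P nor as short as R
                and taller("S", "Q", rank) and taller("Q", "R", rank)):  # Q shorter than S but taller than R
            print("Tallest to shortest:", " > ".join(order))

Only one ordering survives, with P as the tallest doll.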