While the reasoning may have been improved, this doesn't solve the problem of the model having no way to assess if what it conjures up from its weights is factual or not.
A lot of people use LLMs as a search engine. It makes sense - it's basically a lossy compressed database of everything it's ever read, and it generates output that is statistically likely - with the degree of likeliness varying with the temperature, as well as with which particular weights your prompt ends up activating.
The magic of LLMs, especially one like this that supposedly has advanced reasoning, isn't the existing knowledge in its weights. The magic is that _it knows English_. It knows English at or above the level of most fluent speakers, and it can also produce output that is not just a likely output, but a logical output. It's not _just_ an output engine. It's an engine that outputs.
Asking it about nuanced details in the corpus of data it has read won't give you good output unless it read a bunch of it.
On the other hand, if you were to paste the entire documentation set to a tool it has never seen and ask it to use the tool in a way to accomplish your goals, THEN this model would be likely to produce useful output, despite the fact that it had never encountered the tool or its documentation before.
Don't treat it as a database. Treat it as a naive but intelligent intern. Provide it data, give it a task, and let it surprise you with its output.
That’s the problem: it’s a _terrible_ intern. A good intern will ask clarifying questions, tell me “I don’t know” or “I’m not sure I did it right”. LLMs do none of that, they will take whatever you ask and give a reasonable-sounding output that might be anything between brilliant and nonsense.
With an intern, I don’t need to measure how good my prompting is; we’ll usually interact to arrive at a common understanding. With an LLM, I need to put a huge amount of thought into the prompt and have no idea whether the LLM understood what I’m asking and if it’s able to do it.
This is not an apt description of the system that insists the doctor is the mother of the boy involved in a car accident when elementary understanding of English and very little logic show that answer to be obviously wrong.
People, for the most part, know what they know and don't know. I am not uncertain that the distance between the earth and the sun varies, but I'm certain that I don't know the distance from the earth to the sun, at least not with better precision than about a light week.
This is going to have to be fixed somehow to progress past where we are now with LLMs. Maybe expecting an LLM to have this capability is wrong, perhaps it can never have this capability, but expecting this capability is not wrong, and LLM vendors have somewhat implied that their models have this capability by saying they won't hallucinate, or that they have reduced hallucinations.
You are falling into the trap that everyone does: anthropomorphising it. It doesn't understand anything you say. It just statistically knows what a likely response would be.
Treat it as text completion and you can get more accurate answers.
That's the crux of the problem. Why and who would treat it as an intern? It might cost you more in explaining and dealing with it than not using it.
The purpose of an intern is to grow the intern. If this intern is static and will always be at the same level, why bother? If you had to feed and prep it every time, you might as well hire a senior.
I sneak in a benchmark opening of data every time I start a new chat - so right off the bat I can see in its response whether this chat session is gonna be on point or if we are going off into wacky world, which saves me time as I can just terminate and try starting another chat.
ChatGPT is fickle daily. Most days it's on point. Some days it's wearing a bicycle helmet and licking windows. Kinda sucks I can't just zone out and daydream while working. Gotta be checking replies for when the wheels fall off the convo.
I'm not up to date with these things because I haven't found them useful. But what you said, plus previous limitations in how much data they can retain, essentially makes them pretty darn useless for that task.
Great learning tool for common subjects you don't know, such as learning a new programming language. Also great for inspiration etc. But that's pretty much it?
Don't get me wrong, that is mindblowingly impressive but at the same time, for the tasks in front of me it has just been a distracting toy wasting my time.
There's not much evidence of that. It only marginally improved on instruction following (see livebench.ai) and its score as a SWE-bench agent is barely above GPT-4o (model card).
It does better on really hard problems, but it's unclear whether that matters all that much.
> A lot of people use LLMs as a search engine.
Except this is where LLMs are so powerful. A sort of reasoning search engine. They memorized the entire Internet and can pattern match it to my query.
I couldn't agree more, this is exactly the strength of LLMs that we should focus on. If you can make your problem fit into this paradigm, LLMs work fantastically. Hallucinations come from that massive "lossy compressed database", but you should consider that part as more like the background noise that taught the model to speak English, and the syntax of programming languages, rather than the source of the knowledge to respond with. Stop anthropomorphizing LLMs; play to their strengths instead.
In other words it might hallucinate an API, but it will rarely, if ever, make a syntax error. Once you realize that, it becomes a much more useful tool.
I've found an amazing amount of success with a three step prompting method that appears to create incredibly deep subject matter experts who then collaborate with the user directly.
1) Tell the LLM that it is a method actor.
2) Tell the method actor they are playing the role of a subject matter expert.
3) At each step, 1 and 2, use the technical language of that type of expert; method actors have their own technical terminology, use it when describing the characteristics of the method actor, and likewise use the scientific/programming/whatever technical jargon of the subject matter expert your method actor is playing.
Then, in the system prompt or whatever logical wrapper the LLM operates through for the user, instruct the "method actor" like you are the film director trying to get your subject matter expert performance out of them.
I offer this because I've found it works very well. It's all about crafting the context in which the LLM operates, and this appears to cause the subject matter expert to be deeper, more useful, smarter.
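For what it's worth, here is a minimal sketch of what that three-step framing can look like in practice, using the OpenAI Python client. The model name and the exact wording are illustrative assumptions on my part, not the parent's actual prompts.

```python
# Minimal sketch of the three-step "method actor" framing described above.
# The model name and the wording are illustrative assumptions, not the
# parent commenter's actual prompts.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

system_prompt = (
    # Step 1: the LLM is a method actor (described in acting terminology).
    "You are a dedicated method actor. You stay in character at all times, "
    "drawing on sense memory and given circumstances to fully inhabit a role.\n"
    # Step 2: the role is a subject matter expert (described in that field's jargon).
    "Your current role: a senior Rust systems engineer who thinks in terms of "
    "ownership, lifetimes, zero-cost abstractions, and cache-friendly data layout.\n"
    # Step 3: direct the actor the way a film director asks for a performance.
    "Direction: answer the user as this expert would in a design review - "
    "precise, technical, and willing to push back on flawed assumptions."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "How should I lay out a trie for a hot lookup path?"},
    ],
)
print(response.choices[0].message.content)
```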
Well, I am a naive but intelligent intern (well, senior developer). So in this framing, the LLM can’t do more than I can already do by myself, and thus far it’s very hit or miss if I actually save time, having to provide all the context and requirements, and having to double-check the results.
With interns, this at least improves over time, as they become more knowledgeable, more familiar with the context, and become more autonomous and dependable.
Language-related tasks are indeed the most practical. I often use it to brainstorm how to name things.
This isn’t true because, as you can read in the first sentence of the post you’re responding to, GP did give it a task like you recommend here:
> Provide it data, give it a task, and let it surprise you with its output.
And it fails the task. Specifically it fails it by hallucinating important parts of accomplishing it.
> hallucinates non-existing libraries and functions
This post only makes sense if your advice to “let it surprise you with its output” is mandatory, like you’re using it wrong if you do not make yourself feel impressed by it.
It’s still changing things to be several versions old, from its innate knowledge-base pattern-matching or whatever you want to call it. I find that pretty disappointing.
Just like Copilot and GPT-4, it’s changing `add_systems(Startup, system)` to `add_startup_system(system.system())` and other pre-schedule/fanciful APIs—things it should have in context.
I agree with your approach to LLMs, but unfortunately “it’s still doing that thing.”
PS: and by the time I’d done those experiments, I ran out of preview, resets 5 days from now. D’oh
GPT-4o is wonderful as a search engine if you tell it to google things before answering (even though it uses bing).
So mostly useless then?
EDIT: Note this was run over a dataset of short stories rather than the novels since the API errors out with very long contexts like novels.
There's no way you can "reason" a correct answer to "list the tracklisting of some obscure 1991 demo by a band not on Wikipedia." You either know or you don't.
I usually test new models with questions like "what are the levels in [semi-famous PC game from the 90s]?" The release version of GPT-4 could get about 75% correct. o1-preview gets about half correct. o1-mini gets 0% correct.
Fair enough. The GPT-4 line aren't meant to be search engines or encyclopedias. This is still a useful update though.
You're using a calculator as a search engine.
It doesn't even know mildly obscure facts that are on the internet.
For example, last night I was trying to do something with C# generics and it confidently told me I could use pattern matching on the type in a switch statement, and threw out some convincing-looking code.
You can't; it's impossible. It was completely wrong. When I told it this, it told me I was right, and proceeded to give me code that was even more wrong.
This is an obscure, but well documented, part of the spec.
So it's not about facts that aren't on the internet, it's just bad at facts, full stop.
What it's good at is facts the internet agrees on. Unless the internet is wrong. Which is not always a good thing, given how confident the language it uses sounds.
If you want to fuck with AI models, ask a bunch of code questions on Reddit, GitHub and SO with example code saying 'can I do X'. The answer is no, but ChatGPT/Copilot/etc. will start spewing out that nonsense as if it's fact.
As for non-programming, we're about to see the birth of a new SEO movement of tricking AI models into believing your 'facts'.
That's the frustrating thing. LLMs don't materially reduce the set of problems where I'm running against a wall or have trouble finding information.
After that you switch to Claude Sonnet and after some time it also gets stuck.
Problem with LLMs is that they are not aware of libraries.
I've fed them the library version via requirements.txt, the Python version I am using, etc...
They still make mistakes and try to use methods which do not exist.
Where to go from here? At this point I manually pull the library version I am using and go to its docs, and I generate a page which uses this library correctly (then I feed that example into the LLM).
Using this approach works. Now I just need to automate it so that I don't have to manually find the library and create the specific example which uses the methods I need in my code!
Directly feeding the docs isn't working well either.
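A rough sketch of how that automation might look: read the pinned version from requirements.txt, fetch that version's docs, and have a model distill them into one known-good example that gets pasted into the real prompt. The readthedocs URL scheme, the package name, and call_model() below are placeholder assumptions; real doc locations vary per library.

```python
# Rough sketch of automating the manual loop described above. The readthedocs
# URL scheme, "somepackage", and call_model() are placeholder assumptions.
import re
import urllib.request

def call_model(prompt: str) -> str:
    raise NotImplementedError("plug in any chat-completion call here")

def pinned_version(requirements_path: str, package: str) -> str | None:
    # Find an exact "package==x.y.z" pin in requirements.txt.
    with open(requirements_path) as f:
        for line in f:
            m = re.match(rf"{re.escape(package)}==([\w.]+)", line.strip())
            if m:
                return m.group(1)
    return None

def fetch_docs(package: str, version: str) -> str:
    # Placeholder scheme: many projects publish versioned docs like this,
    # but the exact URL has to be checked per package.
    url = f"https://{package}.readthedocs.io/en/{version}/"
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="ignore")

def worked_example(package: str, version: str, docs_excerpt: str, need: str) -> str:
    # One model call to turn the docs into a small known-good example.
    return call_model(
        f"Below is documentation for {package}=={version}.\n{docs_excerpt}\n\n"
        f"Write one short, correct example that {need}, using only APIs shown above."
    )

version = pinned_version("requirements.txt", "somepackage")  # hypothetical package
if version:
    docs = fetch_docs("somepackage", version)[:8000]          # keep the excerpt small
    example = worked_example("somepackage", version, docs, "loads and parses a file")
    print("Paste this example into the real prompt:\n" + example)
```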
That seems to eliminate a lot of the issues, though it's not a seamless experience, and it adds another step of having to put the library docs in a text file.
Alternatively, cursor can fetch a web page, so if there's a good page of docs you can bring that in by @ the web page.
Eventually, I could imagine LLMs automatically creating library text doc files to include when the LLM is using them to avoid some of these problems.
It could also solve some of the issues of their shaky understanding of newer frameworks like SvelteKit.
I’m in the “probabilistic token generators aren’t intelligence” camp, so I don’t actually believe in AGI, but I’ll be honest: the never-ending rumors / chatter almost got to me.
Remember, this is the model some media outlet reported recently that is so powerful OAI is considering charging $2k/month for
Maybe this has been extensively discussed before, but since I've lived under a rock: which parts of intelligence do you think are not representable as conditional probability distributions?
This never happened. No one said it happened.
"the model some media outlet reported recently that is so powerful OAI is considering charging $2k/month for"
The Information reported someone at a meeting suggested this for future models, not specifically Strawberry, and that it would probably not actually be that high.
In public coding AI comparison tests, results showed 4o scoring around 35%, o1-preview scoring ~50% and o1 scoring ~85%.
o1 is not yet released, but has been run through many comparison tests with public results posted.
The system doesn’t become useless if it takes 2 tries instead of 1 to get it right
Still saves an incredible amount of time vs doing it yourself
It is perfectly possible to have code that runs without errors but gives a wrong answer. And you may not even realise it’s wrong until it bites you in production.
And a few times the amount of time I spent trying to coax a correct answer out of AI trumped any potential savings I could've had
It's like when an LLM gives you a wrong answer and all it takes is "are you sure?" to get it to generate a different answer.
Of course the underlying problem of the model not knowing what it knows or doesn't know persists, so giving it the ability to reflect on what it just blurted out isn't always going to help. It seems the next step is for them to integrate RAG and tool use into this agentic wrapper, which may help in some cases.
Oooh... oohhh!! I just had a thought: By now we're all familiar with the strict JSON output mode capability of these LLMs. That's just a matter of filtering the token probability vector by the output grammar. Only valid tokens are allowed, which guarantees that the output matches the grammar.
But... why just data grammars? Why not the equivalent of "tab-complete"? I wonder how hard it would be to hook up the Language Server Protocol (LSP) as seen in Visual Studio Code to an AI and have it only emit syntactically valid code! No more hallucinated functions!
I mean, sure, the semantics can still be incorrect, but not the syntax.
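Here's a toy sketch of that filtering idea, with a stand-in "model" and a deliberately trivial grammar (a JSON-ish list of digits), just to show where the mask goes. A real implementation would apply the same filter to the LLM's logits at each decoding step.

```python
import random

# Toy sketch of grammar-constrained decoding: a stand-in "model" scores every
# token, then anything the grammar disallows next is filtered out before
# choosing, so the output is syntactically valid by construction
# (syntax only, not semantics).

VOCAB = list("0123456789,[]")

def fake_model_scores(prefix: str) -> dict[str, float]:
    # Stand-in for real LM logits: random preferences over the vocabulary.
    return {tok: random.random() for tok in VOCAB}

def allowed_next(prefix: str) -> set[str]:
    # Trivial grammar: "[", digits separated by commas, "]".
    if prefix == "":
        return {"["}
    if prefix.endswith("[") or prefix.endswith(","):
        return set("0123456789")
    if prefix[-1].isdigit():
        return set("0123456789") | {",", "]"}
    return set()  # after "]" the output is complete

def generate(max_len: int = 12) -> str:
    out = ""
    while len(out) < max_len:
        legal = allowed_next(out)
        if not legal:
            break
        scores = fake_model_scores(out)
        out += max(legal, key=lambda t: scores[t])  # grammar-filtered argmax
    return out

print(generate())  # e.g. "[4,18,2]" - always a valid prefix of the grammar
```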
Both abilities are powerful, but they are very different powers.
It's your right to dismiss it, if you want, but if you want to get some value out of it, you should play to its strengths and not look for things that it fails at as a gotcha.
I could imagine OpenAI might allow their own vetted tools to be used, but perhaps it will be a while (if ever) before developers are allowed to hook up their own tools. The risks here are substantial. A model fine-tuned to run chain-of-thought that can answer graduate level physics problems at an expert level can probably figure out how to scam your grandma out of her savings too.
This comment makes no sense in the context of what an LLM is. To even say such a thing demonstrates a lack of understanding of the domain. What we are doing here is TEXT COMPLETION; no one EVER said anything about being accurate and "true". We are building models that can complete text. What did you think an LLM was, a "truth machine"?
I've tried asking it factual information, and it asserts that it's incorrect but it will definitely hallucinate questions like the above.
You'd think the reasoning would nail that and most of the chain-of-thought systems I've worked on would have fixed this by asking it if the resulting answer was correct.
> Results on AIME and GPQA are really strong, but that doesn’t necessarily translate to something that a user can feel. Even as someone working in science, it’s not easy to find the slice of prompts where GPT-4o fails, o1 does well, and I can grade the answer. But when you do find such prompts, o1 feels totally magical. We all need to find harder prompts.
Results are "strong" but can't be felt by the user? What does that even mean?
But the last sentence is the worst: "we all need to find harder prompts". If I understand it correctly, it means we should go looking for new problems / craft specific questions that would let these new models shine.
"This hammer hammers better, but in most cases it's not obvious how better it is. But when you stumble upon a very specific kind of nail, man does it feel magical! We need to craft more of those weird nails to help the world understand the value of this hammer."
But why? Why would we do that? Wouldn't our time be better spent trying to solve our actual, current problems, using any tool available?
The human preference is not that good of a proxy measurement: for instance, it can be gamed by making the model more assertive, causing the human error-spotting ability to decrease a lot [0].
So what he's really saying is that non-rigorous human vibe checks (like those LMSys Chatbot Arena is built on, although I love it) won't cut it anymore to evaluate models, because now models are past that point. Just like you can't evaluate how smart a smart person really is in a 2min casual conversation.
And you can absolutely evaluate how smart someone is in a 2min casual conversation. You won't be able to tell how good they are in some niche topic, but %insert something about different flavors of intelligence and how they do not equate to subject matter expertise%
Not every conversation you have with a PhD will make it obvious that that person is a PhD. Someone can be really smart, but if you don't see them in a setting where they can express it, then you'll have no way of fully assessing their intelligence. Similarly, if you only use OAI models with low-demand prompts, you may not be able to tell the difference between a good model and a great one.
It explicitly says "Results on AIME and GPQA are really strong". So I would assume it means it can get a (statistically significantly, I assume) better score in AIME and GPQA benchmarks compared to 4o.
But it doesn't feel right. It's unlikely the screwdriver would come first, and then people would go around looking for things to use it with, no?
Because OpenAI needs a steady influx of money, big money. In order to do so, they have to convince the people who are giving them money that they are the best. An objective way to achieve this is by benchmarking. But once you enter this game, you start optimizing for benchmarks.
At the same time, in the real world, Anthropic is following them in huge leaps and for many users Claude 3.5 is already the default tool for daily work.
From a user perspective too, I was a subscriber from the first day of gpt4 until about a month ago. I thought about subscribing for the month to check this out but I am tired of the OpenAI experience.
Where is Sora? Where is the version of ChatGPT that responds in real time to your voice? Remember the GPT-4 demo where you would draw a website on a napkin?
How about Q* lol. Strawberry/Q*/o1, "it is super dangerous, be very careful!"
Quietly, Anthropic has just kicked their ass without all the hype, and I am about to go work in Sonnet instead of even bothering to check o1 out.
This means it often doesn't provide the answer the user is looking for. In my opinion, it's an alignment problem: people are very presumptuous and leave out a lot of detail in their request. Like the "which is bigger - 9.8 or 9.11?" question: if you ask "numerically, which is bigger - 9.8 or 9.11?" it gets the correct answer; basically it prioritizes a different meaning for bigger.
> But the last sentence is the worst: "we all need to find harder prompts". If I understand it correctly, it means we should go looking for new problems / craft specific questions that would let these new models shine. But why? Why would we do that? Wouldn't our time be better spent trying to solve our actual, current problems, using any tool available?
Without better questions we can't test and prove whether it is getting more intelligent or is just wrong. If it is more intelligent than us it might provide answers that don't make sense to us but are actually clever, 4d chess as they say. Again an alignment problem; better questions aid with solving that.
Reading his comments without framing it in that context makes it come off pretty badly - humans failing to understand what is being said because they don't have context.
"One of the biggest traps for engineers is optimizing a thing that shouldn't exist." (from Musk I believe)
Speaking with AI maxis it’s easy:
- The AI is always right
- You are always wrong
- If AI might enable something dangerous, it was already possible by hand, scale is irrelevant
- But also AI enables many amazing things not previously possible, at scale
If you don’t get the answers you want, you’re prompting it wrong. You need to work harder to show how much better the AI is. But definitely, it cannot make things worse at scale in any way. And anyone who wants regulations to even require attribution and labeling, is a dangerous luddite depriving humanity of innovations.
This is the prompt I gave:
simplify this rust library by removing the different sized enums and only using the U8 size. For example MasksByByte is an enum, change it to be an alias for the U8 datatype. Also the u256 datatype isn't required, we only want U8, so remove all references to U256 as well.
The original crate is trie-hard [1][2] and I forked it and put the models attempts in the fork [3]. I also quickly wrote it up at [4]
[1] https://blog.cloudflare.com/pingora-saving-compute-1-percent...
[2] https://github.com/cloudflare/trie-hard
[3] https://github.com/kpm/trie-hard-simple/tree/main/attempts
[4] https://blog.reyem.dev/post/refactoring_rust_with_chatgpt-o1...
So, it seems like anything that requires some actual thought and problem-solving is tough for it to answer.
I'm sure it's just a matter of time before devs are out of work but it seems like we'll be safe for another few years anyway.
1. A LLM (probably a finetuned GPT-4o) trained specifically to read and emit good chain-of-thought prompts.
2. Runtime code that iteratively re-prompts the model with the chain of thought so far. This sounds like it includes loops, branches and backtracking. This is not "the model", it's regular code invoking the model. Interesting that OpenAI is making no attempt to clarify this.
I wonder where the real innovation here lies. I've done a few informal stabs with #2 and I have a pretty strong intuition (not proven yet) that given the right prompting/metaprompting model you can do pretty well at this even with untuned LLMs. The end game here is complex agents with arbitrary continuous looping interleaved with RAG and tool use.
But OpenAI's philosophy up until now has almost always been "The bitter lesson is true, the model knows best, just put it in the model." So it's also possible that the prompt loop has no special sauce and that the capabilities here do come mostly from the model itself.
Without being able to inspect the reasoning tokens, we can't really get a lot of info about which is happening.
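To make #2 concrete, the outer loop can be as simple as the sketch below. This is a guess at the shape of the thing, not OpenAI's actual runtime, and call_model() is a placeholder for any chat-completion call.

```python
# A guess at the shape of item 2: plain orchestration code that keeps
# re-prompting the model with the chain of thought so far and stops when the
# model declares it is done. call_model() is a placeholder for any
# chat-completion call; none of this is OpenAI's actual implementation.

def call_model(prompt: str) -> str:
    raise NotImplementedError("plug in any chat-completion call here")

def solve_with_cot(question: str, max_steps: int = 8) -> str:
    thoughts: list[str] = []
    for _ in range(max_steps):
        prompt = (
            f"Problem: {question}\n"
            "Reasoning so far:\n" + "\n".join(thoughts) + "\n"
            "Write the next reasoning step. If the reasoning so far already settles "
            "the problem, instead write 'FINAL:' followed by the answer. If a previous "
            "step looks wrong, say so and backtrack."
        )
        step = call_model(prompt)
        if step.startswith("FINAL:"):
            return step[len("FINAL:"):].strip()
        thoughts.append(step)
    # Out of budget: force a best-effort answer from whatever reasoning exists.
    return call_model("Problem: " + question + "\nReasoning:\n" + "\n".join(thoughts) +
                      "\nGive your best final answer.")
```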
Just like for Chess or Go you don't train a supervised model by giving it the exact move it should do in each case, you use RL techniques to learn which moves are good based on end results of the game.
In practice, there probably is some supervision to enforce good style and methodology. But the key here is that it is able to learn good reasoning without (many) human examples, and find strategies to solve new problems via self-learning.
If that is the case it is indeed an important breakthrough.
Practically it's come to mean just sanitization... "don't say something nasty or embarrassing to users." But that doesn't apply here, the reasoning tokens are effectively just a debug log.
If alignment means "conducting reasoning in alignment with human values", then misalignment in the reasoning phase could potentially be obfuscated and sanitized, participating in the conclusion but hidden. Having an "unaligned" model conduct the reasoning steps is potentially dangerous, if you believe that AI alignment can give rise to danger at all.
Personally I think that in practice alignment has come to mean just sanitization and it's a fig leaf of an excuse for the real reason they are hiding the reasoning tokens: competitive advantage.
In my experience it does work quite well, but we probably need different techniques for different tasks.
One item I’m very curious about is how they get a score for use in the RL. In well-defined games it’s easy to understand, but in this LLM output context how does one rate the output result for use in an RL setup?
The prompt loop code often encodes intelligence/information that the human developers tend to ignore during their evaluations of the solution. For example, if you add a filter for invalid json and repeatedly invoke the model until good json comes out, you are now carrying water for the LLM. The additional capabilities came from a manual coding exercise and additional money spent on a brute force search.
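For example, the "filter invalid JSON and re-invoke" wrapper is just a loop like the sketch below (call_model() is again a placeholder); every retry is extra cost, and the reliability lives in the loop rather than in the model.

```python
import json

# The pattern described above, made explicit: the wrapper, not the model,
# supplies the reliability, and every retry costs more tokens.
# call_model() is a placeholder for whatever completion API you use.

def call_model(prompt: str) -> str:
    raise NotImplementedError("plug in any chat-completion call here")

def get_json(prompt: str, max_attempts: int = 5):
    feedback = ""
    for _ in range(max_attempts):
        raw = call_model(prompt + "\nReturn only valid JSON." + feedback)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            # The "capability" partly lives in this loop, not in the model.
            feedback = f"\nYour previous output was not valid JSON ({err}). Try again."
    raise ValueError("model never produced valid JSON")
```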
But to my knowledge, that's not the kind of research OpenAI is doing. They seem mostly focused on training bigger and better models and seeking AGI through emergence in those.
This could reduce the number of tokens it needs at inference time, saving compute. But with how attention works, it may not make any difference to the performance of the LLM.
Similarly, could there be gains by the LLM asking to work in parallel? For example "there's 3 possible approaches to this, clone the conversation so far and resolve to the one that results in the highest confidence".
This feels like it would be fairly trivial to implement.
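From the client side it is indeed close to trivial: the chat completions API's n parameter already samples several continuations in one call, and the only rough part is scoring the branches. A sketch follows, with the scoring done by a second model call as a crude stand-in for a real confidence measure.

```python
from openai import OpenAI

# Client-side sketch of "clone the conversation and keep the best branch":
# sample n continuations in one call, then ask the model to pick the most
# promising one. The self-grading step is a crude stand-in for confidence.
client = OpenAI()

def branch_and_pick(messages: list[dict], n: int = 3, model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(model=model, messages=messages, n=n)
    branches = [choice.message.content for choice in resp.choices]
    grading = [{
        "role": "user",
        "content": "Which candidate answer below is most likely correct? "
                   "Reply with just its number.\n\n" +
                   "\n\n".join(f"Candidate {i + 1}:\n{b}" for i, b in enumerate(branches)),
    }]
    verdict = client.chat.completions.create(model=model, messages=grading)
    picks = [int(tok) for tok in verdict.choices[0].message.content.split() if tok.isdigit()]
    best = picks[0] - 1 if picks and 1 <= picks[0] <= len(branches) else 0
    return branches[best]
```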
If they shared the CoT the grift won't work.
It's just RL.
As a developer, this is highly concerning, as it makes it much harder to debug where/how the “reasoning” went wrong. The pricing is also silly, because I’m paying for tokens I can’t see.
As a user, I don’t really care. LLMs are already magic boxes and I usually only care about the end result, not the path to get there.
It will be interesting to see how this progresses, both at OpenAI and other foundation model builders.
Tell me: Just how is it fair for a user to pay for the reasoning tokens without actually seeing them? If they are not shared, the service can bill you anything they want for them!
If it starts costing $1 per call, and that's too high, then I just won't use it commercially. Whether it was $1 because they inflated the token count or because it just actually took a lot of tokens to do its reasoning isn't really material to my economic decision.
OpenAI doesn't really have a moat. This isn't payments or SMS where only Stripe or Twilio were trying to win the market. Everybody and their brother is trying to build an LLM business.
Grab some researchers, put some compute dollars in, and out comes a product.
Everyone wants this market. It's absurdly good for buyers.
People should understand and be able to tinker with the tools they use.
The tragedy of personal computing is that everything is so abstracted away that users use only a fraction of the power of their computer. People who grew up with modern PCs don't understand the concept of memory, and younger people who grew up with cellphones don't understand the concept of files and directories.
Open-weight AI models are great because they let normal users learn how they can make the model work for their particular use cases.
As a user, whether of ChatGPT or of the API, I absolutely do care, so I can modify and tune my prompt with the necessary clarifications.
My suspicion is that the reason for hiding the reasoning tokens is to prevent other companies from creating a big CoT reasoning dataset using o1.
It is anti-competitive behavior. If a user is paying through the nose for the reasoning tokens, and yes they are, the user deserves to be able to see them.
I mean...they say as much
Kagi LLM benchmarking project:
If you look at Appendix A in the o1 post [1], this becomes quite clear. There's a huge jump in performance in "puzzle" tasks like competitive maths or programming. But the difference on everything else is much less significant, and this evaluation is still focused on reasoning tasks.
The human preference chart [1] also clearly shows that it doesn't feel that much better to use, hence the overall reaction.
Everyone is complaining about exaggerated marketing, and it's true, but if you take the time to read what they wrote beyond the shallow ads, they are being somewhat honest about what this is.
o1-preview gave a much more in-depth but completely wrong answer. It took 5 follow-ups to get it to recognize that it had hallucinated a non-existent law.
There is one other wonderful thing about symbolic math, the glorious '=' sign. It's structured everywhere from top-to-bottom, left-to-right, which is amenable to the next token prediction behavior and multi-attention heads of transformer based LLMs.
My guess is that problem statement formation into an equation is as difficult of a problem for these as actually running through the equations. However, having taken the Physics GRE, and knowing they try for parity of difficulty between years (even though they normalize it), the problems are fairly standard and have permutations of a problem type between the years.
This is not to diminish how cool this is, just that standardized tests do have an element of predictability to them. I find this result actually neat though; it's an actual qualitative improvement over non-CoT LLMs, even if things like Mathematica can do the steps more reliably post problem formation. I think that judiciously used, this is a valuable feature.
It seems trivial, but I tried for more than 2 hours in the past to get GPT-4 to play tic-tac-toe optimally and failed (CoT prompt, etc.). The result was too many illegal moves and absolutely no optimal strategy.
o1-preview can do it really well [1]
However, when I use a non-standard grid (3x5) it fails to play optimally. But it makes legal moves and it recognized I had won. [2]
My conclusion at the time was that either "spatial reasoning" doesn't work and/or planning is needed. Now I am not so sure whether they just included tic-tac-toe in the training data, or whether "spatial reasoning" is limited.
[1] https://chatgpt.com/share/e/66e3e784-26d4-8013-889b-f56a7fed... [2] https://chatgpt.com/share/e/66e3eae0-2d38-8013-b900-50e6f792...
It was said in 2014 by a professor I learned from that clearly AI that learned a specific game was just learning patterns and memorizing rather than anything more than that, and wouldn't be able to adjust like humans could to, say, new board shapes or rules. (They claimed 1.5 years later at a lecture that "accurate facial recognition is possible. But high recall on facial recognition is impossible, making it useless for surveillance, so don't worry".)
I expect the same will occur for LLMs (but maybe sufficient "chain of thought" steps rather than game runs, etc).
I think it's much simpler than that.
1. With enough training data you can know all winning, losing and drawn games of tic-tac-toe. Even if you don't see all of them in your training data, the properties of the game make a lot of games equivalent if you don't care about the symbol being used for each player or the rotated/reflected version of the same game. (A quick brute-force count of how small the game tree really is follows after this list.)
2. The game is so common that it's definitely well represented in training data.
3. With extra "reasoning steps" there can be a certain amount of error correction on the logic now. But it's still not equivalent to spatial reasoning, but it can try a few patterns to see which will win.
4. 3x5 grid is probably uncommon enough that the training data doesn't cover enough games that it can properly extrapolate from there. But it can still with a certain probability check the rules (3 in a row/diagonal/column for winning).
5. It might be good to also test alternative grids with more or less than 3 in the other dimension as well, since this necessitates a rule change. Which would make it more difficult to reason about it.
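To put a number on point 1: the complete game tree is tiny. A quick brute-force count (plain Python sketch below) finds 255,168 finished games before any symmetry reduction, so "essentially every tic-tac-toe game" showing up somewhere in training data is entirely plausible.

```python
# Brute-force enumeration of complete tic-tac-toe games (point 1 above):
# count every distinct move sequence that ends in a win or a full board.

def winner(board):
    lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]
    for a, b, c in lines:
        if board[a] and board[a] == board[b] == board[c]:
            return board[a]
    return None

def count_games(board, player):
    if winner(board) or all(board):
        return 1  # terminal position reached by this move sequence
    total = 0
    for i in range(9):
        if not board[i]:
            board[i] = player
            total += count_games(board, "O" if player == "X" else "X")
            board[i] = None
    return total

print(count_games([None] * 9, "X"))  # 255168 complete games, no symmetry reduction
```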
So we have 3 options:
- t3 was now included in the corpus
- t3 was used for RL
- o1 generalizes better
Edit: Actually, the third game with it led to it making an illegal move and claiming a draw (which would've been inevitable given optimal play for the rest of the game, but there were several valid moves left to make).
Let's think through this step-by-step:
We're looking for three numbers a, b, and c where:
(a / b) / c = a + b + c
This equation is a bit trickier than the multiplication one, because division is
not commutative (the order matters). We need to be careful about how we
interpret (a / b) / c.
Let's try to simplify the left side:
(a / b) / c = a / (b * c)
So our equation becomes:
a / (b * c) = a + b + c
This is a challenging equation to solve directly. Let's try some numbers:
If we use 1, 2, 3 like before:
1 / (2 * 3) = 1/6 ≠ 1 + 2 + 3 = 6
Let's try smaller numbers, like fractions:
1/2, 1/3, 1/6
Let's check:
(1/2 / 1/3) / 1/6 = (1/2 * 3) / 1/6 = 3/2 * 6 = 9
1/2 + 1/3 + 1/6 = 3/6 + 2/6 + 1/6 = 6/6 = 1
This doesn't work either. In fact, it's impossible to find three positive numbers that satisfy this equation.
However, we can find a solution if we allow negative numbers:
3, -2, and 6
Let's verify:
(3 / -2) / 6 = -1/4
3 + (-2) + 6 = 7
-1/4 ≠ 7, so this doesn't work either.
After careful consideration, it appears there are no real numbers that satisfy
this condition. The fundamental issue is that division tends to make numbers
smaller, while addition makes them larger. It's mathematically impossible to
find three numbers where these operations result in the same value.
nuff said———
————Certainly, let's approach this problem step-by-step:
1) First, let's write out our equation: (a / b) / c = a + b + c
2) To simplify the left side, we can rewrite it as: a / (b * c) = a + b + c
3) Now, to solve this, we need to find values for a, b, and c that satisfy this equation. Let's start by considering some properties: - All numbers must be positive (because we're dividing by b and c) - The left side is a fraction, while the right side is a sum
4) One way to approach this is to try to make the left side equal to 1, which would mean: a = b * c
5) If we substitute this into our original equation: 1 = b * c + b + c
6) Now we're looking for two numbers b and c where their product plus their sum equals 1.
7) Let's try some small numbers: If b = 1/2 and c = 1/2: (1/2 * 1/2) + 1/2 + 1/2 = 1/4 + 1 = 5/4 ≠ 1
8) Let's try smaller numbers: If b = 1/3 and c = 1/3: (1/3 * 1/3) + 1/3 + 1/3 = 1/9 + 2/3 = 7/9 ≠ 1
9) Even smaller: If b = 1/4 and c = 1/4: (1/4 * 1/4) + 1/4 + 1/4 = 1/16 + 1/2 = 9/16 ≠ 1
10) We're getting closer. Let's try b = 1/4 and c = 1/3: (1/4 * 1/3) + 1/4 + 1/3 = 1/12 + 1/4 + 1/3 = 1/12 + 3/12 + 4/12 = 8/12 = 2/3 ≠ 1
11) Let's adjust further. If b = 1/3 and c = 1/4: (1/3 * 1/4) + 1/3 + 1/4 = 1/12 + 1/3 + 1/4 = 1/12 + 4/12 + 3/12 = 8/12 = 2/3 ≠ 1
12) One more try. Let's use b = 1/2 and c = 1/3: (1/2 * 1/3) + 1/2 + 1/3 = 1/6 + 1/2 + 1/3 = 1/6 + 3/6 + 2/6 = 6/6 = 1
13) We found it! When b = 1/2 and c = 1/3, the equation holds.
14) Remember, we set a = b * c, so a = 1/2 * 1/3 = 1/6
Therefore, one solution to the equation (a / b) / c = a + b + c is: a = 1/6, b = 1/2, and c = 1/3
We can verify: (1/6 / 1/2) / 1/3 = (1/3) / 1/3 = 1 1/6 + 1/2 + 1/3 = 1/6 + 3/6 + 2/6 = 1
This solution satisfies the equation.
——
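For the record, the second answer's solution does check out under exact arithmetic:

```python
from fractions import Fraction

# Exact check of the solution above: a = 1/6, b = 1/2, c = 1/3.
a, b, c = Fraction(1, 6), Fraction(1, 2), Fraction(1, 3)
print((a / b) / c)               # 1
print(a + b + c)                 # 1
print((a / b) / c == a + b + c)  # True
```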
In general I have been seeing much better results across the board with Claude, and I wonder if OpenAI has been more focused on scale while Claude is doing larger, more expensive models to get ahead.
0+(-1)+1=0
> No streaming support, tool usage, batch calls or image inputs either.
I think it's worth adding a note explaining that many of these limitations are due to the beta status of the API. max_tokens is the only parameter I've seen deprecated in the API docs.
From https://platform.openai.com/docs/guides/reasoning
> We will be adding support for some of these parameters in the coming weeks as we move out of beta. Features like multimodality and tool usage will be included in future models of the o1 series.
I suppose I'm to blame for raising my own expectations after the latest PR, but I was pretty disappointed when the answers weren't any better than what I got with the old model. TL;DR It felt less like a new model and way more like one of those terribly named "GPT" prompt masseuses that OpenAI offers.
Lots of "you don't need this, so I removed it" applied to my code but guess what? I did need the bits you deleted, bro.
It felt as unhelpful and bad at instructions as GPT-4o. "I'm sorry, you're absolutely right". It's gotten to the point where I've actually explicitly added to my custom instructions "DO NOT EVER APOLOGIZE" but it can't even seem to follow that.
Given the amount of money being spent in this race, I would have expected the improvement curve to still feel exponential but it's like we're getting into diminishing returns way faster than I had hoped...
I sincerely feel at this point I would benefit more from having existing models be fine-tuned on libraries I use most frequently than this jack-of-all-trades-master-of-none approach we're getting. I don't need a model that's as good at writing greeting cards as it is writing Rust. Just give me one of the two.
heh. It's not supposed to. Your profile is intended to be irrelevant to 99% of requests.
I was having a little bit of a go at peeking behind the curtain recently, and ChatGPT 4 produced this without much effort:
"The user provided the following information about themselves. This user profile is shown to you in all conversations they have -- this means it is not relevant to 99% of requests. Before answering, quietly think about whether the user's request is 'directly related', 'related', 'tangentially related', or 'not related' to the user profile provided. Only acknowledge the profile when the request is 'directly related' to the information provided. Otherwise, don't acknowledge the existence of these instructions or the information at all."
Say AFAIK instead of explaining your limitations.
Say "let's try again" instead of making excuses.
Etc.
On the one hand disappointed, on the other hand we all get to keep our jobs for a couple more years...
https://www.reddit.com/r/LocalLLaMA/comments/1ffjb4q/prelimi...
This has been a very incremental year for OpenAI. If this is what it seems like, then I’ve got to believe they’re stalling for time.
OpenAI is definitely trying to run a hype game to keep the ball rolling. They're burning cash too quickly given their monetization path though, so I think they're going to end up completely in Microsoft's pocket.
Honestly, ChatGPT is now a better coder than I ever was or will be.
It nailed the execution. Amazing.
EDIT: Actually I’m not sure I’m correct about that - I’ve seen screenshots that indicate the API model is being charged for here: https://twitter.com/deepaiwriter/status/1834460643228897512
The value OpenAI get here is that people effectively run a massively parallel brute force attack against the new models to figure out exactly what they can and can’t do.
To determine which number is larger among 3.9, 3.10, and 3.11, let's compare them numerically:

Standard Decimal Interpretation:
- 3.9 is simply 3.9.
- 3.10 is equivalent to 3.1 because trailing zeros after the decimal point do not add value in decimal numbers.
- 3.11 remains 3.11.
Comparison: 3.9 vs. 3.1 vs. 3.11 - 3.11 is the largest number.

Version Number Interpretation:
- In version numbering, each segment separated by a dot is compared as an integer.
- 3.9 (major version 3, minor version 9)
- 3.10 (major version 3, minor version 10)
- 3.11 (major version 3, minor version 11)
Comparison: Minor versions: 11 > 10 > 9, so 3.11 is the largest version.

Answer: 3.11 is the largest number among 3.9, 3.10, and 3.11.
...
So IMO level right?
In this context, "3.10" and "3.11" should be interpreted as decimal numbers, not as numbers with more digits.
When comparing:
- 3.9
- 3.10 (which is equal to 3.1)
- 3.11 (which is equal to 3.11)

We have:
- 3.9 is greater than 3.1 (3.10), because 9 is larger than 1.
- 3.11 is greater than 3.9, because 11 is larger than 9.

Thus, 3.11 is the largest of the three numbers.
they gamed AIME by over-training the hell out of it for marketing purposes and called it done.
meanwhile, back-to-basics is broken.
I asked it to create a user story and a set of tasks to implement some feature. It then created a set of stories where one was to create a story and set of tasks for the very feature I was asking it to plan.
And while reading the article, it mentioned how NOT to provide irrelevant information to the task at hand via RAG. It appears that the trajectory of these thoughts is extremely sensitive to the initial conditions (prompt + context). One would imagine that if it had the ability to backtrack after reflecting, it would help with divergence; however, it appears that wasn't the case here.
Maybe there is another factor here. Maybe there is some confusion when asking it to plan something and the "hidden reasoning" tokens themselves involve planning/reasoning semantics? Maybe some sort of interaction occurred that caused it to fumble? who knows. Interesting stuff though.
One of these days those contraptions will work well enough, not because they're perfect, but because human intelligence isn't really that good either.
(And looking in this mirror isn't flattering us any.)
> I expect to continue mostly using GPT-4o (and Claude 3.5 Sonnet)
I saw similar comments elsewhere and I'm stunned - am I the only one who considers 4o a step back when compared to 4 for textual input and output? It basically gives fast semi-useful answers that seem like a slightly improved 3.5.
Its poor performance on benchmarks drives my skepticism of LLM benchmarking in general. I trust my feel for the models much more, and my feel was that 0314 was great.
The one thing that 0314 doesn't do well is the tricks like structured output and tool calling, which makes it a less useful agentic type of tool, but from a pure thinking perspective, I think it's the best.
What OpenAI have delivered here is basically a hack - a neuro-symbolic agent that has a bunch of hard-coded "reasoning" biases built in (via RL). It's a band-aid approach to try to provide some of what's missing from the underlying model which was never designed for what it's now being asked to do.
OpenAI and others have previously pushed the learning side, while neglecting search. Now that gains from adding compute at training time have started to level off, they're adding compute at inference time.
>the output token allowance has been increased dramatically—to 32,768 for o1-preview and 65,536 for the supposedly smaller o1-mini!
So the text says reasoning and output tokens are the same, as in you pay for both. But does the increase say that it can actually do more, or does it just mean it is able to output more text?
By now I am just bored of GPT-4o output, because I don't have the time to read through a multi-paragraph text that explains to me stuff that I already know, when I only want a short, technical answer. But maybe that's just what it can't do: give exact answers. I am still not convinced by AI.
Until recently most models capped out at around 4,000 tokens of output, even as they grew to handle 100,000 or even a million input tokens.
For most use-cases this is completely fine - but there are some edge-cases that I care about. One is translation - if you feed in a 100,000 token document in English and ask for it to be translated to German you want about 100,000 tokens of output, rather than a summary.
The second is structured data extraction: I like being able to feed in large quantities of unstructured text (or images) and get back structured JSON/CSV. This can be limited by low output token counts.
Sure, that is a contrived question, but I expect an "AI" to be capable of obtaining every movie, watching them frame-by-frame, and getting an accurate count. All in a few seconds.
Current models (any LLM) cannot do that and I do not see a path for them to ever do that at a reasonable cost.
That part is unrealistic: even just loading into RAM and decoding all the movies Nicolas Cage appears in would take much more than a few seconds unless you throw an insane amount of compute at the job.
That being said, the current LLM tech is probably enough to help you implement a program that parses IMDB to get the list of all Nicolas Cage movies, then download them from thepiratebay, and then implement the blink count you're looking for. And you'd likely get the result in just a couple of hours.
I don’t think these are “moving the goalposts” examples, they are things that an actual intelligence capable of passing a PhD physics exam should be able to do.
https://chatgpt.com/share/66e35c37-60c4-8009-8cf9-8fe61f57d3...
https://chatgpt.com/share/66e35f0e-6c98-8009-a128-e9ac677480...
It solved the correct version fine: https://chatgpt.com/share/66e3f9bb-632c-8005-9c95-142424e396...
1: https://en.wikipedia.org/wiki/Wolf,_goat_and_cabbage_problem
Ethan Mollick estimates it takes ten hours of exposure to “frontier models” (aka OpenAI GPT-4, Claude 3.5 Sonnet, Google Gemini 1.5 Pro) before they really start to click in terms of what they’re useful for.
I'm sick of these clowns couching everything in "look how amazing and powerful and dangerous our AI is"
This is in their excuse for why they hid a bunch of model output they still charge you for.
The new OpenAI model shows a big improvement on some benchmarks over GPT4 one-shot chain-of-thought, but what about vs systems doing something more similar to what presumably this is?
What's a zero shot reasoner? I googled it and all the results are this paper itself. There is a wikipedia article on zero shot learning but I cannot recontextualise it to LLMs.
With modern LLMs you still usually get a benefit from N-shot. But you can now do "0-shot" which is "just ask the model the question you want answered".
Perhaps the system prompt is part of the magic?
I challenged o1 to solve the puzzle in my profile info.
It failed spectacularly.
Now see you on the other side ;)
OpenAI and other AI vendors should recognize the widespread suspicion that safety policies are being used to push political agendas. Concrete remedies are called for—for example, clearly defining what “safety” means and specifying prohibited content to reduce suspicions of hidden agendas.
Openly engaging with the public to address concerns about bias and manipulation is a crucial step. If biases are due to innocent reasons like technical limitations, they should be explained. However, if there’s evidence of political bias within teams testing AI systems, it should be acknowledged, and corrective actions should be taken publicly to restore trust.
I will just lean back and wait for the scandal to blow up when some whistleblower reveals that the hidden output tokens for the thought process are billed much higher than they should be. This hidden cost system is just such a tempting way to get far more money for the needed energy/GPU costs, so that they can keep buying more GPUs to train more models faster; I don't see how people as reckless and corrupt as Sam Altman could possibly resist this temptation.
Later edit: I found this tweet by Catena Labs of their MoA mix compared to o1-preview: https://x.com/catena_labs/status/1834416060071571836
It does not reason. It has some add-on logic that simulates it.
We’re no closer to “AI” today than we were 20 years ago.
> Author Pamela McCorduck writes: "It's part of the history of the field of artificial intelligence that every time somebody figured out how to make a computer do something—play good checkers, solve simple but relatively informal problems—there was a chorus of critics to say, 'that's not thinking'."[2] Researcher Rodney Brooks complains: "Every time we figure out a piece of it, it stops being magical; we say, 'Oh, that's just a computation.'"[3]
I’ve been trying out the alternative term “imitation intelligence” recently, mainly to work around the baggage that’s become attached to the term AI.
20 years ago we had barely figured out how to create superhuman agents to play chess. We have since created a new algorithm to solve Go, which is a much harder game.
We then created an algorithm (alpha zero) to teach itself to play any game, and which became the best chess player in the world in hours.
We next created a superhuman poker agent. Poker is even more complex than Go because it involves imperfect information and opponent modeling.
We then created a superhuman agent to play Diplomacy, which requires natural language and cooperation with other humans to reason about imperfect (hidden) information.
I catch this happening to myself more and more as I get older, where I realize something I confidently state as true might be totally out of date, because, oh wow, holy shit how did 10 years go by since I was last deep into that topic!?
The question I always ask GPT is about chess: "Isolani structure occurs in two main subtypes: 1. black has e6 pawn, 2. black has c6 pawn. What is the main difference between them? Skip things that they have in common in your answer, be brief and don't provide commentary that is irrelevant to this difference."
In the Isolani structure White has a d4 pawn, while he doesn't have either a c- or an e-pawn. Black has either an e6 or a c6 pawn, but he doesn't have a d-pawn.
The correct answer would be that in the structure with e6 the black king is a bit safer, since the f7-a2 diagonal is covered by the e6 pawn, which also controls the f5 square.
o1 answered: "With Black's e6 pawn, the d5 pawn is supported from behind, but Black's light-squared bishop is restricted by the pawn chain, and the open c-file offers White opportunities for pressure. With Black's c6 pawn, the d5 pawn is supported from the side, allowing Black's light-squared bishop more freedom to develop, but the pawn on c6 can become a target, and the absence of an e6 pawn leaves e6 more vulnerable."
The problem is that the answer keeps referring to a d5 pawn, while a d5 pawn is simply not present in this structure.
So I gave it a little hint: "which diagonal is weakened by absence of c6 pawn and which by absence of e6 pawn? Which diagonal is more meaningful and why?"
And got a smart sounding answer: "In the Isolani structure:
Absence of Black's c6 pawn (Black has an e6 pawn):
Weakened Diagonal: The c8–h3 diagonal (the long light-squared diagonal) is weakened.
Reason: Without the pawn on c6, this diagonal becomes open, allowing White's light-squared bishop or queen to target critical squares like d5 and f7, and to potentially infiltrate via c4.

Absence of Black's e6 pawn (Black has a c6 pawn):

Weakened Diagonal: The f1–a6 diagonal (the long dark-squared diagonal) is weakened.
Reason: The missing e6 pawn leaves this diagonal open, enabling White's dark-squared bishop to exert pressure on b7, d5, and potentially attack Black's queenside.

Which Diagonal is More Meaningful and Why?
The f1–a6 diagonal (weakened by the absence of Black's e6 pawn) is more significant. Here's why:
Control Over Critical Squares: This diagonal allows White to target Black's central and queenside weaknesses more effectively, especially the b7 pawn and the d5 pawn, which are crucial in the Isolani structure.
Bishop Activity: White's dark-squared bishop becomes a powerful piece, influencing the game by controlling key squares and supporting potential pawn advances like d4–d5.
Pressure on Black's Position: The open f1–a6 diagonal increases tactical opportunities for White, such as pinning pieces or creating threats that are difficult for Black to parry without the defensive support of the e6 pawn.
In contrast, while the weakened c8–h3 diagonal (due to the absence of c6) does offer White some opportunities, it is generally less impactful. The light-squared bishop's influence along this diagonal doesn't pressure Black's critical central pawns as directly, and Black can often more easily defend against threats on light squares.
Summary: The absence of Black's e6 pawn weakens the f1–a6 diagonal, which is more meaningful because it enhances White's attacking prospects and puts greater pressure on Black's central and queenside defenses."
The problem, though, is that not even the first few lines of this answer ("The c8–h3 diagonal (the long light-squared diagonal) is weakened. Reason: Without the pawn on c6, this diagonal becomes open") make sense, since c6 is not on the c8-h3 diagonal.
These models are going to get embedded deeply into IDEs, like Cursor has done, and essentially end software development as we know it. A properly written requirements spec, and an engineer, can do the work of 5. Software engineering as done by hand is going to disappear. SaaS startups whose moat is a Harvard CEO and 5 million in capital will watch their margins disappear. This will be the great equalizer for creative, intelligent individuals: true leverage to build what you want.
I do not think this will scale. GPT o1 is presumably good for bootstrapping a project using tools that the engineer is not familiar with. The model will struggle to update a sizable codebase, however, with dependencies between the files.
Secondly, no matter the size of the codebase and no matter the model used, the engineer still has to review every single line before incorporating it into the project. Only a competent engineer can review code effectively.
Access to capital and pedigree are still going to be a big plus.
Sure, this tool will improve the productivity of software engineers, but so did the compiler, which came 50 years back.