Interestingly, my benchmark results for GPT-4 Turbo show the opposite: the new gpt-4-1106-preview did significantly better on the first try than the March and June models.
https://aider.chat/docs/benchmarks-1106.html
Aider benchmarks against the 133 Exercism Python exercises, not the JS exercises that mentat's benchmark uses. So this isn't an apples-to-apples comparison, but there doesn't seem to be a strong reason to expect qualitatively different results.
I also notice that the instructions prompt that mentat uses seems to be inspired by the aider benchmark? Glad to see others adopting similar benchmarking approaches.
https://github.com/AbanteAI/mentat/blob/main/tests/benchmark...
https://github.com/paul-gauthier/aider/blob/main/benchmark/p...
Edit: Not sure if the mentat authors are in this thread? After looking around a bit, there seems to be a bunch of aider code in your repo. Some attribution would be appreciated. It might even be required under aider's Apache 2.0 license?
> I also notice that the instructions prompt that mentat uses seems to be inspired by the aider benchmark? Glad to see others adopting similar benchmarking approaches.
We were inspired by you to use Exercism as a benchmark, thank you! We will add attribution for that. We switched our original instruction prompts for that benchmark to be similar to Aider's to allow for a fair comparison.
> After looking around a bit, there seems to be a bunch of aider code in your repo. Some attribution would be appreciated.
We have an unused implementation of your output response format (https://github.com/AbanteAI/mentat/blob/main/mentat/parsers/...), but I don't know what else you are seeing? We implemented that to compare with our response formats and didn't find much difference in performance.
The "code map" PR in particular mentions being "inspired by aider", links to aider and seems to include a bunch of code from aider's old ctags based "repo map" implementation. This isn't an insignificant component of an AI coding tool.
Aider is open source and I try to share what I learn as I build it. So it's great when other projects get inspiration from aider! But it is polite to provide attribution for such inspiration, especially if you crib from code with an attribution license.
[0] https://github.com/search?q=repo%3AAbanteAI%2Fmentat+aider&t...
Also THANK YOU for Aider! I talk it up to all my programmer friends; it really feels like a glimpse into the future of coding.
Wouldn't this actually be exactly proof that the model has improved over its predecessor by having to solve the problem itself rather than rely on memory?
What use is a model that memorizes the answers to all the benchmarks? (See the 7B models on the Open LLM Leaderboard for more on that.)
What is it about software development in particular that makes people so seemingly ethically unfettered by blatant plagiarism?
Thanks for asking. I've been meaning to address these kinds of questions in the aider FAQ [0]. Here's the entry I just added:
Aider supports pretty much all the popular coding languages. This is partly because GPT-4 is fluent in most mainstream languages, and familiar with popular libraries, packages and frameworks.
In fact, coding with aider is sometimes the most magical when you're working in a language that you are less familiar with. GPT often knows the language better than you, and can generate all the boilerplate to get to the heart of your problem. GPT will often solve your problem in an elegant way using a library or package that you weren't even aware of.
Aider uses tree-sitter to do code analysis and help GPT navigate larger code bases by producing a repository map [1].
Aider can currently produce repository maps for most mainstream languages, listed below. But aider should work quite well for other languages, even without repo map support.
- C
- C#
- C++
- Emacs Lisp
- Elixir
- Elm
- Go
- Java
- Javascript
- OCaml
- PHP
- Python
- QL
- Ruby
- Rust
- Typescript
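To illustrate the repo-map idea in miniature (this is NOT aider's implementation, which uses tree-sitter and supports the languages above): here's a toy sketch using only Python's stdlib `ast` module, listing the top-level definitions a file exposes so they could be shown to the model as a map.

```python
import ast

def file_map(source: str) -> list[str]:
    """One-line signatures for top-level defs/classes in Python source."""
    tree = ast.parse(source)
    entries = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            entries.append(f"def {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            entries.append(f"class {node.name}")
    return entries

src = "class Cache:\n    pass\n\ndef get(key, default=None):\n    return None\n"
# file_map(src) -> ['class Cache', 'def get(key, default)']
```

A real repo map also ranks symbols by relevance and fits the result into a token budget; this sketch only shows the "signatures without bodies" part of the idea.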
[0] https://aider.chat/docs/faq.html#what-code-languages-does-ai...
I think the evidence from these tests is not strong enough to draw conclusions from.
I see a lot of room of improvement in how we apply statistics to understanding LLM performance.
I used GPT-4 Turbo for some coding problems yesterday. It was worse. That's enough to draw conclusions for me.
Previous Model: gpt-3.5-turbo-16k, 16385 tokens context and completion (shared)
New Model: gpt-3.5-turbo-1106, 16385 tokens context, 4096 tokens completion
Previous Model: gpt-4, 8192 tokens context and completion (shared)
New Model: gpt-4-1106-preview, 128000 tokens context, 4096 tokens completion
Why would a GPT-3.5 model with the same 16K size now not allow larger completions?
Why would the new GPT-4 reduce the completion tokens as well? gpt-4 can do 8192 and gpt-4-32k can do 32768 completion tokens; now the limit is 4096.
So you would need to change the way you prompt (split the responses) to be able to get a longer response.
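One way to split responses is a continuation loop: when the completion hits the cap, feed the partial answer back and ask the model to pick up where it left off. Here's a sketch of that loop; `complete()` is a stand-in for a real chat-completion call (with the OpenAI API you would check `finish_reason == "length"` instead), and here it just simulates a model that emits a long answer in fixed-size chunks.

```python
def complete(messages, max_tokens=4096):
    """Fake completion: return the next chunk of a long canned answer.

    Returns (text, was_truncated); a real API call would report
    truncation via finish_reason == "length".
    """
    answer = "".join(f"part{i} " for i in range(10))
    # Everything the "model" has already said in this conversation:
    already = "".join(m["content"] for m in messages if m["role"] == "assistant")
    chunk = answer[len(already):][:max_tokens]
    return chunk, len(chunk) == max_tokens

def long_completion(prompt, max_tokens=6):
    """Keep asking the model to continue until output is no longer truncated."""
    messages = [{"role": "user", "content": prompt}]
    pieces = []
    while True:
        text, truncated = complete(messages, max_tokens)
        pieces.append(text)
        if not truncated:
            break
        # Feed the partial answer back and ask for the rest.
        messages.append({"role": "assistant", "content": text})
        messages.append({"role": "user",
                         "content": "Continue exactly where you left off."})
    return "".join(pieces)
```

In practice the stitching is less clean than this, since the model may repeat or rephrase around the seam.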
---
So are these new models taking the old base models with 4K tokens of context and completion and raising the context to 128000 while leaving the completion the same? If they could take gpt-4 to gpt-4-8k and gpt-4-32k, why couldn't it have been 128000 context and 32768 completion?
From my local test on a 13B model, output tokens are 20-30x more expensive than input tokens. So OpenAI's pricing structure is based on expectation that there's much more input than output tokens in an average response. It didn't matter too much if a small percentage of requests used all 4k tokens for output, but with 128k it's a different story.
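A quick back-of-the-envelope calculation shows why the asymmetry matters at 128K. The rates below are the launch pricing announced for gpt-4-1106-preview ($0.01 per 1K input tokens, $0.03 per 1K output tokens); check current rates before relying on them.

```python
def request_cost(input_tokens, output_tokens, in_rate=0.01, out_rate=0.03):
    """Dollar cost of one request at per-1K-token rates."""
    return input_tokens / 1000 * in_rate + output_tokens / 1000 * out_rate

# Typical 128K use: huge prompt, capped 4K completion.
big_prompt = request_cost(128_000, 4_096)   # 1.28 + 0.12288 = 1.40288
# The reverse ratio (hypothetical, not allowed by the 4096 cap):
big_output = request_cost(4_096, 128_000)   # 0.04096 + 3.84 = 3.88096
```

Even at 3x the input rate, the output price only covers a world where completions stay short relative to context; a full-context completion would blow that assumption up, which is consistent with the 4096 cap.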
But in practice, even when the max output token count was set as high as the context size allowed, it simply couldn't make use of it, no matter how many prompt engineering tricks I threw at it. [1] And I've heard anecdotally that the same is true for that LoRA-type technique.
[1] TL;DR, about 1/5th the actual length: write 100 pages, 3 paragraphs each, number the pages as you go and write 1 page at a time until 100. Also write out "I have written page N and need to write 100 pages total" after each page.
Inevitably it would "get tired" and be like "end page 23...now page 100"
Read the following passage from [new ML article]. Identify their assumptions, and tell me which mathematical operations or procedures they use depend upon these assumptions.
GPT-4: Usually correctly identifies the assumptions, and often quotes the correct mathematics in its reply.
GPT-4 Turbo: Sometimes identifies the assumptions, and is guaranteed to stop trying at that point and then give me a Wikipedia-like summary about the assumptions rather than finish the task. Further prompting will not improve its result.
This summarizes all my skepticism against the AI field. It's pretty clear that they aren't solving the problems; they have them memorized.
For example the function stubs I can find are "value_of_card(<card>)" in exercise "Black Jack", or "generate_seat_letters(<number>)" in exercise "Plane Tickets". I think I could guess those without seeing the rest of the question.
Using GPT-4 Turbo yesterday, I feel like I'm moving to pages of code at a time now.
Taking the ideas in my head and turning them into reality is so easy now
That doesn’t mean it’s only able to solve problems in its training set (tho it’s much better at that obviously.)
When did natural language become better than code for expressing development ideas? I know: when you don't know how to code in the first place. Then you have to bet on all the ambiguities, cultural and metaphysical, that words carry in order to hack your thing together, instead of expressing yourself directly and explicitly.
Finally, what is beautiful about the strict code formats we are so used to is that they are truly the fastest and shortest path to getting your thing done, provided you possess the knowledge needed.
These tools will empower folks who aren’t developers to build stuff and maybe learn a bit more about how programming works.
They will enable folks who have ideas, but can’t express them, to actually be able to create what they are imagining.
That’s awesome.
Code isn’t beautiful (except for a few rare exceptions). Creating something with code is.
Since I created this dataset by hand, it can't really be memorized. I'm sure there's _similar_ data in the training set, but answering correctly still requires some reasoning-like capabilities.
For my NLP pipelines, I batch n-articles together to process (extract fields from) in one prompt (final output is something like this {"1":[{}], "2": [{},{}]...}) in one message. Compute-wise it's inefficient but OpenAI charges by the token so it doesn't matter. It's very reliable on gpt-4 8k.
I was also pretty happy with the results on 4-turbo initially, but it seems that once you go past 30k-ish tokens in context (needs way more testing), it shits itself. The indexes don't match anymore and n_final_output differs from n_articles.
Still, great model and even if the limits are lower in practice I suspect I'll get good use out of it.
Edit: With better prompting, it feels stable at n=42, ~42000 prompt tokens.
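For a pipeline like the one described above, the index mismatch is cheap to detect before accepting a batch. Here's a hedged sketch (names and response shape are illustrative, matching the `{"1": [...], "2": [...]}` format mentioned, not any particular library):

```python
import json

def validate_batch(raw_response: str, n_articles: int) -> dict:
    """Parse the model's JSON and check every article index 1..n is present.

    Raises ValueError when the model dropped or invented indexes, which is
    the failure mode seen past ~30k context tokens.
    """
    data = json.loads(raw_response)
    got = sorted(int(k) for k in data)
    expected = list(range(1, n_articles + 1))
    if got != expected:
        missing = set(expected) - set(got)
        extra = set(got) - set(expected)
        raise ValueError(f"index mismatch: missing={missing}, extra={extra}")
    return data

# A well-formed response for a batch of 3 articles:
ok = validate_batch('{"1": [{}], "2": [{}, {}], "3": []}', 3)
```

A natural follow-up is to retry the failed batch at a smaller n, since the reliability seems to depend on total context length.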
Not to say that the parent post is incorrect, of course. Only that it's not as cut-and-dried as "GPT-4 Turbo is distilled (read: watered down) GPT-4".
We're transitioning from a legacy codebase full of regexes and undocumented functions that are understood only by the developer and god. The developers left and I don't believe in god. We tried throwing the unstructured mess at GPT, along with a few examples, and got surprisingly good results.
This makes me think of how hardware manufacturers optimize for benchmarks. Closed source LLMs can intentionally include likely test data in their training set to artificially inflate results. I'm not saying they are intentionally doing that now, but they could.
I agree that the author of the tweet underestimates the potential portion of OCR'ed content in OpenAI's training data. In late August, Meta released Nougat[1], an OCR model. Its performance is wild, and the model is open source.
I find it hard to believe that OpenAI doesn't spend effort on getting more training data from OCR'ed content. I also find it hard to believe that OpenAI waited for a Meta paper to have a performant internal OCR model.
This also makes me look at GPT-4 as a "weak reasoner with a lot of knowledge". That really aligns with my experience where it is immensely helpful and has a superhuman knowledge base but still needs handholding to solve real problems.
I've definitely had instances where 4 memorized a common puzzle and failed a subtly altered variant but then got the variant after changing variable names or otherwise making it look different from what it would have memorized.
```
function helloWorld() {
  return "";
}
helloWorld()
```
but those sorts of obvious examples are mostly in the beginner exercises, so I wonder what the distribution of correct answers was. If it was guessing based on function stubs, the prediction would be that correct answers cluster around the beginner exercises, with fewer correct answers as the exercises advance in difficulty.
Seems like sacrificing some quality for large gains on speed and cost but anyone know more detail?
I'm definitely curious about the context window increase -- I'm having a hard time telling whether it's 'real' vs. a fast, specially trained summarization prework step. That said, it's been doing a rather solid job of not losing info in that context window in my minor anecdotal use cases.