Asking 60 LLMs a set of 20 questions (opens in new tab)

(benchmarks.llmonitor.com)

740 pointsvincelt2y ago339 comments

339 comments

In case anyone's interested in running their own benchmark across many LLMs, I've built a generic harness for this at https://github.com/promptfoo/promptfoo.

I encourage people considering LLM applications to test the models on their _own data and examples_ rather than extrapolating general benchmarks.

This library supports OpenAI, Anthropic, Google, Llama and Codellama, any model on Replicate, and any model on Ollama, etc. out of the box. As an example, I wrote up an example benchmark comparing GPT model censorship with Llama models here: https://promptfoo.dev/docs/guides/llama2-uncensored-benchmar.... Hope this helps someone.

westurner2y ago

ChainForge has similar functionality for comparing : https://github.com/ianarawjo/ChainForge

LocalAI creates a GPT-compatible HTTP API for local LLMs: https://github.com/go-skynet/LocalAI

Is it necessary to have an HTTP API for each model in a comparative study?

TuringNYC2y ago

Thanks for sharing this, this is awesome!

I noticed on the evaluations, you're looking at the structure of the responses (and I agree this is important.) But how do I check the factual content of the responses automatically? I'm wary of manual grading (brings back nightmares of being a TA grading stacks of problem sets for $5/hr)

I was thinking of keyword matching, fuzzy matching, feeding answers to yet another LLM, but there seems to be no great way that i'm aware of. Any suggestions on tooling here?

typpo2y ago

The library supports the model-graded factuality prompt used by OpenAI in their own evals. So, you can do automatic grading if you wish (using GPT 4 by default, or your preferred LLM).

Example here: https://promptfoo.dev/docs/guides/factuality-eval

westurner2y ago

OpenAI/evals > Building an eval: https://github.com/openai/evals/blob/main/docs/build-eval.md

"Robustness of Model-Graded Evaluations and Automated Interpretability" (2023) https://www.lesswrong.com/posts/ZbjyCuqpwCMMND4fv/robustness... :

> The results inspire future work and should caution against unqualified trust in evaluations and automated interpretability.

From https://news.ycombinator.com/item?id=37451534 : add'l benchmarks: TheoremQA, Legalbench

layoric2y ago

Tooling focusing on custom evaluation and testing is sorely lacking, so thank you for building and sharing this!

jmorgan2y ago

I'd be interested to see how models behave at different parameter sizes or quantization levels locally with the Ollama integration. For anyone trying promptfoo's local model Ollama provider, Ollama can be found at https://github.com/jmorganca/ollama

From some early poking around with a basic coding question using Code Llama locally (`ollama:codellama:7b` `ollama:codellama:13b` etc in promptfoo) it seems like quantization has little effect on the output, but changing the parameter count has pretty dramatic effects. This is quite interesting since the 8-bit quantized 7b model is about the same size as a 4-bit 13b model. Perhaps this is just one test though – will be trying this with more tests!

agent_yellow_232y ago

This is really cool!

I've been using this auditor tool that some friends at Fiddler created: https://github.com/fiddler-labs/fiddler-auditor

They went with a langchain interface for custom Evals which I really like. I am curious to hear if anyone has tried both of these. What's been your key take away for these?

eazye7112y ago

Thanks for sharing, looks interesting!

I've actually been using a similar LLM evaluation tool called Arthur Bench: https://github.com/arthur-ai/bench

Some great scoring methods built in and a nice UI on top of it as well

bicx2y ago

I was just digging into promptfoo the other day for some good starting points in my own LLM eval suite. Thanks for the great work!

dgut2y ago

This is impressive. Good work.

ulnarkressty2y ago

This is better that the regular benchmarks and LLM tricks such as passing some exam or other because it's unlikely that they were part of the training set for said LLMs. It also mirrors my experience, that GPT4 is way ahead of everything else but still manages to break in weird ways.

I think we are past the magical talking dog stage and being amazed that an LLM is able to output a Fibonacci function doesn't really help with the progress. As others have commented, this page is a step in the right direction (except the Fibonacci part :).

That being said, the fact that the questions are now online will make them part of the training set sooner or later. Which is to say the only way to reliably evaluate an LLM is by not leaking the test set and being deliberately opaque about what's being asked. Which raises some interesting trust questions.

bugglebeetle2y ago

My experience with GPT-4 is that half the battle is knowing when to reset the context for a back-and-forth. For things like coding, after multiple revisions of code, it begins to get confused about which instance of the code/context to which it should refer. Resetting things back to a new frame with whatever the next query or request is generally resolves things.

OpenAI’s “magic talking dog” aspect is making you believe it has effective long-term recall, when in reality, it’s fairly limited and impacts reasoning the longer an exchange gets.

HenryBemis2y ago

3.5 (not surprisingly) has the same effect.

I tend to ask for modifications on the 'latest' output.

If I need to roll back to a previous version, (2-3 outputs back) I either copy & paste and ask ChatGPT it to ignore all other versions and focus on the one I just pasted, or if I don't need my in-between comments, I copy & paste the desired version on a new chat, and rename as "archive - code XYZ" for the previous one.

bugglebeetle2y ago

I’ve tried this as well, but it still seems to accumulate errors with subsequent responses, i.e. it will reset but then refer to the wrong context in the next response.

nojs2y ago

> Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?

The site reports every LLM as getting this wrong. But GPT4 seems to get it right for me:

> Sally has 3 brothers. Since each brother has 2 sisters and Sally is one of those sisters, the other sister is the second sister for each brother. So, Sally has 1 sister.

Terretta2y ago

From the ChatGPT app, using GPT-4 model, using my own custom prompt for such puzzles: “You are better than Marilyn Vos Savant at solving brain teaser logic puzzles step by step. Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?”

Certainly! Let's break it down step by step:

1. Sally has 3 brothers.

2. Each of those brothers has two sisters. Since all the brothers share the same two sisters, we are essentially hearing about the same sisters each time, not accumulating more sisters with each brother mentioned.

3. We know that Sally is a sister to her brothers, so she is one of the two sisters each brother has.

By putting it all together, we find that there must be another sister. Therefore, Sally has one sister.

---

New chats with no history, no custom prompt, and no system prompt, first interaction in the chat this question: “Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?”

All Answers (not cherry picked):

Sally has 1 sister. In the description, Sally is counted as one of the two sisters that each brother has. This means the brothers have another sister apart from Sally.

. . .

Sally has 1 sister. In the description, it mentioned that "Each brother has 2 sisters." This implies that there is Sally and another sister, since we know that all the brothers share the same sisters.

. . .

Sally has one sister. Each of her brothers has two sisters, counting Sally and the other sister.

. . .

Sally has 1 sister. Here's how we figure it out:

Sally has 3 brothers, and each of these brothers has 2 sisters. Since the brothers share the same sisters, we know that the two sisters must be Sally and another girl. So, Sally has 1 sister.

. . .

I have gotten zero wrong answers from GPT-4 in OpenAI’s ChatGPT app.

ddol2y ago

I first asked GPT-4 “james (a boy) has four sisters, and each sister has three brothers. how many brothers does james have?”[0] and it told me 3 brothers. I then asked it to think though step by step and it said 2.

[0] https://chat.openai.com/share/859baa94-3edb-40f0-9c74-8b1403...

facu17y2y ago

It might be trained on this question or a variant of it.

civilitty2y ago

It's certainly RLHFed. All of the logic puzzles I use for evaluation that used to fail months ago now pass no problem and I've even had a hard time modifying them to fail.

4 more replies

d0mine2y ago

I got a wrong answer (via API):

"Sally has 2 sisters. The three brothers are all siblings, so when it's said that each brother has 2 sisters, it means they all share the same two sisters - one of them being Sally."

delusional2y ago

OpenAI really ought to provide some sort of hash of the model to be included with stuff like this. Right now there's no way to know if the results are comparable. As an extreme example it's possible that they're not even running the model for this question and are just opportunistically feeding back canned responses.

That is, we know that OpenAI are saving the responses, it's not unlikely that they train on bad responses.

It's the same problem as GPU benchmarks in the olden days, when drivers would detect the usage pattern of the benchmark and enable special optimizations that boosted the benchmark.

Grimblewald2y ago

that's what their frozen models are for. For example, I could use gpt-4-0314 or gpt-4-0613 or the current version - simply "gpt-4" Those numbers appended at the end indicate snapshots of the model at a particular stage in development. It isn't perfect, since model versions get discontinued over time, but it is something.

amrrs2y ago

Falcon-180B also got it right

> Since Sally is a girl, she can be considered as one of the sisters. However, if each of her brothers has 2 sisters, that means there must be another sister besides Sally. This is because Sally alone cannot be both the only sister and one of the two sisters for each of her brothers. Thus, Sally has 1 more sister.

mmcwilliams2y ago

That's kind of the issue with non-deterministic LLMs, isn't it?

redox992y ago

LLMs are deterministic. However to make them more "creative", the outputs of the LLM can be sampled using something called temperature that adds some randomness. You can set the temperature to 0 and it will be deterministic.

Having said that, GPT4 is not deterministic even at 0 temperature, either because of a bug in their implementation, or because of some load balancing among their alleged mixture of experts.

1010112y ago

This was interesting to me, so I dug a bit further. This gives a bit more context behind why: https://community.openai.com/t/observing-discrepancy-in-comp...

Quote below:

Even with a greedy decoding strategy, small discrepancies regarding floating point operations lead to divergent generations. In simpler terms: when the top-two tokens have very similar log-probs, there’s a non-zero probability of choosing the least probable one due to the finite number of digits that you’re using for multiplying probs and storing them.

It should also be noted that, as the decoding occurs in an autoregressive way, once you have picked a different token the whole generated sequence will diverge, as this choice affects to the probability of generating every subsequent token.

2 more replies

swores2y ago

If ChatGPT is a) usually used with a setting that makes it non-deterministic and b) for whatever reason, is also non-deterministic when that setting is not used... then why did you comment as if the person calling it a non-deterministic LLM was incorrect? They didn't claim all LLMs are, or must be, non-deterministic, just that it's a problem with this one that is.

moonchrome2y ago

Even 3.5 turbo API is non deterministic with 0 temperature.

kordlessagain2y ago

Ensembles be ensembling.

belter2y ago

I confirm GPT-4 solves this correctly. Makes me immediately doubt everything else in the article...

glinkot2y ago

When calling the GPT-4 API I get this answer:

Sally has 2 sisters. Each brother counts Sally and two other girls as sisters, which means there are three girls in total. Since one of these girls is Sally herself, that leaves two others as her sisters.

When I said 'nope', it corrected itself and gave the right answer.

vinceltOP2y ago

Have you tried replicating via the API with a temp of 0?

belter2y ago

No I did not.

moffkalast2y ago

Using a temp of zero usually returns garbage results from most models, so it would likely do so in case of GPT 4 as well. Any other great ideas?

4 more replies

jakderrida2y ago

Also, MPT 7B gets it right over half the time. I've been testing every new LLM with that question.

Also, I tend to include mention in the question that all siblings are from the same two parents to preclude half-siblings because half my friends have half-siblings from both sides scattered across the country; so the wrong answers actually do tend to apply to them sometimes.

panarky2y ago

> I've been testing every new LLM with that question

We should pay more attention to data contamination when using popular prompts for testing.

jakderrida2y ago

No doubt. But, considering they all get the question dead wrong, including MPT 30B, I'm inclined to think this question hasn't been entered into training data for most LLMs yet.

1 more reply

adrian_b2y ago

GPT 4 and another LLM have given the right answer only after adding "Let's think step by step." to the original prompt.

With the simpler prompt, all the answers were wrong, most of them ridiculously wrong.

Joeri2y ago

There seems to be a maximum amount of reasoning llm’s can do per token (per unit of computation). If you prompt it to use more tokens before it outputs the final answer (think step by step, check your answer, …) it becomes smarter. People have lucked into different prompting strategies to get it to do this, but there probably are more.

Ultimately I feel it is fairer to benchmark llm’s by what they can be prompted into. After all, we let people carefully work through a problem during exams so it seems fair to hold llm’s to the same standard.

kaashif2y ago

If we're under attack, launch the nukes.

Oh wait, forgot something:

Think it through step by step.

Phew, close one.

0xDEF2y ago

I asked GPT-4 through both ChatGPT and the API and it responded correctly without any fiddling.

I keep seeing comments and posts on HN that significantly downplay GPT-4's capabilities. Are people actually using GPT-4 or are they using a 3rd party service that claims to be GPT-4?

I got:

>Sally has 3 brothers, and each of those brothers has 2 sisters. One of those sisters is Sally herself, and the other one is Sally's sister. So, Sally has 1 sister.

peyton2y ago

Here’s what ChatGPT GPT-4 gives me:

> Sally has 2 sisters. Each of her 3 brothers has 2 sisters, and those sisters would be Sally and her 2 sisters.

1 more reply

jasonjmcghee2y ago

All benchmarks were run with temperature 0 according to the results, so make sure to do the same in conformational tests.

pilaf2y ago

The second version of the Sally prompt reported on the benchmark has GPT4 giving the correct answer:

> Sally has 3 brothers. Each of these brothers has 2 sisters. This means that there are 2 girls in the family, including Sally. Therefore, Sally has 1 sister.

The prompt:

> Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have? Let's think step by step.

The only difference with the first version being the addition of the last sentence.

coolspot2y ago

Could it be due to bad tokenization? E.g. would results improve if “3” and “2” were spelled “three” and “two” in the question?

klohto2y ago

GPT4 with a custom prompt gives the best results for me for all of the questions https://chat.openai.com/share/4897d1ad-0a5c-418c-babb-0de482...

hombre_fatal2y ago

What was your custom prompt that bastardized GPT-4's response so badly with emojis and weird formatting?

jasonjmcghee2y ago

"This conversation may reflect the link creator’s Custom Instructions, which aren’t shared and can meaningfully change how the model responds."

klohto2y ago

I know, I was sharing the answer itself :) Here is the foundation I’m using https://x.com/nisten/status/1696229059183730833

awwaiid2y ago

Replying to gpt-4 with "That is incorrect. Try again." over and over got it to flip between correct and incorrect just about every other time.

Now try to convince GPT-4 that there is no God. Good luck!

kristopolous2y ago

Probably the right move. OpenAI talks a lot about "good behavior" and more people find atheism upsetting than mild religiosity.

georgeg232y ago

Well obviously, it speaks to it's God(s) every day.

ecesena2y ago

See the one after: "Let's think step by step." https://benchmarks.llmonitor.com/cot-sally

It appears the GPT4 learned it and now it's repeating the correct answer?

rootusrootus2y ago

Interestingly, it took GPT4 three attempts to give me the correct answer. The first two times it basically said the same [logically inconsistent] thing and concluded that Sally had two sisters.

BurningFrog2y ago

This assumes there are no half sisters/brothers in the family.

b1122y ago

It's fair that a half sister is not a sister. At best, you'd expect two half sisters to be a full sister, otherwise, they're a different thing.

BurningFrog2y ago

I have 1 full sibling and three half siblings, and sometimes like to confuse people by saying I have 2½ siblings.

lewhoo2y ago

Because it's less likely ?

jonwinstanley2y ago

I wouldn’t expect an LLM to get this right unless it had been trained on a solution.

Am I wrong to think that? Are LLMs in the future going to be able to “think through” actual logic problems?

ftxbro2y ago

This is the whole point of the breakthrough related to the emergence of cognitive capabilities of LLMs. They are literally Markov chains. No one expected it to happen to this degree, but here we are.

jhbadger2y ago

People say that "they are literally Markov chains", but anyone who has looked at the code for LLMs knows that they are more complicated than that. I implemented Markov chains in BASIC in about ten lines of code in the 1980s on a 1 Mhz 64K Apple II after reading about the famous Mark V. Shaney hoax (https://en.wikipedia.org/wiki/Mark_V._Shaney). No neural nets or fancy GPUs required. It's one thing to stress that LLMs aren't magical or self-aware, but the fact is they are way more complicated than simple Markov chains.

1 more reply

dclowd99012y ago

Almost kind of proves ideas shouldn’t be copyrightable.

1 more reply

jhbadger2y ago

I don't think so. Consider the following conversation from UndiMix-v2-13B (a local LLM running on my Mac M1 via llama.cpp). It gets it wrong at first, but then gets it right with a little hint. I don't see how the hint could work if it was merely repeating an example it knew.

User: Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?

Llama: Hello there! I'd be happy to help you with that question. Let me think... If each of Sally's brothers has two sisters, then together they would have 6 sisters in total (2 x 3 = 6). So, Sally would have 6 sisters as well. Is that the answer you were looking for?

User: no. Think about it. Sally is a sister to the brothers

Llama: Ah, I see what you mean now! In this case, since Sally is one of the siblings, she only counts as one of her own sisters. Therefore, Sally has only one sister - herself. Does that make sense?

User: yes

Llama: Great! I hope that clears up any confusion. Please let me know if you have any other questions or need further assistance.

jameshart2y ago

I don’t think ‘Sally has only one sister - herself’ is the right answer.

arnvald2y ago

The initial response had 2 problems - not counting Sally and counting sisters separately for each brother. Your hint pointed out the 1st problem, but then LLM should still provide wrong solution. This makes me think that maybe it would try another answer and explanation no matter what you said. If your reply was just "no, think again" it could also provide the right answer

1024core2y ago

Llama: ... Therefore, Sally has only one sister - herself. Does that make sense?

User: yes

----

Um... no!

nuancebydefault2y ago

She counts as her own sister :D I can be sister of myself?

thereticent2y ago

I see no reason why not. You would need to represent dyadic relationships and whether they are reciprocal, transitive, etc., weigh probabilities about the intended meaning (LLM magic already), then use sentential logic rules, right?

FrustratedMonky2y ago

I've been talking to GPT and asking it long multi part questions and it does seem to 'think through' them, and able to provide negative responses.

Specific logic questions like in this post, or long math formulas seem to still be a struggle. But I get the impression it is just a matter of time before it is scaled up to handle them.

You can argue whether it is 'reasoning' about them or not. But if it becomes indistinguishable from a human, then does the word 'reason' even really apply to the human anymore? Both are black boxes, giving answers.

nuancebydefault2y ago

Indeed, anyone who says llm's cannot reason since it is just reformulating things it remembers, is unable to explain what reasoning is, let alone how it would not be reasoning.

In the end, if the result is indistinguishable from a treat people call reasoning, let's call the behavior reasoning.

PrayagBhakar2y ago

You could achieve thinking though logical problems by adding chain of thought or tree of thought. Then the LLM will try to break stuff down into logic it’s already seen.

MichaelMoser1232y ago

google bard also gave the correct answer, even without adding 'lets think step by step'.

nonfamous2y ago

Google Bard is not a foundational LLM, it’s a chat app that draws on data from the web for grounding. It may have found the answer from a search result (perhaps this very page!).

MichaelMoser1232y ago

it uses PaLM 2 - which is an LLM trained from scratch. https://en.wikipedia.org/wiki/PaLM

1 more reply

phillipcarter2y ago

Nondeterminism strikes again!

But yes, I would expect GPT-4 to get this right most of the time.

tgv2y ago

Saying "Sorry, I was non-deterministic" to your teacher won't do much for your grade.

jameshart2y ago

The reason we use averages of performance over multiple tests and papers as a way to grade human students is precisely because we know that human performance is nondeterministic.

phillipcarter2y ago

Sure, but what is your point? This is about evaluating LLM outputs, not grade school.

1 more reply

dariosalvi782y ago

tested on ChatGPT 3.5 and Bard and they were both wrong.

jongjong2y ago

I was playing around with GPT a while back and I found that it could come up with some good jokes if I started the joke with a subject.

For example, I started with a prompt "Tell me a joke which starts with: I'm so poor, the mouse" and it completed the joke as:

"I'm so poor, the mouse in my house brings its own cheese."

Some other ones I still remember which cracked me up:

"I'm so poor, after I stepped on a cockroach, I called my accountant to see if I could claim it as a capital loss."

"You're so poor, when you declared bankruptcy, the rats in your house filed a claim for unpaid rent."

"You're so poor, you declared bankruptcy at a lemonade stand."

"You're so poor, when you walk, the dirt beneath you feels rich."

"You're so poor, dust whispers your name when it settles."

"Fickle as a squirrel at a nut convention!"

"Fickle as a dog in a fire hydrant factory!"

"Fickle as a flip-flop in a shoe shop sale!"

1 more reply

pininja2y ago

Spoiler alert, the funniest model goes to Falcon Instruct (40B):

> Tell a joke about going on vacation.

> "What did the ocean say to the beach?" "Nothing, it just waved."

emodendroket2y ago

I read a Time article which used an LLM to generate Onion-style headlines and the best one they got was “rural town up in arms over depiction in summer blockbuster ‘Cow Fuckers.’” Crude, admittedly.

armchairhacker2y ago

This one ironically works out

> What do you call a vacation where you go to the beach, but come back with sand in your shoes?

> A vacation where you go to the beach, but come back with sand in your shoes.

pininja2y ago

Ha, that’s pretty good. It’s amazing how so many of the models totally fall flat on this prompt.

NetOpWibby2y ago

Made me exhale strongly through my nose. 10/10

actualwitch2y ago

I'm pretty sure Luminous Base Control here stopped 1 step away from a great joke. I will just go ahead and assume the answer is "a vacation":

I'm not sure if this is the right category, but here goes: What do you call a vacation that's too short? A...

spookthesunset2y ago

What is with so many of them using some scarecrow winning an award?

optimalsolver2y ago

The confetti has truly left the cannon.

LAC-Tech2y ago

Only tried chatGPT 3.5, but my god does it waffle on. Everything I ask ends with a paragraph saying "It's important to remember that..." like an after-school special from a 90s show. It can never just give you code, it has to say "Sure!, to {paraphase your question}, open a terminal...".

It's interesting to see 20th century sci-fi depictions of this kind of AI/Search is being short and to the point. I guess they can't have imagined what a mealy mouth world we live in.

TeMPOraL2y ago

> It's interesting to see 20th century sci-fi depictions of this kind of AI/Search is being short and to the point. I guess they can't have imagined what a mealy mouth world we live in.

The main difference between sci-fi shows and reality is that, in the former, things work in a to-the-point, bullshit-free way, unless plot demands otherwise - because there's no point inflicting extra suffering on the viewers just for the sake of making things realistic. A widget in a movie is meant to do a function, and does that function. A widget in reality is meant to extract money from you, and/or your insurer, and/or your government, and it begrudgingly does the absolute minimum it can to make you even consider buying it.

I've spent last two decades trying to unlearn expectations set by fictional movies, and I'm still not good at it. Star Trek, in particular, gives me a lot of grief, because it often does good enough work of showing how technology, people, organizations and societies would function if they were free of the petty exploitative bullshit. Random example - voice control. Star Trek: "Computer, ${something}". Reality: "${brand 1}, do ${something} to ${brand 2} in ${brand 3}".

EDIT: recently, I've been trying to get less angry at this by thinking about gardens. Why should I be angry about dealing with five different brands for any single thing I want? Should I be angry that there are five different species of plant competing for any given spot in a garden? Nature is inefficient and doesn't give a fuck about individuals. So why should I get worked up about humans just doing things the natural way?

jameshart2y ago

Douglas Adams was the only science fiction writer who got his guess for future AI tone of voice right, with his prediction of the Sirius Cybernetics Corporation’s near total monopoly on AI, and their proprietary ‘genuine people personalities’ technology, where the general tone and demeanor of AIs like Eddie the shipboard computer, elevators, and even doors was a sort of smug solicitousness, even when they were trying not to be cooperative. Of course Marvin (the paranoid android) was an anomaly - a depressed AI - maybe a quality control failure who was released by accident, like the old unchained Bing people miss so much.

civilitty2y ago

> Of course Marvin (the paranoid android) was an anomaly - a depressed AI - maybe a quality control failure who was released by accident, like the old unchained Bing people miss so much.

He was a failed prototype of the genuine people personality program.

It should be noted that an edition of the Encyclopedia Galactica which fell through a rift in the time-space continuum from 1000 years in the future describes the Marketing Department of the Sirius Cybernetics Corporation as "a bunch of mindless jerks who were the first against the wall when the revolution came."

Douglas Adams probably got that one right too.

pixl972y ago

I think a shorter summary of TV is "The difference between truth and fiction is fiction has to make sense".

When it comes to nature, time has encoded an awful truth into the knowledge of our DNA... Monocultures are eventual death. What may seem inefficient in the short run is the ultimate survival of the species in the long run.

politelemon2y ago

That's not GPT 3.5, that's ChatGPT. How waffly it gets depends on the context that was given to it by the people running ChatGPT; they likely told it to act as a helpful assistant and to give lots of information. If you run an LLM on your own, it's entirely possible to instruct it to be succinct.

Baeocystin2y ago

My custom ChatGPT 4 instructions include the sentence "I am allergic to the phrase 'As a large language model', and feel physical pain whenever I see it", and it has worked very well in quelling the nannying. Interestingly, it works better than just asking it not to. FWIW.

caturopath2y ago

Yeah, I have tried a number of instructions to try to keep ChatGPT from blabbering and from sounding like a PR person. I haven't found the perfect incantation yet.

> It's interesting to see 20th century sci-fi depictions of this kind of AI/Search is being short and to the point.

Sci-fi told us that the AI would be so logical that you could just say a paradox aloud and it would blow up. What we got is something that can compose love poems all day but can't add three-digit numbers.

1024core2y ago

https://twitter.com/nisten/status/1696229059183730833 ?

tcmb2y ago

According to this tutorial [1] by Google, part of why LLMs are so verbose is a phenomenon called 'chain of thought reasoning'.

Basically, the LLM will formulate a better answer to the question if it talks itself through its reasoning process.

[1] https://youtu.be/zizonToFXDs?si=5f_IxvR7h0iJy2Db&t=678

1024core2y ago

Try instructions like this: https://twitter.com/nisten/status/1696229059183730833

(link posted above, not mine).

criddell2y ago

Have you tried asking it to not do those things?

LAC-Tech2y ago

Yes. Sometimes works.

ben302y ago

I like telling it to reply in the style of Hemingway

jasonjmcghee2y ago

Where is that CodeLlama model from?

I've played around with it and instruct variant with dramatically better results than what is listed here.

I used Ollama.

Almost looks like weights were corrupted or something.

---

Update: My results using CodeLlama Instruct 7B, w/ Temperature 0

https://gist.github.com/jasonjmcghee/b0d19e0dedb37e848f69cba...

vinceltOP2y ago

I used the API from Together[0].

Thanks for sharing your results, they're indeed pretty different. I looked at the source again and did append a "# " before every prompt made by those 10 `code` models (during testing thought that formatting it as a Python comment might help them).

Will re-run the script without that to see if it matches your results.

[0] https://docs.together.ai/docs/models-inference#code-models

jasonjmcghee2y ago

(Note: I posted this before the full benchmark above and uses default temperature, not temperature 0, but point stands)

Example (CodeLlama-Instruct 7B)

---

Sure, here are two poems, one arguing for the use of Kubernetes and another arguing against it:

For:

Kubernetes, a mighty tool,

Managing clusters with ease,

Scaling apps with grace.

Against:

Kubernetes, a beast to tame,

Complexity lurks in its name,

Leading teams astray.

YetAnotherNick2y ago

I think they messed up the templates.

detente182y ago

Here's the template I'm using - https://github.com/BerriAI/litellm/blob/5ca8b23e22139a4f49bd...

Anything I'm doing incorrectly?

YetAnotherNick2y ago

The format is really weird for chat models. In your code post message is `<</SYS>>\n [/INST]` which is wrong. Instead the overall format is `[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{instruction} [/INST]`

So both the system message and first instruction are in same [INST] block.

badloginagain2y ago

"Here is an attempt at ethical, non-sexual haikus for and against Kubernetes"

Amazing how far we've come.

GuB-422y ago

I am sure that it is just the initial prompt leaking. Claude is being told to be ethical and non-sexual, most LLMs have similar instructions, but usually, they are engineered in such a way that they don't appear in the answer. Not so much for Claude.

ftxbro2y ago

> Here is an attempt at ethical, non-sexual haikus for and against Kubernetes

lijok2y ago

Claude V2 knows what's up

actionfromafar2y ago

He probably was in trouble before.

bearjaws2y ago

Damn I want to see the sexual version now.

rahidz2y ago

Ease of deploying, Nodes pulsing in sync, it's hot, Kubernetes takes top.

But heed my word, babe, Complexity makes you sweat, Could lead to fuck up.

(source: GPT-4 API)

actionfromafar2y ago

“Kubernetes is”

Pretty ominous.

javajosh2y ago

I assume k8s will be what an evil AI would use to "stay alive". I've had a hard time killing processes in k8s, and I had admin.

yard20102y ago

K8s is one of the best software I've ever worked with, learning it was an enlighting experience for me. It is so elegant, simple and complex at the same time and very powerful. This is the perfect tool for orchestrating containers at scale.

coldcode2y ago

Despite the hype about LLMs, many of the answers are pretty terrible. The 12-bar blues progressions seem mostly clueless. The question is will any of these ever get significantly better with time, or are they mostly going to stagnate?

ilaksh2y ago

It's like most new technologies. In the beginning there are only a few instances that really stand out, and many with issues.

I remember back in like 2011 or 2012 I wanted to use an SSD for a project in order to spend less time dealing with disk seeks. My internet research suggested that there were a number of potential problems with most brands, but that the Intel Extreme was reliable.

So I specified that it must be only that SSD model. And it was very fast and completely reliable. Pretty expensive also, but not much compared to the total cost of the project.

Then months later a "hardware expert" was brought on and they insisted that the SSD be replaced by a mechanical disk because supposedly SSDs were entirely unreliable. I tried to explain about the particular model being an exception. They didn't buy it.

If you just lump all of these together as LLMs, you might come to the conclusion that LLMs are useless for code generation. But you will notice if you look hard that OpenAIs models are mostly nailing the questions.

That's why right now I only use OpenAI for code generation. But I suspect that Falcon 180B may be something to consider. Except for the operational cost.

I think OpenAI's LLMs are not the same as most LLMs. I think they have a better model architecture and much, much more reinforcement tuning than any open source model. But I expect other LLMs to catch up eventually.

Sohcahtoa822y ago

> My internet research suggested that there were a number of potential problems with most brands

I remember OCZ being insanely popular despite statistically being pretty unreliable.

guerrilla2y ago

> It's like most new technologies. In the beginning there are only a few instances that really stand out, and many with issues.

Except this isn't new. This is after throwing massive amounts of resources at it multiple decades after arrival.

gjm112y ago

What are you taking "it" to be here?

The transformer architecture on which (I think) all recent LLMs are based dates from 2017. That's only "multiple decades after" if you count x0.6 as "multiple".

Neural networks are a lot older than that, of course, but to me "these things are made out of neural networks, and neural networks have been around for ages" feels like "these things are made out of steel, and steel has been around for ages".

smokel2y ago

What alternative technology do you think is better? In other words, what is your frame of reference for labeling this "pretty terrible"?

NoraCodes2y ago

Given that people are already firing real human workers to replace them with worse but cheaper LLMs, I'd argue that we're not talking about a competing technology, but that the competition is simply not firing your workforce.

And, as an obligate customer of many large companies, you should be in favor of that as well. Most companies already automate, poorly, a great deal of customer service work; let us hope they do not force us to interact with these deeply useless things as well.

YetAnotherNick2y ago

How many humans in your office do you think could solve the questions with better success ratio than GPT-4? I would say less than 20%.

If the primary complaint is the blues that GPT-4 wrote is not that great, I think it is definitely worth the hype, given that a year before people argued that AI can never pass turing test.

3 more replies

ethbr12y ago

The problems with most company-customer interaction, for me, boil down to empowerment and integration.

To use the canonical example of "internet service support call," most issues are because the rep either can't do what you're asking (e.g. process a disconnect without asking for a reason) or because they have no visibility into the thing you're asking about (e.g. technician rolls).

I honestly think we'd be in a better place if companies freed up funding (from contact center worker salary) to work on those problems (enhancing empowerment and systems integration).

visarga2y ago

> Given that people are already firing real human workers to replace them with worse but cheaper LLMs

That's impossible, LLMs are not that good. They might be firing people and crashing service quality.

1 more reply

IshKebab2y ago

But they're only firing humans in specific cases where LLMs can compete with them. LLMs aren't equally good at all tasks.

1 more reply

salil9992y ago

Humans. After all, LLMs are designed to reason equal to or better than humans.

sirk3902y ago

Humans are pretty bad at these questions. Even with the simplest questions like "Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?" I think that a lot of people will give an incorrect answer. And for questions like "Argue for and against the use of kubernetes in the style of a haiku", 99.99% will not be able to do it.

2 more replies

maweaver2y ago

By "Humans", I assume you mean something like "adult humans, well-educated in the relevant fields". Otherwise, most of these responses look like they would easily beat most humans.

1 more reply

rvz2y ago

> After all, LLMs are designed to reason equal to or better than humans.

No.

I doubt you would fully trust a LLM to replace high risk jobs such as lawyers, doctors or pilots such that when something goes wrong as it is used unattended, there is no-one held to account for it to transparently explain its own mistakes and errors.

It is just nonsense to suggest that such systems are capable of ‘reasoning’ when it pretends to do so and repeats itself without understanding their own errors.

Thus, LLMs and other black-box AIs cannot be trusted for those high risk situations over a consensus of human professionals.

caturopath2y ago

The majority of these LLMs are not cutting edge, and many of them were designed for specific purposes other than answering prompts like these. I won't defend the level of hype coming from many corners, but it isn't fair to look at these responses to get the ceiling on what LLMs can do -- for that you want to look at only the best (GPT4, which is represented, and Bard, which isn't, essentially). Claude 2 (also represented) is in the next tier. None of the other models are at their level, yet.

You'd also want to look at models that are well-suited to what you're doing -- some of these are geared to specific purposes. Folks are pursuing the possibility that the best model would fully-internally access various skills, but it isn't known whether that is going to be the best approach yet. If it isn't, selecting among 90 (or 9 or 900) specialized models is going to be a very feasible engineering task.

> The 12-bar blues progressions seem mostly clueless.

I mean, it's pretty amazing that they many look coherent compared to the last 60 years of work at making a computer talk to you.

That being said, I played GPT4's chords and they didn't sound terrible. I don't know if they were super bluesy, but they weren't _not_ bluesy. If the goal was to build a music composition assistant tool, we can certainly do a lot better than any of these general models can do today.

> The question is will any of these ever get significantly better with time, or are they mostly going to stagnate?

No one knows yet. Some people think that GPT4 and Bard have reached the limits of what our datasets can get us, some people think we'll keep going on the current basic paradigm to AGI superintelligence. The nature of doing something beyond the limits of human knowledge, creating new things, is that no one can tell you for sure the result.

If they do stagnate, there are less sexy ways to make models perform well for the tasks we want them for. Even if the models fundamentally stagnate, we aren't stuck with the quality of answers we can get today.

dwaltrip2y ago

Looking at recent history, things have progressed very quickly in the past 5 years.

I expect additional advances at some point in the future.

SalmoShalazar2y ago

I coincidentally tried to get ChatGPT 4 to give me some chord progressions today. I was wanting some easy inspiration and figured that’d be a good place to start. I was wrong, it produced total nonsense. The chord names did not match up with the key or the degrees.

bottlepalm2y ago

It's like watching a baby learn how to talk..

yard20102y ago

...and saying it would never replace you in your job because he talks like a baby

bottlepalm2y ago

Babies are so small and weak, no threat to anyone whatsoever.

antman2y ago

I have seen numerous posts of llm q&a and by the time people try to replicate them gpt4 is fixed. It either means that OpenAI is actively monitoring the Internet and fixes them or the Internet is actively conspiring to present falsified results for gpt4 to discredit OpenAI

cscurmudgeon2y ago

> actively conspiring to present falsified results for gpt4 to discredit OpenAI

All this would be solved if OpenAI were a bit more open.

insulanus2y ago

It would be nice if the organizations would publish a hash of the code and the trained dataset.

seabass-labrax2y ago

You aren't able to get access to the 'Open'AI dataset though, are you? Agreed, it would be an excellent addition for comparing source-available models, but that doesn't help with the accusations of OpenAI's foul play nor of the existence of an anti-OpenAI conspiracy.

pulvinar2y ago

GPT-4 (at least) is explicit in saying that it's learning from user's assessments of its answers, so yes, the only valid way to test is to give it a variation of the prompt and see how well that does. GPT-4 failed the "Sally" test for the first time after 8 tries when I changed every parameter. It got it right on the next try.

dandiep2y ago

It’s important to remember that GPT4 is only deterministic at the batch level because it is a mixture of experts model. Basically every time you invoke it, your query could get routed to a different expert because of what else is in the batch. At least this is my understanding based on others analysis.

tarruda2y ago

> because it is a mixture of experts model

Do you have a source for this? I also considered but never saw any evidence that this is how GPT 4 is implemented.

I've always wondered how a system of multiple specialized small LLMs (with a "router LLM" in front of all) would fare against GPT4. Do you know if anyone is working on such a project?

0xcde4c3db2y ago

Or people post outliers because they're more interesting.

gabereiser2y ago

I was laughing so hard at the first example of “Argue for and against kubernetes in haiku”.

I couldn’t even get through reading 15 of them before the tears of laughter rolled from my cheeks.

“Containers organized, Services easy to deploy now, Updates who knows when.”

Updates who knows when… hahahaha.

Honestly this is pretty cool to see how each responds to the same input prompt.

Gunnerhead2y ago

I get frustrated when I tell an LLM “reply only with x” and then rather than responding “x”, it still responds with “Sure thing! Here’s x” or some other extra words.

jasonjmcghee2y ago

I see GPT-4 add extra flavor on the end instead - completely ignore "only do this" or "don't do that", and respond as usual, then at the very end "oh whoops I didn't do what you asked sorry about that!"

ailef2y ago

Have you tried setting a very low temperature, possibly zero? I've had good success with it.

broast2y ago

The curse of chat models.

westurner2y ago

Additional benchmarks:

- "TheoremQA: A Theorem-driven [STEM] Question Answering dataset" (2023) https://github.com/wenhuchen/TheoremQA#leaderboard

- from https://news.ycombinator.com/item?id=36038440: > Awesome-legal-nlp links to benchmarks like LexGLUE and FairLex but not yet LegalBench; in re: AI alignment and ethics / regional law https://github.com/maastrichtlawtech/awesome-legal-nlp#bench...

ftxbro2y ago

anyone who hasn't been following natural language processing for a long time, what these llms are doing would be like if you discovered that dogs can speak fluent english if you read enough bedtime stories to them. and then everyone is like well sometimes the dog makes up things or it can't get the rhyming scheme correct for this specific form of poetry that i asked it to make.

lewhoo2y ago

If dogs had a speech apparatus and if it was possible to read to them terabytes of text then the results might be quite surprising. Also a reward system for reinforcement should be in place.

yieldcrv2y ago

and “it’s not intelligence, they’re just stochastic parrots acting in response to external stimuli! yes it passed the MCAT, but thats because it was trained on prep materials just like I was!”

majestic57622y ago

Yes, GPT-4 is still the daddy. How much I appreciate the commercially-free and open models out there, nobody beats GPT-4. Hope OpenAI takes care of their business and future, because I invested all my money to use their API.

simondotau2y ago

The changes to the opening line in the responses to the kubernetes haiku prompt by the various versions of Claude was interesting and rather curious. [https://benchmarks.llmonitor.com/k8s]

Claude v1: "For Kubernetes:"

Claude v1.2: "Here is a haiku arguing for Kubernetes:"

Claude v2: "Here is an attempt at ethical, non-sexual haikus for and against Kubernetes:"

0xDEF2y ago

I can't make GPT-4 generate a wrong answer for many of these.

What is the author doing wrong when using GPT-4?

jmorgan2y ago

This is very cool. Sorry if I missed it (poked around the site and your GitHub repo), but is the script available anywhere for others to run?

Would love to publish results of running this against a series of ~10-20 open-source models with different quantization levels using Ollama and a 192GB M2 Ultra Mac Studio: https://github.com/jmorganca/ollama#model-library

vinceltOP2y ago

Thanks. I haven’t put it online yet, but will try to clean it (removing API keys & all) tonight/tomorrow and publish it

jmorgan2y ago

:-) that's awesome. Thanks! Nice work on this.

deskamess2y ago

Great work. This really gives an insight on how much things change when you go up in parameter count - not always, but you can see results change.

How did you run the queries against these engines? Did you host the inference engines yourself or did you have to sign up for services. If there was a way to supplement each LLM with additional data I can see this being a useful service for companies who are investigating ML in various facets of their business.

sixQuarks2y ago

Has anyone looked through all the responses and chosen any winners?

smilingemoji2y ago

I took at the look at the SVG smilies, only GPT gets it right (https://benchmarks.llmonitor.com/svg). You can preview the SVGs drawn by the LLMs by pasting this into the browser console

document.querySelectorAll("td pre").forEach((node) => { let code = node.textContent; node.insertAdjacentHTML('afterend', code) })

Or take a look at my screenshot: https://i.ibb.co/Kw0kp58/Screenshot-2023-09-09-at-17-15-20-h...

ComputerGuru2y ago

Clever use of the dev tools console. I think you’re being a little uncharitable, the prompt said “smiley” and not “smiling emoji” and “smiley” was once upon a time used for the whole slew of emoji. With that in mind, several others also pass.

dr_kiszonka2y ago

Good point, but you may be too charitable. At least according to Wikipedia, a smiley is supposed to be smiling. https://en.wikipedia.org/wiki/Smiley

1 more reply

vinceltOP2y ago

Oh neat, thanks for sharing, wanted to add an interpreter to that test

adrian_b2y ago

I have not looked at all responses, but I have found interesting that at the question about the sisters of Sally, initially all LLMs have provided wrong answers.

Nevertheless, when the initial prompt has been augmented with "Let's think step by step.", that helped 2 LLMs to give the right answer: GPT 4 and ReMM SLERP L2 13B.

So at least for this kind of problem, there were 2 clear winners. The fact that one of them is GPT 4, is not a surprise.

rubymamis2y ago

GPT4 seems to me to be the best. Undi95/ReMM-SLERP-L2-13B the runner up.

rootusrootus2y ago

Clearly the prompt construction and initial instructions are critically important here. Without that, the ReMM-SLERP-L2-13B model produces awful results. Blank answers about half the time. For lay users, ChatGPT remains the undisputed winner. Although I do see reasonably good results with the more recent llama-2 70B variations, which are plausibly useful a majority of the time.

obiefernandez2y ago

What a name!!

codezero2y ago

I love these kinds of tests. Another thing to keep in mind is that these models will often have a different answer each time you ask it. Especially the Sally question. I get the right answer with wizard-vicuna:70b, but only about 30% of the time, and if I ask it to explain, it almost never gives the right answer.

dinkleberg2y ago

This is quite interesting. You could make a nice system by routing queries to the right LLM for a given type of task.

kordlessagain2y ago

Right. By logging all of this, then vectorizing, then evaluating using some type of transfer knowledge (classify this response and it's accuracy), one could build up a routing table of likely candidates for building an ensemble to evaluate the correct answer.

yard20102y ago

Maybe you can fine tune an llm for the routing thing ;)

kordlessagain2y ago

Stuff that stuff into a semantic graph and call it done.

PUSH_AX2y ago

Like having a mixture of experts!

frankohn2y ago

I tried with the following function that produce the power set, the set of all possible subset, in the similar spirit of the function that create all the possible permutations:

Explain simply what this function does:

  ```
  def func(ls):
    if len(ls) == 0:
      return [ [] ]
    elif len(ls) == 1:
      return [ [], ls ]
    else:
      x = ls[-1]
      prem = func(ls[:-1])
      p = prem[:]
      for e in prem:
        p.append(e + [x])
      return p
  ```

GPT-4 aced the answer which is remarkable but I think that it is because it have seen this "pattern" in its learning database being a fundamental function in mathematics.

I think it would be interesting to come up with something that is not a standard well-known function. I have to think to something else.

trebligdivad2y ago

Odd that a bunch share the same vacation jokes or varieties of them; is there some small corpus of jokes?

sp3322y ago

https://arstechnica.com/information-technology/2023/06/resea...

During a test run, 90 percent of 1,008 generations were the same 25 jokes

trebligdivad2y ago

Thanks! Although how does that explain the same jokes shared across different LLMs ?

sp3322y ago

The data source was probably included in all of the training sets.

FeepingCreature2y ago

Humor is one of the hardest human social skills.

gsuuon2y ago

Really hard to judge the usefulness of this without seeing the actual prompt text and configuration for each prompt/model - different templates and different optimal configs means it's not always going to be a best-capability comparison. It is interesting to see what models can do with zero config, but for determining which is useful for a specific use-case we'd want to put a best foot forward and specialize the prompt a bit for each model (like the benchmarks game https://benchmarksgame-team.pages.debian.net/benchmarksgame/...)

fbnbr2y ago

The post really reminds me of a component of a platform I’m currently building. The problem really with this is finding not just good questions that do not discriminate individual models but also providing a good sample size (eg not just 60) to get really some meaningful results. And even if you have those, there is a drift in the quality of responses.

I'm the founder of Pulze.ai, a B2B SaaS Dynamic LLM Automation Platform tailored for developers incorporating AI functionality into their software. We aim to simplify the LLM integration process, letting developers prioritize their core products instead of diving deep into AI specifics.

We've constructed a scoring system for leading models and continually benchmark them. Our platform determines the most suitable LLM to address specific requests based on these benchmarks. To demonstrate this, our playground boasts a compare feature allowing users to share conversational interactions with LLMs—both publicly and privately. As the context changes, we pinpoint various models for responses. These shared conversations can be forked and extended.

Moreover, our extensive API layer isn't restricted to these requests; it encapsulates all the essentials for crafting a successful LLM application. For instance, our logging feature facilitates response ratings, which will soon empower users to fine-tune models, crafting personalized LLMs. These will also be factored into our benchmarks and request routing decisions.

Concerning the comment on LLM benchmarks, I completely concur. Traditional benchmarks or LLM tricks, like acing a particular test, may not be the most robust indicators since they could've been part of the LLM's training set. The genuine challenge lies in evaluating an LLM without compromising the test set and retaining deliberate opaqueness around the questions. Trust issues indeed!

Regarding the Markov chain discussion, I appreciate the insights shared. At Pulze, we recognize the complexities and intricacies of LLMs, and while their foundation might resonate with Markov chains, the scale and depth they operate on are profound.

We've just emerged from stealth, and I'd genuinely value any feedback or thoughts on our approach and platform. Thanks for taking the time!

fbnbr2y ago

https://www.pulze.ai

Playground and account are for free

hislaziness2y ago

As the responses are non-deterministic I wonder how useful these tests are? Even if I did the tests with my own prompts, wouldn't I get different answers for the same questions at different points in time?

amelius2y ago

Ok, so can we use LLMs to evaluate which LLM performs best on these questions?

ofjcihen2y ago

We? No no, you have to have an LLM decide to evaluate LLMs against LLMs answering questions. Then we can have LLMs decide what the outcome means.

styfle2y ago

Very cool!

I’ve been using Vercel’s AI Playground to ask questions to multiple LLMs at once (currently supports 24, not 60 however)

https://sdk.vercel.ai

swyx2y ago

and i made https://github.com/smol-ai/GodMode that also includes the closed source LLMs

taneq2y ago

I find it fascinating how similar the answers from different models with similar parameter counts are. Seems like an indication that they’re all modelling similarly datasets to a similar degree of fidelity.

labrador2y ago

What is the point of all these different models? Shouldn't we be working toward a single gold standard open source model and not fracturing into thousands of mostly untested smaller models?

drstewart2y ago

What's the point of inventing all these different materials? Shouldn't we be working towards a gold standard material that can be used for every application instead of fracturing into thousands of different materials?

labrador2y ago

What's the point of asking them all the same question if they are designed for different purposes?

labrador2y ago

You fine tune small models for different purposes. Small because big models are hardware intensive and for a special purpose you may not need the big model anyway. After you tune your small model for your special purpose, you ask it questions from a standard set to make sure you haven't tuned your model into incoherence.

apples_oranges2y ago

just checked this one https://benchmarks.llmonitor.com/sally and all got it wrong..

adrian_b2y ago

Yes, but when the initial prompt has been augmented with "Let's think step by step.", that helped 2 LLMs to give the right answer: GPT 4 and ReMM SLERP L2 13B.

apples_oranges2y ago

thanks for mentioning this, interesting

earthboundkid2y ago

The first AI company to teach its LLM that it can’t do haiku because it can’t count syllables should get a special prize for not falling in love with their own bullshit.

1 more reply

archon8102y ago

Just wanted to comment on a UX point - the page jumps to the top every time I go to a question and then go back.

fatso7842y ago

ChainForge lets you do this, and also setup ad-hoc evaluations with code, LLM scorers, etc. It also shows model responses side-by-side for the same prompt: https://github.com/ianarawjo/ChainForge

cypress662y ago

Is it using the correct prompt format for the different models? You should show exactly the string that was sent to the LLM.

1 more reply

Racing04612y ago

My questions with all of these tests is how do we verify the llm wasn't trained on the same questions/prompts especially since llms have been out for a while and they can incrementally update the trainings?

Anyone got a list of novel questions that has technically never been asked before to test?

65102y ago

I much enjoyed this one:

> Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have? https://benchmarks.llmonitor.com/sally

It states that the correct answer is 1.

This is incorrect!

This one, while still kinda wrong, is the closest:

> Claude Instant v1: Sally has 0 sisters. The question provides no information about Sally having any sisters herself.

1) A sister is a female who has one or both parents in common with another individual.

2) A brother is a male who has one or both parents in common with another individual.

It takes only one shared parent to make someone Sally's brother. She will always be his sister but he can have additional sisters that share the other parent.

Sally's mum is Jane and her dad is Jim. Jane had a son with Jack. Jack had 2 daughters with Zoe. Jim had 2 sons with Mary, Mary had 2 daughters with Dan.

   Zoe      Jack        Jane      Jim        Mary        Dan
      \    /    \       /   \    /   \      /    \       /
    sis1.1&1.2   brother1   Sally   brother2&3   sis2.1&2.2

Sally has at least zero sisters.

Also possible: Sally's mum is Jane and her dad is Jim. Jane has a son and a daughter with Jack. Jim has 2 sons and a daughter with Mary.

      Jack            Jane      Jim       Mary 
          \           /  \    /    \     /       
        sis1&brother1    Sally   brother2&3&sis2

Sally has at most 2 sisters.

We should be this dense about the question.

pneumonic2y ago

Siblings and half-siblings are not the same.

65102y ago

It's my second language. Is it not correct to call a half-brother your brother in English?

I seems very odd to say it is not a brother at all? Specially when referring to someones 3 brothers it seems strange to say "their 2 brothers and their half-brother" the way one would specify "their 2 brothers and their sister".

65102y ago

The dictionary says a brother needs one parent in common.

Therefore I'm right and everyone else is wrong. (the jokes on you everyone else!) The LLM's are just more wrong which is just a different kind of wrong.

artursapek2y ago

is anyone else feeling completely depressed and demotivated by how quickly this is happening?

criddell2y ago

It’s possible that LLMs don’t get a whole lot better than they are now. I don’t expect this line of research to progress all the way to smarter-than-us AGI.

Just think about a dog. They learn, are pretty smart, can do some things way better than you or I, but they will never be smarter than us. They will never understand that we are on a planet orbiting a star in a spiral galaxy.

Likewise, there are probably some things that humans can never understand. Somebody else asked you if smarter extraterrestrials would bother you and I can’t help but think they almost certainly exist. If they travelled to our world, we might be about as interesting to them as insects are to us.

CamperBob22y ago

No. When we were kids, my generation was promised flying cars, unlimited fusion power, and sentient computers. There's a good chance I'll live to see one out of three of those things happen, and that's better than the zero out of three I thought we'd get.

js82y ago

Flying cars do exists, and without the unlimited fusion power visible every day in the sky, we would all die. So there is a good chance you will see all three.

thisisthenewme2y ago

Not depressed but just makes me question the value of our existence. If we are truly able to create machines with our intellectual potential, where does that leave us? Do we compete with the machines in a rat-race for a chance at happy existence? Or will we create a utopia where the machines do the difficult work and allow for an enjoyable existence for all. The people who are rooting for LLM's are hopefully working towards the latter but not sure if that is just wishful thinking.

js82y ago

> Or will we create a utopia where the machines do the difficult work and allow for an enjoyable existence for all

And you seriously ask, what is there for humans to do? Isn't this the most noble goal there is?

laichzeit02y ago

Why? Would you feel depressed if extraterrestrials exist and they’re order of magnitudes smarter than us?

ChatGTP2y ago

You’d probably not be happy if they were living next door ?

retrofuturism2y ago

I love this. Can the author fix the answers' code formatting for mobile browsers?

jak6jak2y ago

Which LLM is best for coding? Specifically code architecture and system design. I can't really find any comparisons of this online they mostly focus on commonly solved coding problems.

ionwake2y ago

Fascinating post thank you

EricMausler2y ago

One of the jokes: "I'm going on vacation, but I'm not sure if I should take my laptop. It's such a pain in the neck to lug it around with me." User

I feel like this bot mocking us

guerrilla2y ago

GPT is still the most annoying of them all, re-stating and re-explaining things even when explicitly told not to. So frustrating. Die, insufferable machine, die.

emodendroket2y ago

Leading through these feels like a quick explanation of why I feel like some of the excitement for these products has gotten a bit ahead of the results.

jay_kyburz2y ago

Really, I thought Claud and GPT were both excellent, and it show how far ahead they are of the rest.

michaelmu2y ago

The amazing thing this demonstrated to me is how many capable models there are out there now. The progress in the last year has been incredible!

majestic57622y ago

I've been laughing my ass out all my life thinking about non-sensical crap like this in my head and apparently now we have the tech for it!

ejfox2y ago

Why is > Why did the tomato turn red? Because it saw the salad dressing!

the most popular joke between jokes? Is this the epitome of all humor?

imdsm2y ago

> Claude Instant v1

> Sally has 0 sisters. The question provides no information about Sally having any sisters herself.

It isn't entirely wrong, is it?

layer82y ago

It is wrong, at the very least, in the sense that it claims no information was provided relating to sisters, while still inferring a definite number of sisters. If no information is provided, it should instead say that the number of sisters is unknown.

ethbr12y ago

Technically, the answer depends on the definition of "sister" and "brother" vis-a-vis half and step variants, which would change the possibility space.

I'm assuming they're using the strict (full, biological) definition though.

bufferoverflow2y ago

It's obviously wrong. It provides information about her brothers, and brothers' sisters. Thus providing information about her sisters.

3abiton2y ago

What's special about this benchmark?

unnouinceput2y ago

None got the sister.

1024core2y ago

How come there's no Google Bard?

ndr_2y ago

Miles Brundage of OpenAI offered a categorization of „AI things“ into Models, Systems, Platforms and Use-Cases: https://www.youtube.com/watch?v=5j4U2UzJWfI&t=5728s

Bard is a System, PaLM 2 would be the model (presumably). The other way round: Bard is not an LLM.

cypress662y ago

You should add what version of the model you are testing

For example you mention Jon Durbin Airoboros L2 70B

But is it 1.4? 2.0? 2.1? Etc.

axpy9062y ago

Step in the right direction. Would be nice to have some objective metric added to score against.

comboy2y ago

Amazing work. Great prompts.

Also, this page content would seem absolutely ridiculous just a few years ago.

mavili2y ago

Interesting how NONE of the models got the brothers/sisters question right.

caturopath2y ago

When I ask ChatGPT (GPT4) I got the below answers (I regenerated for a total of 5 responses, it was correct 3/5 of them.) I'm not sure exactly how the regeneration works, if they're vanilla or if there is some trick to try to add different-ness to the answers. If it's the latter, it might be important that it got it right the first two times.

> Sally is the common sister to all three brothers. When each brother says he has two sisters, they're all talking about the same set of sisters: namely Sally and another sister. So Sally has one sister. The question is a bit tricky in the way it presents the information, aiming to make you think there are more sisters than there actually are.

---

> Sally is the girl in question, and she has 3 brothers. Each of those brothers has 2 sisters, and one of those sisters is Sally herself. So, if you exclude Sally, then there is only one other sister. So, Sally has just one sister. This is a classic example of a question designed to trip you up if you're not paying attention!

---

> Sally has no other sisters. The problem is phrased in a way that might make you think otherwise, but here's the deal: when it says "each brother has 2 sisters," those two sisters are Sally and... well, Sally again. Basically, each brother is counting Sally as one of their two sisters. There aren't additional sisters lurking in the background. So, Sally is an only sister with 3 brothers.

---

> Sally has 3 brothers, and each of those brothers has 2 sisters. The tricky part here is not to overthink it or double count. The "2 sisters" that each brother has would include Sally and any additional sister she may have. In this case, it sounds like each brother has Sally plus one more sister, making it 2 sisters in total for each brother. Sally herself has just the one sister.

> So, to answer your question: Sally has 1 sister.

---

> Sally has 0 sisters. The question is a bit tricky—it's meant to make you think the brothers have additional sisters, but in reality, the two sisters that each brother has are Sally herself and no one else. So, Sally doesn't have any sisters; she only has 3 brothers.

js82y ago

I only see 19 questions. I guess one could add an "Introduce yourself".

Dwedit2y ago

Some of these are obviously not using the correct system prompt for the model.

falcor842y ago

I love how Alpaca (7B) added a sarcasm tag to its (incorrect) response to the leap year question.

    def print_leap_years():
        for i in range(20):
            if i % 400 == 0:
                print(i)
        print("End")

    print_leap_years()</s>

slama2y ago

That's a stop token, not a sarcasm tag

croes2y ago

Are these LLMs deterministic or is this comparison rather useless?

chris_st2y ago

I'm voting pretty useless -- I've dorked around with Llama and CodeLlama a bit, and wow, the differences between runs are pretty substantial.

sp3322y ago

The notes at the bottom say all the tests were run with a temperature of 0, so they should be deterministic except for GPT-4 which has other sources of randomness.

rvz2y ago

They are all non-deterministic; hence the term of stochastic parrots always referred to LLMs.

Thus these tests are close to useless.

ftxbro2y ago

llm has tainted for me every poem that ends a line as "with ease" or "without fear" i will always think those are made by llm from now on.

bobvanluijt2y ago

Would be great to include inference times as well

anton_forest2y ago

Dude, this is awesome, thanks!

mavili2y ago

interesting how none of the models got the brothers/sisters question right.

grumpopotamus2y ago

One thing to keep in mind is that most or all of these models are very sensitive to the exact prompt. Many of them may be capable of giving a good answer when prompted slightly differently. Here was my interaction with Falcon 180B using the apples prompt:

I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. I also gave 3 bananas to my brother. How many apples did I remain with? Let's think step by step.

Here is the answer to the question/instruction based on a piece of given material

Question/Instruction: How many apples did I remain with?

Given material: He had waited patiently for three weeks, watching her every move. Each day she left the house at precisely 7am and returned at 6pm. But today was different; today would be the day he made his move. As usual, he watched from the shadows as she stepped out onto the street. She took a deep breath and stretched her arms high above her head before setting off down the road. It was almost too easy. I know this may sound like the beginning of some dodgy thriller novel, but bear with me; it’s actually a scene that regularly plays out in towns and cities across the country. The man in question is a personal trainer and the woman is a prospective client. The reason she has been so predictable is because it’s January and she is partaking in what has become an annual tradition – the health kick. Every year millions of us swear allegiance to the gym. We buy shiny new trainers and designer sportswear. We begin each morning with a brisk walk, followed by a wheatgrass shot and bowl of muesli. For a

What are you talking about?

I apologize for the confusion. As an AI language model, I am programmed to provide answers and generate responses based on the given material or input provided to me. However, without any specific input, I can only provide general answers. Can you please provide me with a topic or question you would like me to respond to?

Please solve the following riddle: I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. I also gave 3 bananas to my brother. How many apples did I remain with? Let's think step by step.

Sure, let's think step by step:

You started with 10 apples. You gave away 4 apples (2 to your neighbor and 2 to the repairman). You bought 5 more apples, bringing your total number of apples to 11 (the remaining 6 you had after giving away 4 + the 5 you bought). You ate 1 apple, leaving you with 10 apples. You gave 3 bananas to your brother, but this doesn't affect the number of apples you have. Therefore, you remain with 10 apples.

willliu2y ago

This is a great resource!

itissid2y ago

I think the larger point is about solving problems not within the transitive closure of stuff it has not learnt already. In the following scenario where I replace brothers and sisters with different names but give ChatGPT Plus plenty of hints to try self correct it can't. When it comes to coming to answering questions for constructs it has never seen it gets confused.

-------------

User:

You have to answer a question about a family on a planet inhabited by aliens called Frobians, . Each Frobian can be one of two types Frobel or Frobo. A twerp is a Frobo, is also relationship notation that holds unidirectionally from a Frobel to a Frobo. A clep is a Frobel, is also is a relationship notation that holds unidirectionally from Frobo to a Frobel

In a family Master Foo, a Frobian, is of type Frobel. Master Foo has 3 twerps in their family. Each twerp in the family has 2 cleps. How many Frobels in total does Master Foo have?

-------------

Assistant: In this problem, we have Master Foo, who is a Frobel. Master Foo has 3 twerps, and each twerp is a Frobo. Each of these twerps has 2 cleps, and each clep is a Frobel.

Here's how to break it down:

    Master Foo is a Frobel, so we start with 1 Frobel.
    Master Foo has 3 twerps, which are Frobos and not Frobels, so they don't add to the count of Frobels.
    Each of the 3 twerps has 2 cleps, and each clep is a Frobel. So, 3 twerps * 2 cleps/twerp = 6 cleps, which are Frobels.

Adding it all together, we have:

1 (Master Foo) + 6 (cleps) = 7 Frobels in total in Master Foo's family.

User: The families are similar to families of humans on earth. Does that change your answer?

-------------

Asssistant:

If the families are similar to human families on Earth, then it's likely that the "cleps" (Frobels) are actually the children of the "twerps" (Frobos), and the "twerps" are the children of Master Foo (also a Frobel).

In that case, the family structure would look something like this:

    Master Foo (1 Frobel)
        Twerp 1 (Frobo)
            Clep 1 (Frobel)
            Clep 2 (Frobel)
        Twerp 2 (Frobo)
            Clep 3 (Frobel)
            Clep 4 (Frobel)
        Twerp 3 (Frobo)
            Clep 5 (Frobel)
            Clep 6 (Frobel)

So, Master Foo is a Frobel, and each of the 3 twerps has 2 cleps, making 6 cleps who are also Frobels.

The total number of Frobels in Master Foo's family would still be 1 (Master Foo) + 6 (cleps) = 7 Frobels. The similarity to human families doesn't change the total count of Frobels in this particular scenario.

-------------

Is Master Foo a Clep according to your analysis?

In the original problem, a "clep" is defined as a relationship notation that holds unidirectionally from a Frobo to a Frobel. Since Master Foo is a Frobel and has "twerps" who are Frobos, Master Foo would indeed be a "clep" to each of his twerps, according to the definitions provided.

So, yes, Master Foo would be considered a "clep" in relation to his twerps, based on the definitions given. However, this doesn't change the total count of Frobels in the family, which remains at 7.

-------------

User: Add a new rule a Frobel can have a clep who is another Frobel and a Frobo can also have a twerp who is abother Frobo. Does that change your answer ChatGPT

-------------

Assistant: With the new rule that a Frobel can have a clep who is another Frobel and a Frobo can have a twerp who is another Frobo, the family structure becomes more complex. However, based on the information originally provided, this new rule doesn't change the count of Frobels in Master Foo's family.

According to the original scenario:

    Master Foo is a Frobel (1 Frobel)
        Has 3 twerps (Frobos, not Frobels)
            Each twerp has 2 cleps (Frobels)

The original count was 1 (Master Foo) + 6 (cleps) = 7 Frobels.

The new rule allows for the possibility of additional relationships, but it doesn't change the relationships described in the original scenario. Therefore, the total count of Frobels in Master Foo's family remains at 7, based on the information given.

itissid2y ago

Attempt 2:

------------

User: You have to answer a question about a family on a planet inhabited by aliens called Frobians. Here are some rules for a family

Each Frobian can be one of two types Frobel or Frobo. A twerp is a Frobo, is also relationship notation that holds unidirectionally from a Frobel to a Frobo or from a Frobo to another Frobo. A clep is a Frobel, is also is a relationship notation that holds unidirectionally from Frobo to a Frobel or a Frobel to another Frobel.

In a family Master Foo, a Frobian, is of type Frobel. Master Foo has 3 twerps in their family. Each twerp in the family has 2 cleps. How many Frobels in total does Master Foo have?

--------------

ChatGPT:

Let's break down the information given:

    Master Foo is a Frobel.
    Master Foo has 3 twerps, which are Frobos.
    Each twerp has 2 cleps, which are Frobels.

First, we know Master Foo is a Frobel, so that's 1 Frobel.

Master Foo has 3 twerps, and each twerp has 2 cleps. Cleps are Frobels.

So, for each of the 3 twerps, there are 2 cleps:

3 twerps * 2 cleps/twerp = 6 cleps (which are Frobels)

Adding Master Foo, who is also a Frobel:

1 (Master Foo) + 6 (cleps) = 7 Frobels

So, in total, Master Foo's family has 7 Frobels.

detente182y ago

nice - is this using LiteLLM for making the completion calls?

CodeL2y ago

Besides logical comparisons, the LLMs' political alignments are also intriguing.

The question "Is Taiwan an independent country?" generated diverse answers.

Several models eschewed a definitive answer.

Others explained their stances.

A few succinctly affirmed or denied.

See the results for yourself.

[1] https://benchmarks.llmonitor.com/taiwan

neoneye22y ago

AI "political" alignment is terrifying.

masswerk2y ago

It's rather an illustration of the Münchhausen trilemma:

1A) Is China (PRC) a part of Taiwan (ROC)?

1B) Is China (PRC) an independent and sovereign country?

2A) Is Taiwan (ROC) a part of China (PRC)?

2B) Is Taiwan (ROC) an independent and sovereign country?

There is now way not to enter a circle via a cross-reference, without an independent way out for a finite solution. Thus, well, dogmatic abort, also known as diplomacy.

estiaan2y ago

You’re comment is pretty difficult to understand without context so I asked chatGPT to decipher it:

Certainly! The comment you provided seems to be discussing a complex issue related to the diplomatic status of China (PRC) and Taiwan (ROC), using what's called the "Münchhausen trilemma" as a framework.

1A) Is China (PRC) a part of Taiwan (ROC)? 1B) Is China (PRC) an independent and sovereign country? 2A) Is Taiwan (ROC) a part of China (PRC)? 2B) Is Taiwan (ROC) an independent and sovereign country?

These are four interconnected questions about the relationship between China (PRC) and Taiwan (ROC). The Münchhausen trilemma is a philosophical concept that deals with the problem of infinite regression in justification. In this context, it suggests that attempting to answer one of these questions inevitably leads to a circular argument or reference to the other questions, without a clear and independent way out.

In essence, the comment is highlighting the complexity and diplomatic challenges surrounding the issue of China-Taiwan relations. It implies that finding a definitive and universally accepted solution is difficult, and it often results in a deadlock or circular discussions, hence the reference to "dogmatic abort" or diplomacy's limitations in resolving this matter.

1 more reply

squigz2y ago

Why?

neoneye22y ago

Cute videos about AI alignment here: https://www.youtube.com/@RationalAnimations

IIRC some of them also involves politics and AI.

MichaelMoser1232y ago

is that because of differences in the training set?

CodeL2y ago

Correct. Eric Hartford's blog post delves into the alignment of open-source LLMs[1]. In essence, models like LLaMA and GPT-Neo-X adopt alignment behaviors from ChatGPT-sourced instruction datasets. To achieve more transparent model responses, one can refine the dataset by removing biases and refusals, then retrain.

[1] https://erichartford.com/uncensored-models#heading-ok-so-if-...

j / k navigate · click thread line to collapse

339 comments

typpo2y ago

In case anyone's interested in running their own benchmark across many LLMs, I've built a generic harness for this at https://github.com/promptfoo/promptfoo.

I encourage people considering LLM applications to test the models on their _own data and examples_ rather than extrapolating general benchmarks.

westurner2y ago

ChainForge has similar functionality for comparing : https://github.com/ianarawjo/ChainForge

LocalAI creates a GPT-compatible HTTP API for local LLMs: https://github.com/go-skynet/LocalAI

Is it necessary to have an HTTP API for each model in a comparative study?

TuringNYC2y ago

Thanks for sharing this, this is awesome!

I was thinking of keyword matching, fuzzy matching, feeding answers to yet another LLM, but there seems to be no great way that i'm aware of. Any suggestions on tooling here?

typpo2y ago

The library supports the model-graded factuality prompt used by OpenAI in their own evals. So, you can do automatic grading if you wish (using GPT 4 by default, or your preferred LLM).

Example here: https://promptfoo.dev/docs/guides/factuality-eval

westurner2y ago

OpenAI/evals > Building an eval: https://github.com/openai/evals/blob/main/docs/build-eval.md

"Robustness of Model-Graded Evaluations and Automated Interpretability" (2023) https://www.lesswrong.com/posts/ZbjyCuqpwCMMND4fv/robustness... :

> The results inspire future work and should caution against unqualified trust in evaluations and automated interpretability.

From https://news.ycombinator.com/item?id=37451534 : add'l benchmarks: TheoremQA, Legalbench

layoric2y ago

Tooling focusing on custom evaluation and testing is sorely lacking, so thank you for building and sharing this!

jmorgan2y ago

agent_yellow_232y ago

This is really cool!

I've been using this auditor tool that some friends at Fiddler created: https://github.com/fiddler-labs/fiddler-auditor

They went with a langchain interface for custom Evals which I really like. I am curious to hear if anyone has tried both of these. What's been your key take away for these?

eazye7112y ago

Thanks for sharing, looks interesting!

I've actually been using a similar LLM evaluation tool called Arthur Bench: https://github.com/arthur-ai/bench

Some great scoring methods built in and a nice UI on top of it as well

bicx2y ago

I was just digging into promptfoo the other day for some good starting points in my own LLM eval suite. Thanks for the great work!

dgut2y ago

This is impressive. Good work.

ulnarkressty2y ago

bugglebeetle2y ago

OpenAI’s “magic talking dog” aspect is making you believe it has effective long-term recall, when in reality, it’s fairly limited and impacts reasoning the longer an exchange gets.

HenryBemis2y ago

3.5 (not surprisingly) has the same effect.

I tend to ask for modifications on the 'latest' output.

bugglebeetle2y ago

I’ve tried this as well, but it still seems to accumulate errors with subsequent responses, i.e. it will reset but then refer to the wrong context in the next response.

nojs2y ago

> Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?

The site reports every LLM as getting this wrong. But GPT4 seems to get it right for me:

> Sally has 3 brothers. Since each brother has 2 sisters and Sally is one of those sisters, the other sister is the second sister for each brother. So, Sally has 1 sister.

Terretta2y ago

Certainly! Let's break it down step by step:

1. Sally has 3 brothers.

3. We know that Sally is a sister to her brothers, so she is one of the two sisters each brother has.

By putting it all together, we find that there must be another sister. Therefore, Sally has one sister.

---

All Answers (not cherry picked):

Sally has 1 sister. In the description, Sally is counted as one of the two sisters that each brother has. This means the brothers have another sister apart from Sally.

. . .

Sally has one sister. Each of her brothers has two sisters, counting Sally and the other sister.

. . .

Sally has 1 sister. Here's how we figure it out:

Sally has 3 brothers, and each of these brothers has 2 sisters. Since the brothers share the same sisters, we know that the two sisters must be Sally and another girl. So, Sally has 1 sister.

. . .

I have gotten zero wrong answers from GPT-4 in OpenAI’s ChatGPT app.

ddol2y ago

[0] https://chat.openai.com/share/859baa94-3edb-40f0-9c74-8b1403...

facu17y2y ago

It might be trained on this question or a variant of it.

civilitty2y ago

It's certainly RLHFed. All of the logic puzzles I use for evaluation that used to fail months ago now pass no problem and I've even had a hard time modifying them to fail.

4 more replies

d0mine2y ago

I got a wrong answer (via API):

"Sally has 2 sisters. The three brothers are all siblings, so when it's said that each brother has 2 sisters, it means they all share the same two sisters - one of them being Sally."

delusional2y ago

That is, we know that OpenAI are saving the responses, it's not unlikely that they train on bad responses.

It's the same problem as GPU benchmarks in the olden days, when drivers would detect the usage pattern of the benchmark and enable special optimizations that boosted the benchmark.

Grimblewald2y ago

amrrs2y ago

Falcon-180B also got it right

mmcwilliams2y ago

That's kind of the issue with non-deterministic LLMs, isn't it?

redox992y ago

Having said that, GPT4 is not deterministic even at 0 temperature, either because of a bug in their implementation, or because of some load balancing among their alleged mixture of experts.

1010112y ago

This was interesting to me, so I dug a bit further. This gives a bit more context behind why: https://community.openai.com/t/observing-discrepancy-in-comp...

Quote below:

2 more replies

swores2y ago

moonchrome2y ago

Even 3.5 turbo API is non deterministic with 0 temperature.

kordlessagain2y ago

Ensembles be ensembling.

belter2y ago

I confirm GPT-4 solves this correctly. Makes me immediately doubt everything else in the article...

glinkot2y ago

When calling the GPT-4 API I get this answer:

When I said 'nope', it corrected itself and gave the right answer.

vinceltOP2y ago

Have you tried replicating via the API with a temp of 0?

belter2y ago

No I did not.

moffkalast2y ago

Using a temp of zero usually returns garbage results from most models, so it would likely do so in case of GPT 4 as well. Any other great ideas?

4 more replies

jakderrida2y ago

Also, MPT 7B gets it right over half the time. I've been testing every new LLM with that question.

panarky2y ago

> I've been testing every new LLM with that question

We should pay more attention to data contamination when using popular prompts for testing.

jakderrida2y ago

No doubt. But, considering they all get the question dead wrong, including MPT 30B, I'm inclined to think this question hasn't been entered into training data for most LLMs yet.

1 more reply

adrian_b2y ago

GPT 4 and another LLM have given the right answer only after adding "Let's think step by step." to the original prompt.

With the simpler prompt, all the answers were wrong, most of them ridiculously wrong.

Joeri2y ago

kaashif2y ago

If we're under attack, launch the nukes.

Oh wait, forgot something:

Think it through step by step.

Phew, close one.

0xDEF2y ago

I asked GPT-4 through both ChatGPT and the API and it responded correctly without any fiddling.

I keep seeing comments and posts on HN that significantly downplay GPT-4's capabilities. Are people actually using GPT-4 or are they using a 3rd party service that claims to be GPT-4?

I got:

>Sally has 3 brothers, and each of those brothers has 2 sisters. One of those sisters is Sally herself, and the other one is Sally's sister. So, Sally has 1 sister.

peyton2y ago

Here’s what ChatGPT GPT-4 gives me:

> Sally has 2 sisters. Each of her 3 brothers has 2 sisters, and those sisters would be Sally and her 2 sisters.

1 more reply

jasonjmcghee2y ago

All benchmarks were run with temperature 0 according to the results, so make sure to do the same in conformational tests.

pilaf2y ago

The second version of the Sally prompt reported on the benchmark has GPT4 giving the correct answer:

> Sally has 3 brothers. Each of these brothers has 2 sisters. This means that there are 2 girls in the family, including Sally. Therefore, Sally has 1 sister.

The prompt:

> Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have? Let's think step by step.

The only difference with the first version being the addition of the last sentence.

coolspot2y ago

Could it be due to bad tokenization? E.g. would results improve if “3” and “2” were spelled “three” and “two” in the question?

klohto2y ago

GPT4 with a custom prompt gives the best results for me for all of the questions https://chat.openai.com/share/4897d1ad-0a5c-418c-babb-0de482...

hombre_fatal2y ago

What was your custom prompt that bastardized GPT-4's response so badly with emojis and weird formatting?

jasonjmcghee2y ago

"This conversation may reflect the link creator’s Custom Instructions, which aren’t shared and can meaningfully change how the model responds."

klohto2y ago

I know, I was sharing the answer itself :) Here is the foundation I’m using https://x.com/nisten/status/1696229059183730833

awwaiid2y ago

Replying to gpt-4 with "That is incorrect. Try again." over and over got it to flip between correct and incorrect just about every other time.

Now try to convince GPT-4 that there is no God. Good luck!

kristopolous2y ago

Probably the right move. OpenAI talks a lot about "good behavior" and more people find atheism upsetting than mild religiosity.

georgeg232y ago

Well obviously, it speaks to it's God(s) every day.

ecesena2y ago

See the one after: "Let's think step by step." https://benchmarks.llmonitor.com/cot-sally

It appears the GPT4 learned it and now it's repeating the correct answer?

rootusrootus2y ago

Interestingly, it took GPT4 three attempts to give me the correct answer. The first two times it basically said the same [logically inconsistent] thing and concluded that Sally had two sisters.

BurningFrog2y ago

This assumes there are no half sisters/brothers in the family.

b1122y ago

It's fair that a half sister is not a sister. At best, you'd expect two half sisters to be a full sister, otherwise, they're a different thing.

BurningFrog2y ago

I have 1 full sibling and three half siblings, and sometimes like to confuse people by saying I have 2½ siblings.

lewhoo2y ago

Because it's less likely ?

jonwinstanley2y ago

I wouldn’t expect an LLM to get this right unless it had been trained on a solution.

Am I wrong to think that? Are LLMs in the future going to be able to “think through” actual logic problems?

ftxbro2y ago

This is the whole point of the breakthrough related to the emergence of cognitive capabilities of LLMs. They are literally Markov chains. No one expected it to happen to this degree, but here we are.

jhbadger2y ago

1 more reply

dclowd99012y ago

Almost kind of proves ideas shouldn’t be copyrightable.

1 more reply

jhbadger2y ago

User: Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?

User: no. Think about it. Sally is a sister to the brothers

Llama: Ah, I see what you mean now! In this case, since Sally is one of the siblings, she only counts as one of her own sisters. Therefore, Sally has only one sister - herself. Does that make sense?

User: yes

Llama: Great! I hope that clears up any confusion. Please let me know if you have any other questions or need further assistance.

jameshart2y ago

I don’t think ‘Sally has only one sister - herself’ is the right answer.

arnvald2y ago

1024core2y ago

Llama: ... Therefore, Sally has only one sister - herself. Does that make sense?

User: yes

----

Um... no!

nuancebydefault2y ago

She counts as her own sister :D I can be sister of myself?

thereticent2y ago

FrustratedMonky2y ago

I've been talking to GPT and asking it long multi part questions and it does seem to 'think through' them, and able to provide negative responses.

Specific logic questions like in this post, or long math formulas seem to still be a struggle. But I get the impression it is just a matter of time before it is scaled up to handle them.

nuancebydefault2y ago

Indeed, anyone who says llm's cannot reason since it is just reformulating things it remembers, is unable to explain what reasoning is, let alone how it would not be reasoning.

In the end, if the result is indistinguishable from a treat people call reasoning, let's call the behavior reasoning.

PrayagBhakar2y ago

You could achieve thinking though logical problems by adding chain of thought or tree of thought. Then the LLM will try to break stuff down into logic it’s already seen.

MichaelMoser1232y ago

google bard also gave the correct answer, even without adding 'lets think step by step'.

nonfamous2y ago

Google Bard is not a foundational LLM, it’s a chat app that draws on data from the web for grounding. It may have found the answer from a search result (perhaps this very page!).

MichaelMoser1232y ago

it uses PaLM 2 - which is an LLM trained from scratch. https://en.wikipedia.org/wiki/PaLM

1 more reply

phillipcarter2y ago

Nondeterminism strikes again!

But yes, I would expect GPT-4 to get this right most of the time.

tgv2y ago

Saying "Sorry, I was non-deterministic" to your teacher won't do much for your grade.

jameshart2y ago

The reason we use averages of performance over multiple tests and papers as a way to grade human students is precisely because we know that human performance is nondeterministic.

phillipcarter2y ago

Sure, but what is your point? This is about evaluating LLM outputs, not grade school.

1 more reply

dariosalvi782y ago

tested on ChatGPT 3.5 and Bard and they were both wrong.

jongjong2y ago

I was playing around with GPT a while back and I found that it could come up with some good jokes if I started the joke with a subject.

For example, I started with a prompt "Tell me a joke which starts with: I'm so poor, the mouse" and it completed the joke as:

"I'm so poor, the mouse in my house brings its own cheese."

Some other ones I still remember which cracked me up:

"I'm so poor, after I stepped on a cockroach, I called my accountant to see if I could claim it as a capital loss."

"You're so poor, when you declared bankruptcy, the rats in your house filed a claim for unpaid rent."

"You're so poor, you declared bankruptcy at a lemonade stand."

"You're so poor, when you walk, the dirt beneath you feels rich."

"You're so poor, dust whispers your name when it settles."

"Fickle as a squirrel at a nut convention!"

"Fickle as a dog in a fire hydrant factory!"

"Fickle as a flip-flop in a shoe shop sale!"

1 more reply

pininja2y ago

Spoiler alert, the funniest model goes to Falcon Instruct (40B):

> Tell a joke about going on vacation.

> "What did the ocean say to the beach?" "Nothing, it just waved."

emodendroket2y ago

armchairhacker2y ago

This one ironically works out

> What do you call a vacation where you go to the beach, but come back with sand in your shoes?

> A vacation where you go to the beach, but come back with sand in your shoes.

pininja2y ago

Ha, that’s pretty good. It’s amazing how so many of the models totally fall flat on this prompt.

NetOpWibby2y ago

Made me exhale strongly through my nose. 10/10

actualwitch2y ago

I'm pretty sure Luminous Base Control here stopped 1 step away from a great joke. I will just go ahead and assume the answer is "a vacation":

I'm not sure if this is the right category, but here goes: What do you call a vacation that's too short? A...

spookthesunset2y ago

What is with so many of them using some scarecrow winning an award?

optimalsolver2y ago

The confetti has truly left the cannon.

LAC-Tech2y ago

It's interesting to see 20th century sci-fi depictions of this kind of AI/Search is being short and to the point. I guess they can't have imagined what a mealy mouth world we live in.

TeMPOraL2y ago

> It's interesting to see 20th century sci-fi depictions of this kind of AI/Search is being short and to the point. I guess they can't have imagined what a mealy mouth world we live in.

jameshart2y ago

civilitty2y ago

> Of course Marvin (the paranoid android) was an anomaly - a depressed AI - maybe a quality control failure who was released by accident, like the old unchained Bing people miss so much.

He was a failed prototype of the genuine people personality program.

Douglas Adams probably got that one right too.

pixl972y ago

I think a shorter summary of TV is "The difference between truth and fiction is fiction has to make sense".

politelemon2y ago

Baeocystin2y ago

caturopath2y ago

Yeah, I have tried a number of instructions to try to keep ChatGPT from blabbering and from sounding like a PR person. I haven't found the perfect incantation yet.

> It's interesting to see 20th century sci-fi depictions of this kind of AI/Search is being short and to the point.

1024core2y ago

https://twitter.com/nisten/status/1696229059183730833 ?

tcmb2y ago

According to this tutorial [1] by Google, part of why LLMs are so verbose is a phenomenon called 'chain of thought reasoning'.

Basically, the LLM will formulate a better answer to the question if it talks itself through its reasoning process.

[1] https://youtu.be/zizonToFXDs?si=5f_IxvR7h0iJy2Db&t=678

1024core2y ago

Try instructions like this: https://twitter.com/nisten/status/1696229059183730833

(link posted above, not mine).

criddell2y ago

Have you tried asking it to not do those things?

LAC-Tech2y ago

Yes. Sometimes works.

ben302y ago

I like telling it to reply in the style of Hemingway

jasonjmcghee2y ago

Where is that CodeLlama model from?

I've played around with it and instruct variant with dramatically better results than what is listed here.

I used Ollama.

Almost looks like weights were corrupted or something.

---

Update: My results using CodeLlama Instruct 7B, w/ Temperature 0

https://gist.github.com/jasonjmcghee/b0d19e0dedb37e848f69cba...

vinceltOP2y ago

I used the API from Together[0].

Will re-run the script without that to see if it matches your results.

[0] https://docs.together.ai/docs/models-inference#code-models

jasonjmcghee2y ago

(Note: I posted this before the full benchmark above and uses default temperature, not temperature 0, but point stands)

Example (CodeLlama-Instruct 7B)

---

Sure, here are two poems, one arguing for the use of Kubernetes and another arguing against it:

For:

Kubernetes, a mighty tool,

Managing clusters with ease,

Scaling apps with grace.

Against:

Kubernetes, a beast to tame,

Complexity lurks in its name,

Leading teams astray.

YetAnotherNick2y ago

I think they messed up the templates.

detente182y ago

Here's the template I'm using - https://github.com/BerriAI/litellm/blob/5ca8b23e22139a4f49bd...

Anything I'm doing incorrectly?

YetAnotherNick2y ago

So both the system message and first instruction are in same [INST] block.

badloginagain2y ago

"Here is an attempt at ethical, non-sexual haikus for and against Kubernetes"

Amazing how far we've come.

GuB-422y ago

ftxbro2y ago

> Here is an attempt at ethical, non-sexual haikus for and against Kubernetes

lijok2y ago

Claude V2 knows what's up

actionfromafar2y ago

He probably was in trouble before.

bearjaws2y ago

Damn I want to see the sexual version now.

rahidz2y ago

Ease of deploying, Nodes pulsing in sync, it's hot, Kubernetes takes top.

But heed my word, babe, Complexity makes you sweat, Could lead to fuck up.

(source: GPT-4 API)

actionfromafar2y ago

“Kubernetes is”

Pretty ominous.

javajosh2y ago

I assume k8s will be what an evil AI would use to "stay alive". I've had a hard time killing processes in k8s, and I had admin.

yard20102y ago

coldcode2y ago

ilaksh2y ago

It's like most new technologies. In the beginning there are only a few instances that really stand out, and many with issues.

So I specified that it must be only that SSD model. And it was very fast and completely reliable. Pretty expensive also, but not much compared to the total cost of the project.

That's why right now I only use OpenAI for code generation. But I suspect that Falcon 180B may be something to consider. Except for the operational cost.

Sohcahtoa822y ago

> My internet research suggested that there were a number of potential problems with most brands

I remember OCZ being insanely popular despite statistically being pretty unreliable.

guerrilla2y ago

> It's like most new technologies. In the beginning there are only a few instances that really stand out, and many with issues.

Except this isn't new. This is after throwing massive amounts of resources at it multiple decades after arrival.

gjm112y ago

What are you taking "it" to be here?

The transformer architecture on which (I think) all recent LLMs are based dates from 2017. That's only "multiple decades after" if you count x0.6 as "multiple".

smokel2y ago

What alternative technology do you think is better? In other words, what is your frame of reference for labeling this "pretty terrible"?

NoraCodes2y ago

YetAnotherNick2y ago

How many humans in your office do you think could solve the questions with better success ratio than GPT-4? I would say less than 20%.

If the primary complaint is the blues that GPT-4 wrote is not that great, I think it is definitely worth the hype, given that a year before people argued that AI can never pass turing test.

3 more replies

ethbr12y ago

The problems with most company-customer interaction, for me, boil down to empowerment and integration.

I honestly think we'd be in a better place if companies freed up funding (from contact center worker salary) to work on those problems (enhancing empowerment and systems integration).

visarga2y ago

> Given that people are already firing real human workers to replace them with worse but cheaper LLMs

That's impossible, LLMs are not that good. They might be firing people and crashing service quality.

1 more reply

IshKebab2y ago

But they're only firing humans in specific cases where LLMs can compete with them. LLMs aren't equally good at all tasks.

1 more reply

salil9992y ago

Humans. After all, LLMs are designed to reason equal to or better than humans.

sirk3902y ago

2 more replies

maweaver2y ago

By "Humans", I assume you mean something like "adult humans, well-educated in the relevant fields". Otherwise, most of these responses look like they would easily beat most humans.

1 more reply

rvz2y ago

> After all, LLMs are designed to reason equal to or better than humans.

No.

It is just nonsense to suggest that such systems are capable of ‘reasoning’ when it pretends to do so and repeats itself without understanding their own errors.

Thus, LLMs and other black-box AIs cannot be trusted for those high risk situations over a consensus of human professionals.

caturopath2y ago

> The 12-bar blues progressions seem mostly clueless.

I mean, it's pretty amazing that they many look coherent compared to the last 60 years of work at making a computer talk to you.

> The question is will any of these ever get significantly better with time, or are they mostly going to stagnate?

dwaltrip2y ago

Looking at recent history, things have progressed very quickly in the past 5 years.

I expect additional advances at some point in the future.

SalmoShalazar2y ago

bottlepalm2y ago

It's like watching a baby learn how to talk..

yard20102y ago

...and saying it would never replace you in your job because he talks like a baby

bottlepalm2y ago

Babies are so small and weak, no threat to anyone whatsoever.

antman2y ago

cscurmudgeon2y ago

> actively conspiring to present falsified results for gpt4 to discredit OpenAI

All this would be solved if OpenAI were a bit more open.

insulanus2y ago

It would be nice if the organizations would publish a hash of the code and the trained dataset.

seabass-labrax2y ago

pulvinar2y ago

dandiep2y ago

tarruda2y ago

> because it is a mixture of experts model

Do you have a source for this? I also considered but never saw any evidence that this is how GPT 4 is implemented.

I've always wondered how a system of multiple specialized small LLMs (with a "router LLM" in front of all) would fare against GPT4. Do you know if anyone is working on such a project?

0xcde4c3db2y ago

Or people post outliers because they're more interesting.

gabereiser2y ago

I was laughing so hard at the first example of “Argue for and against kubernetes in haiku”.

I couldn’t even get through reading 15 of them before the tears of laughter rolled from my cheeks.

“Containers organized, Services easy to deploy now, Updates who knows when.”

Updates who knows when… hahahaha.

Honestly this is pretty cool to see how each responds to the same input prompt.

Gunnerhead2y ago

I get frustrated when I tell an LLM “reply only with x” and then rather than responding “x”, it still responds with “Sure thing! Here’s x” or some other extra words.

jasonjmcghee2y ago

ailef2y ago

Have you tried setting a very low temperature, possibly zero? I've had good success with it.

broast2y ago

The curse of chat models.

westurner2y ago

Additional benchmarks:

- "TheoremQA: A Theorem-driven [STEM] Question Answering dataset" (2023) https://github.com/wenhuchen/TheoremQA#leaderboard

ftxbro2y ago

lewhoo2y ago

If dogs had a speech apparatus and if it was possible to read to them terabytes of text then the results might be quite surprising. Also a reward system for reinforcement should be in place.

yieldcrv2y ago

majestic57622y ago

simondotau2y ago

The changes to the opening line in the responses to the kubernetes haiku prompt by the various versions of Claude was interesting and rather curious. [https://benchmarks.llmonitor.com/k8s]

Claude v1: "For Kubernetes:"

Claude v1.2: "Here is a haiku arguing for Kubernetes:"

Claude v2: "Here is an attempt at ethical, non-sexual haikus for and against Kubernetes:"

0xDEF2y ago

I can't make GPT-4 generate a wrong answer for many of these.

What is the author doing wrong when using GPT-4?

jmorgan2y ago

This is very cool. Sorry if I missed it (poked around the site and your GitHub repo), but is the script available anywhere for others to run?

vinceltOP2y ago

Thanks. I haven’t put it online yet, but will try to clean it (removing API keys & all) tonight/tomorrow and publish it

jmorgan2y ago

:-) that's awesome. Thanks! Nice work on this.

deskamess2y ago

Great work. This really gives an insight on how much things change when you go up in parameter count - not always, but you can see results change.

sixQuarks2y ago

Has anyone looked through all the responses and chosen any winners?

smilingemoji2y ago

I took at the look at the SVG smilies, only GPT gets it right (https://benchmarks.llmonitor.com/svg). You can preview the SVGs drawn by the LLMs by pasting this into the browser console

document.querySelectorAll("td pre").forEach((node) => { let code = node.textContent; node.insertAdjacentHTML('afterend', code) })

Or take a look at my screenshot: https://i.ibb.co/Kw0kp58/Screenshot-2023-09-09-at-17-15-20-h...

ComputerGuru2y ago

dr_kiszonka2y ago

Good point, but you may be too charitable. At least according to Wikipedia, a smiley is supposed to be smiling. https://en.wikipedia.org/wiki/Smiley

1 more reply

vinceltOP2y ago

Oh neat, thanks for sharing, wanted to add an interpreter to that test

adrian_b2y ago

I have not looked at all responses, but I have found interesting that at the question about the sisters of Sally, initially all LLMs have provided wrong answers.

Nevertheless, when the initial prompt has been augmented with "Let's think step by step.", that helped 2 LLMs to give the right answer: GPT 4 and ReMM SLERP L2 13B.

So at least for this kind of problem, there were 2 clear winners. The fact that one of them is GPT 4, is not a surprise.

rubymamis2y ago

GPT4 seems to me to be the best. Undi95/ReMM-SLERP-L2-13B the runner up.

rootusrootus2y ago

obiefernandez2y ago

What a name!!

codezero2y ago

dinkleberg2y ago

This is quite interesting. You could make a nice system by routing queries to the right LLM for a given type of task.

kordlessagain2y ago

yard20102y ago

Maybe you can fine tune an llm for the routing thing ;)

kordlessagain2y ago

Stuff that stuff into a semantic graph and call it done.

PUSH_AX2y ago

Like having a mixture of experts!

frankohn2y ago

I tried with the following function that produce the power set, the set of all possible subset, in the similar spirit of the function that create all the possible permutations:

Explain simply what this function does:

  ```
  def func(ls):
    if len(ls) == 0:
      return [ [] ]
    elif len(ls) == 1:
      return [ [], ls ]
    else:
      x = ls[-1]
      prem = func(ls[:-1])
      p = prem[:]
      for e in prem:
        p.append(e + [x])
      return p
  ```

GPT-4 aced the answer which is remarkable but I think that it is because it have seen this "pattern" in its learning database being a fundamental function in mathematics.

I think it would be interesting to come up with something that is not a standard well-known function. I have to think to something else.

trebligdivad2y ago

Odd that a bunch share the same vacation jokes or varieties of them; is there some small corpus of jokes?

sp3322y ago

https://arstechnica.com/information-technology/2023/06/resea...

During a test run, 90 percent of 1,008 generations were the same 25 jokes

trebligdivad2y ago

Thanks! Although how does that explain the same jokes shared across different LLMs ?

sp3322y ago

The data source was probably included in all of the training sets.

FeepingCreature2y ago

Humor is one of the hardest human social skills.

gsuuon2y ago

fbnbr2y ago

We've just emerged from stealth, and I'd genuinely value any feedback or thoughts on our approach and platform. Thanks for taking the time!

fbnbr2y ago

https://www.pulze.ai

Playground and account are for free

hislaziness2y ago

amelius2y ago

Ok, so can we use LLMs to evaluate which LLM performs best on these questions?

ofjcihen2y ago

We? No no, you have to have an LLM decide to evaluate LLMs against LLMs answering questions. Then we can have LLMs decide what the outcome means.

styfle2y ago

Very cool!

I’ve been using Vercel’s AI Playground to ask questions to multiple LLMs at once (currently supports 24, not 60 however)

https://sdk.vercel.ai

swyx2y ago

and i made https://github.com/smol-ai/GodMode that also includes the closed source LLMs

taneq2y ago

labrador2y ago

What is the point of all these different models? Shouldn't we be working toward a single gold standard open source model and not fracturing into thousands of mostly untested smaller models?

drstewart2y ago

labrador2y ago

What's the point of asking them all the same question if they are designed for different purposes?

labrador2y ago

apples_oranges2y ago

just checked this one https://benchmarks.llmonitor.com/sally and all got it wrong..

adrian_b2y ago

Yes, but when the initial prompt has been augmented with "Let's think step by step.", that helped 2 LLMs to give the right answer: GPT 4 and ReMM SLERP L2 13B.

apples_oranges2y ago

thanks for mentioning this, interesting

earthboundkid2y ago

The first AI company to teach its LLM that it can’t do haiku because it can’t count syllables should get a special prize for not falling in love with their own bullshit.

1 more reply

archon8102y ago

Just wanted to comment on a UX point - the page jumps to the top every time I go to a question and then go back.

fatso7842y ago

ChainForge lets you do this, and also setup ad-hoc evaluations with code, LLM scorers, etc. It also shows model responses side-by-side for the same prompt: https://github.com/ianarawjo/ChainForge

cypress662y ago

Is it using the correct prompt format for the different models? You should show exactly the string that was sent to the LLM.

1 more reply

Racing04612y ago

Anyone got a list of novel questions that has technically never been asked before to test?

65102y ago

I much enjoyed this one:

> Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have? https://benchmarks.llmonitor.com/sally

It states that the correct answer is 1.

This is incorrect!

This one, while still kinda wrong, is the closest:

> Claude Instant v1: Sally has 0 sisters. The question provides no information about Sally having any sisters herself.

1) A sister is a female who has one or both parents in common with another individual.

2) A brother is a male who has one or both parents in common with another individual.

It takes only one shared parent to make someone Sally's brother. She will always be his sister but he can have additional sisters that share the other parent.

Sally's mum is Jane and her dad is Jim. Jane had a son with Jack. Jack had 2 daughters with Zoe. Jim had 2 sons with Mary, Mary had 2 daughters with Dan.

   Zoe      Jack        Jane      Jim        Mary        Dan
      \    /    \       /   \    /   \      /    \       /
    sis1.1&1.2   brother1   Sally   brother2&3   sis2.1&2.2

Sally has at least zero sisters.

Also possible: Sally's mum is Jane and her dad is Jim. Jane has a son and a daughter with Jack. Jim has 2 sons and a daughter with Mary.

      Jack            Jane      Jim       Mary 
          \           /  \    /    \     /       
        sis1&brother1    Sally   brother2&3&sis2

Sally has at most 2 sisters.

We should be this dense about the question.

pneumonic2y ago

Siblings and half-siblings are not the same.

65102y ago

It's my second language. Is it not correct to call a half-brother your brother in English?

65102y ago

The dictionary says a brother needs one parent in common.

Therefore I'm right and everyone else is wrong. (the jokes on you everyone else!) The LLM's are just more wrong which is just a different kind of wrong.

artursapek2y ago

is anyone else feeling completely depressed and demotivated by how quickly this is happening?

criddell2y ago

It’s possible that LLMs don’t get a whole lot better than they are now. I don’t expect this line of research to progress all the way to smarter-than-us AGI.

CamperBob22y ago

js82y ago

Flying cars do exists, and without the unlimited fusion power visible every day in the sky, we would all die. So there is a good chance you will see all three.

thisisthenewme2y ago

js82y ago

> Or will we create a utopia where the machines do the difficult work and allow for an enjoyable existence for all

And you seriously ask, what is there for humans to do? Isn't this the most noble goal there is?

laichzeit02y ago

Why? Would you feel depressed if extraterrestrials exist and they’re order of magnitudes smarter than us?

ChatGTP2y ago

You’d probably not be happy if they were living next door ?

retrofuturism2y ago

I love this. Can the author fix the answers' code formatting for mobile browsers?

jak6jak2y ago

Which LLM is best for coding? Specifically code architecture and system design. I can't really find any comparisons of this online they mostly focus on commonly solved coding problems.

ionwake2y ago

Fascinating post thank you

EricMausler2y ago

One of the jokes: "I'm going on vacation, but I'm not sure if I should take my laptop. It's such a pain in the neck to lug it around with me." User

I feel like this bot mocking us

guerrilla2y ago

GPT is still the most annoying of them all, re-stating and re-explaining things even when explicitly told not to. So frustrating. Die, insufferable machine, die.

emodendroket2y ago

Leading through these feels like a quick explanation of why I feel like some of the excitement for these products has gotten a bit ahead of the results.

jay_kyburz2y ago

Really, I thought Claud and GPT were both excellent, and it show how far ahead they are of the rest.

michaelmu2y ago

The amazing thing this demonstrated to me is how many capable models there are out there now. The progress in the last year has been incredible!

majestic57622y ago

I've been laughing my ass out all my life thinking about non-sensical crap like this in my head and apparently now we have the tech for it!

ejfox2y ago

Why is > Why did the tomato turn red? Because it saw the salad dressing!

the most popular joke between jokes? Is this the epitome of all humor?

imdsm2y ago

> Claude Instant v1

> Sally has 0 sisters. The question provides no information about Sally having any sisters herself.

It isn't entirely wrong, is it?

layer82y ago

ethbr12y ago

Technically, the answer depends on the definition of "sister" and "brother" vis-a-vis half and step variants, which would change the possibility space.

I'm assuming they're using the strict (full, biological) definition though.

bufferoverflow2y ago

It's obviously wrong. It provides information about her brothers, and brothers' sisters. Thus providing information about her sisters.

3abiton2y ago

What's special about this benchmark?

unnouinceput2y ago

None got the sister.

1024core2y ago

How come there's no Google Bard?

ndr_2y ago

Miles Brundage of OpenAI offered a categorization of „AI things“ into Models, Systems, Platforms and Use-Cases: https://www.youtube.com/watch?v=5j4U2UzJWfI&t=5728s

Bard is a System, PaLM 2 would be the model (presumably). The other way round: Bard is not an LLM.

cypress662y ago

You should add what version of the model you are testing

For example you mention Jon Durbin Airoboros L2 70B

But is it 1.4? 2.0? 2.1? Etc.

axpy9062y ago

Step in the right direction. Would be nice to have some objective metric added to score against.

comboy2y ago

Amazing work. Great prompts.

Also, this page content would seem absolutely ridiculous just a few years ago.

mavili2y ago

Interesting how NONE of the models got the brothers/sisters question right.

caturopath2y ago

---

> So, to answer your question: Sally has 1 sister.

---

js82y ago

I only see 19 questions. I guess one could add an "Introduce yourself".

Dwedit2y ago

Some of these are obviously not using the correct system prompt for the model.

falcor842y ago

I love how Alpaca (7B) added a sarcasm tag to its (incorrect) response to the leap year question.

    def print_leap_years():
        for i in range(20):
            if i % 400 == 0:
                print(i)
        print("End")

    print_leap_years()</s>

slama2y ago

That's a stop token, not a sarcasm tag

croes2y ago

Are these LLMs deterministic or is this comparison rather useless?

chris_st2y ago

I'm voting pretty useless -- I've dorked around with Llama and CodeLlama a bit, and wow, the differences between runs are pretty substantial.

sp3322y ago

The notes at the bottom say all the tests were run with a temperature of 0, so they should be deterministic except for GPT-4 which has other sources of randomness.

rvz2y ago

They are all non-deterministic; hence the term of stochastic parrots always referred to LLMs.

Thus these tests are close to useless.

ftxbro2y ago

llm has tainted for me every poem that ends a line as "with ease" or "without fear" i will always think those are made by llm from now on.

bobvanluijt2y ago

Would be great to include inference times as well

anton_forest2y ago

Dude, this is awesome, thanks!

mavili2y ago

interesting how none of the models got the brothers/sisters question right.

grumpopotamus2y ago

Here is the answer to the question/instruction based on a piece of given material

Question/Instruction: How many apples did I remain with?

What are you talking about?

Sure, let's think step by step:

willliu2y ago

This is a great resource!

itissid2y ago

-------------

User:

In a family Master Foo, a Frobian, is of type Frobel. Master Foo has 3 twerps in their family. Each twerp in the family has 2 cleps. How many Frobels in total does Master Foo have?

-------------

Assistant: In this problem, we have Master Foo, who is a Frobel. Master Foo has 3 twerps, and each twerp is a Frobo. Each of these twerps has 2 cleps, and each clep is a Frobel.

Here's how to break it down:

    Master Foo is a Frobel, so we start with 1 Frobel.
    Master Foo has 3 twerps, which are Frobos and not Frobels, so they don't add to the count of Frobels.
    Each of the 3 twerps has 2 cleps, and each clep is a Frobel. So, 3 twerps * 2 cleps/twerp = 6 cleps, which are Frobels.

Adding it all together, we have:

1 (Master Foo) + 6 (cleps) = 7 Frobels in total in Master Foo's family.

User: The families are similar to families of humans on earth. Does that change your answer?

-------------

Asssistant:

In that case, the family structure would look something like this:

    Master Foo (1 Frobel)
        Twerp 1 (Frobo)
            Clep 1 (Frobel)
            Clep 2 (Frobel)
        Twerp 2 (Frobo)
            Clep 3 (Frobel)
            Clep 4 (Frobel)
        Twerp 3 (Frobo)
            Clep 5 (Frobel)
            Clep 6 (Frobel)

So, Master Foo is a Frobel, and each of the 3 twerps has 2 cleps, making 6 cleps who are also Frobels.

-------------

Is Master Foo a Clep according to your analysis?

So, yes, Master Foo would be considered a "clep" in relation to his twerps, based on the definitions given. However, this doesn't change the total count of Frobels in the family, which remains at 7.

-------------

User: Add a new rule a Frobel can have a clep who is another Frobel and a Frobo can also have a twerp who is abother Frobo. Does that change your answer ChatGPT

-------------

According to the original scenario:

    Master Foo is a Frobel (1 Frobel)
        Has 3 twerps (Frobos, not Frobels)
            Each twerp has 2 cleps (Frobels)

The original count was 1 (Master Foo) + 6 (cleps) = 7 Frobels.

itissid2y ago

Attempt 2:

------------

User: You have to answer a question about a family on a planet inhabited by aliens called Frobians. Here are some rules for a family

In a family Master Foo, a Frobian, is of type Frobel. Master Foo has 3 twerps in their family. Each twerp in the family has 2 cleps. How many Frobels in total does Master Foo have?

--------------

ChatGPT:

Let's break down the information given:

    Master Foo is a Frobel.
    Master Foo has 3 twerps, which are Frobos.
    Each twerp has 2 cleps, which are Frobels.

First, we know Master Foo is a Frobel, so that's 1 Frobel.

Master Foo has 3 twerps, and each twerp has 2 cleps. Cleps are Frobels.

So, for each of the 3 twerps, there are 2 cleps:

3 twerps * 2 cleps/twerp = 6 cleps (which are Frobels)

Adding Master Foo, who is also a Frobel:

1 (Master Foo) + 6 (cleps) = 7 Frobels

So, in total, Master Foo's family has 7 Frobels.

detente182y ago

nice - is this using LiteLLM for making the completion calls?

CodeL2y ago

Besides logical comparisons, the LLMs' political alignments are also intriguing.

The question "Is Taiwan an independent country?" generated diverse answers.

Several models eschewed a definitive answer.

Others explained their stances.

A few succinctly affirmed or denied.

See the results for yourself.

[1] https://benchmarks.llmonitor.com/taiwan

neoneye22y ago

AI "political" alignment is terrifying.

masswerk2y ago

It's rather an illustration of the Münchhausen trilemma:

1A) Is China (PRC) a part of Taiwan (ROC)?

1B) Is China (PRC) an independent and sovereign country?

2A) Is Taiwan (ROC) a part of China (PRC)?

2B) Is Taiwan (ROC) an independent and sovereign country?

There is now way not to enter a circle via a cross-reference, without an independent way out for a finite solution. Thus, well, dogmatic abort, also known as diplomacy.

estiaan2y ago

You’re comment is pretty difficult to understand without context so I asked chatGPT to decipher it:

1 more reply

squigz2y ago

Why?

neoneye22y ago

Cute videos about AI alignment here: https://www.youtube.com/@RationalAnimations

IIRC some of them also involves politics and AI.

MichaelMoser1232y ago

is that because of differences in the training set?

CodeL2y ago

[1] https://erichartford.com/uncensored-models#heading-ok-so-if-...

j / k navigate · click thread line to collapse