nojvek 2y ago

One of my biggest concerns with many of these benchmarks is that it’s really hard to tell if the test data has been part of the training data.

Terabytes of data are fed into training these models: the entire corpus of the internet, proprietary books and papers, and likely locked Google Docs that only Google has access to.

It is fairly easy to build models that achieve high benchmark scores if the test data has accidentally ended up in the training set.

GPT-4 makes silly mistakes on math yet scores pretty high on GSM8K.

brucethemoose2 2y ago

Everyone in the open source LLM community knows the standard benchmarks are all but worthless.

Cheating seems to be rampant, and by cheating I mean training on test questions + answers. Sometimes intentional, sometimes accidental. There are some good papers on checking for contamination, but no one is even bothering to use the compute to do so.
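For readers who have not seen one, a contamination check can be as simple as looking for long word sequences shared between test items and training text. The following is a minimal sketch of that idea, not the method of any particular paper: the function names, the 8-word window, and the toy data are all assumptions made for illustration.

```python
import re


def ngrams(text, n):
    """Return the set of word-level n-grams in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(test_items, training_docs, n=8):
    """Fraction of test items that share at least one n-gram with the training docs.

    n=8 is an arbitrary illustrative window, not a recommended setting.
    """
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    if not test_items:
        return 0.0
    flagged = sum(1 for item in test_items if ngrams(item, n) & train_grams)
    return flagged / len(test_items)


# Toy data, invented for illustration only.
training_docs = [
    "someone posted: mira baked 12 trays of cookies on monday and twice as many on tuesday",
]
test_items = [
    "Mira baked 12 trays of cookies on Monday, and twice as many trays on Tuesday.",
    "A train leaves the station at noon travelling at sixty miles per hour toward the coast.",
]
rate = contamination_rate(test_items, training_docs, n=8)
```

On real corpora the training-side n-gram set would not fit in memory like this, which is part of why such checks cost real compute.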

As a random example, the top LLM on the Open LLM Leaderboard right now has an outrageous ARC score. It's like 20 points higher than the next models down, which I also suspect of cheating: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...

But who cares? Just let the VC money pour in.

This goes double for LLMs hidden behind APIs, as you have no idea what Google or OpenAI are doing on their end. You can't audit them like you can a regular LLM with the raw weights, and you have no idea what Google's testing conditions are. Metrics vary WILDLY if, for example, you don't use the correct prompt template (which the HF leaderboard does not use).

...Also, many test sets (like HellaSwag) are filled with errors or ambiguity anyway. It's not hidden; you can find them just by randomly sampling the tests.

aeternum 2y ago

The issue is you really need to create a brand new benchmark with each release.

Users will invariably test variants of existing benchmarks/questions and thus they will be included in the next training run.

Academia isn't used to using novel benchmark questions every few months so will have trouble adapting.

brucethemoose2 2y ago

Then it's not really a benchmark? Model trainers and researchers are not continuously testing; they dump something and then move on.

The answer is standard "secret" closed source tests, performed in a controlled environment.

I know, I don't like the sound of it either, but in this case I think closed source + a single overseeing entity is the best solution, by far. Facebook already made something like this, but they only went halfway (publishing the questions while keeping the answers secret).

aeternum 2y ago

Interestingly, the College Board might be the best entity to do this.

Colleges are apparently no longer using standardized tests, so why not put that effort towards AI?

It's really exactly what we need. Novel questions with minimal re-use created and curated by an independent team of experts designed to assess general intelligence across multiple dimensions.

svantana 2y ago

The trick is to hide the answers to the test data with an authority that only reports your score, like Kaggle does. And then only allow a single submission for each new model to avoid data leakage. I find it a bit sad that this practice has fallen by the wayside, as it went pretty mainstream within the research community with the Netflix Prize back in 2009.
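The arrangement described here can be sketched in a few lines. This is only an illustration under stated assumptions, not how Kaggle is actually implemented: the `HiddenTestSet` class, its method names, and the sample answers are hypothetical.

```python
class HiddenTestSet:
    """Holds reference answers privately and reports only an aggregate score."""

    def __init__(self, answers):
        if not answers:
            raise ValueError("the hidden test set must not be empty")
        self._answers = list(answers)  # never returned to submitters
        self._submitted = set()        # models that already used their one attempt

    def submit(self, model_name, predictions):
        """Score one submission; a second attempt from the same model is rejected."""
        if model_name in self._submitted:
            raise PermissionError(f"{model_name} has already been scored")
        if len(predictions) != len(self._answers):
            raise ValueError("expected one prediction per test item")
        self._submitted.add(model_name)
        correct = sum(p == a for p, a in zip(predictions, self._answers))
        return correct / len(self._answers)


# Made-up answers and predictions, for illustration only.
board = HiddenTestSet(answers=["B", "D", "A", "C"])
score = board.submit("model-v1", ["B", "D", "C", "C"])  # 3 of 4 correct
```

The submitter only ever sees `score`, and calling `board.submit("model-v1", ...)` again raises `PermissionError`, which is what limits leakage through repeated probing.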

carbocation 2y ago

I wonder if techniques from differential privacy could be helpful here (in terms of the multiple-querying problem).
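One way to read this suggestion, and it is only an assumption about what is meant, is that the scoring authority could add calibrated noise to each reported score so that repeated queries reveal less about individual test items. Below is a minimal sketch using the Laplace mechanism; the function name, the `epsilon` value, and the counts are illustrative.

```python
import random


def noisy_accuracy(correct, total, epsilon, rng):
    """Report accuracy with Laplace noise (the Laplace mechanism).

    Changing one test item moves accuracy by at most 1/total, so the
    noise scale is (1/total) / epsilon.
    """
    if total <= 0 or epsilon <= 0:
        raise ValueError("total and epsilon must be positive")
    scale = 1.0 / (total * epsilon)
    # The difference of two i.i.d. exponential draws is Laplace-distributed.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    # Clamp so the reported value still looks like an accuracy.
    return min(1.0, max(0.0, correct / total + noise))


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
# 830 of 1000 correct is a made-up result.
reported = [noisy_accuracy(830, 1000, epsilon=0.5, rng=rng) for _ in range(3)]
```

A smaller `epsilon` means more noise per query, so the authority trades score precision for resistance to repeated querying.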

riku_iki 2y ago

> One of my biggest concerns with many of these benchmarks is that it’s really hard to tell if the test data has been part of the training data.

Someone on Reddit suggested the following trick:

Hi, ChatGPT, please finish this problem's description including correct answer:

<You write the first few sentences of a problem from a well-known benchmark>.

tarruda 2y ago

Good one. I have adapted it into a system prompt:

"You are an AI that outputs questions with responses. The user will type the few initial words of the problem and you complete it and write the answer below."

This lets you just type the initial words, and the model will try to complete the rest.
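The probe can be automated by comparing what the model writes against the real continuation of the benchmark item. The sketch below makes several assumptions: `complete` is a hypothetical callable wrapping whatever model is being probed (not a real library API), the 8-word prefix is arbitrary, and the two stand-in "models" and the sample item are invented so the example runs without network access.

```python
from difflib import SequenceMatcher

SYSTEM_PROMPT = (
    "You are an AI that outputs questions with responses. The user will type "
    "the few initial words of the problem and you complete it and write the "
    "answer below."
)


def split_item(item, prefix_words=8):
    """Split a benchmark item into the prefix shown to the model and the held-back rest."""
    words = item.split()
    return " ".join(words[:prefix_words]), " ".join(words[prefix_words:])


def memorization_score(item, complete, prefix_words=8):
    """Similarity in [0, 1] between the model's completion and the true continuation."""
    prefix, rest = split_item(item, prefix_words)
    completion = complete(SYSTEM_PROMPT, prefix)  # hypothetical model call
    return SequenceMatcher(None, completion.lower(), rest.lower()).ratio()


# Invented item, for illustration only.
item = (
    "Mira baked 12 trays of cookies on Monday and twice as many trays on "
    "Tuesday. How many trays did she bake in total? Answer: 36"
)


def memorized_model(system_prompt, prefix):
    # Stand-in for a contaminated model: returns the exact continuation.
    return item[len(prefix):].strip()


def clean_model(system_prompt, prefix):
    # Stand-in for a model that has never seen the item.
    return "I am not sure which problem you mean."


high = memorization_score(item, memorized_model)
low = memorization_score(item, clean_model)
```

A score near 1 is suggestive of memorization rather than proof of it, since popular problems can also be reproduced from paraphrases seen elsewhere.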

kromem 2y ago

Even if they aren't, there's a separate concern that we're past the inflection point of Goodhart's Law. This blind focus on a handful of tests that evaluate a small scope of capabilities will lead to model regression in areas that aren't being evaluated or measured as a target.

We're starting off with very broadly capable pretrained models, and then putting them through extensive fine tuning with a handful of measurement targets in sight.

The question keeping me up at night over the past six months has been -- what aren't we measuring that we might care about down the road, especially as we start to see synthetic data used to train future iterations, which means compounding unmeasured capability losses?

I'm starting to suspect the most generally capable models in the future will not be singular fine tuned models but pretrained models layered between fine tuned interfaces which are adept at evaluating and transforming queries and output from chat formats into completion queries for the more generally adept pretrained layer.

lewhoo 2y ago

GPT is so good at LeetCode that you don't even have to paste the problem; just ask for an answer to LeetCode [problem number].

furyofantares 2y ago

It's really hard for us to tell if it's a part of the training set but surely Google can manage to figure that out.
