GPT-4o's Memory Breakthrough – Needle in a Needlestack (opens in new tab)

(nian.llmonpy.ai)

478 pointsparrt2y ago239 comments

239 comments

This is based on a limericks dataset published in 2021. https://zenodo.org/records/5722527

I think it very likely that gpt-4o was trained on this. I mean, why would you not? Innnput, innnput, Johnny five need more tokens.

I wonder why the NIAN team don't generate their limericks using different models, and check to make sure they're not in the dataset? Then you'd know the models couldn't possibly be trained on them.

sftombu2y ago

I tested the LLMs to make sure they could not answer the questions unless the limerick was given to them. Other than 4o, they do very badly on this benchmark, so I don't think the test is invalidated by their training.

cma2y ago

Why wouldn't it still be invalidated by it if it was indeed trained on it? The others may do worse and may or may not have been trained on it, but them failing on ititself doesn't imply 4o can do this well without the task being present in the corpus.

djsjajah2y ago

A better test would be to see if it can still answer the question if you just exclude the limerick for that answer. Having a bunch of limericks in the context window will make it "think" about all of the limericks it "knows".

sftombu2y ago

It can't answer the questions without the limericks in the prompt. The benchmark is to establish how well it uses the context window. For example, I just asked it "What is sought by the English top brass?". The answer from the limerick is "Cranberry glass" and 4o answers correctly when given the associated limerick once out of 2500+ limericks.

However, without the limerick, 4o responded with: "The term "English top brass" typically refers to high-ranking officials or leaders within the British government, military, or other institutions. What they seek can vary widely depending on the context and the specific goals of their roles. Here are some general pursuits that might be sought by such individuals:

National Security: Ensuring the safety and security of the United Kingdom from internal and external threats is a primary concern. This involves defense strategies, intelligence operations, and counter-terrorism efforts.

Economic Stability: High-ranking officials often focus on policies and initiatives aimed at maintaining and improving the country’s economic health. This includes managing inflation, unemployment, trade relations, and economic growth.

Political Influence: Top brass often seek to maintain or expand their influence both domestically and internationally. This can involve diplomacy, forming alliances, and participating in international organizations like the United Nations or NATO.

Social Cohesion: Ensuring social stability and addressing issues such as inequality, healthcare, education, and social services are critical. This can involve implementing policies that promote social welfare and cohesion.

Public Policy Implementation: Leaders are responsible for developing and implementing policies that reflect the government’s priorities. This includes legislation, regulatory frameworks, and public administration.

Technological Advancement: Keeping the nation at the forefront of technological innovation is often a priority. This includes investments in research and development, supporting tech industries, and ensuring cybersecurity.

Environmental Sustainability: Addressing climate change and promoting sustainable practices are increasingly important. This includes policies aimed at reducing carbon emissions, protecting natural resources, and transitioning to renewable energy sources.

Cultural and Heritage Preservation: Protecting and promoting the country’s cultural heritage and national identity can also be a focus. This includes supporting the arts, preserving historical sites, and promoting cultural initiatives.

These pursuits are shaped by the current political climate, global trends, and the specific priorities of the leaders in question. Would you like more detailed information on any of these areas?"

4 more replies

dontupvoteme2y ago

It would be interesting to know how it acts if you ask it about one that isn't present, or even lie to it (e.g. take a limerick that is present but change some words and ask it to complete it)

Maybe some models hallucinate or even ignore your mistake vs others correcting it (depending on the context ignoring or calling out the error might be the more 'correct' approach)

Using limericks is a very nifty idea!

neverokay2y ago

Why not just generate complete random stuff and ask it to find stuff in that?

Kostchei2y ago

We have run that test.- generate random string(not by llm) names of values- ask the llm to do math (algebra) using those strings. Tests logic, 100% not in the data set GPT2 was like 50% accurate, now we up around the 90%.

dontupvoteme2y ago

NIAN is a very cool idea, but why not simply translate it into N different languages (you even can mix services, e.g. deepl/google translate/LLMs themselves) and ask about them that way?

internet1010102y ago

No disassemble!

bearjaws2y ago

I just used it to compare two smaller legal documents and it completely hallucinated that items were present in one and not the other. It did this on three discrete sections of the agreements.

Using ctrl-f I was able to see that they were identical in one another.

Obviously this is a single sample but saying 90% seems unlikely. They were around ~80k tokens total.

carlosbaraza2y ago

I have the same feeling. I asked to find duplicates in a list of 6k items and it basically hallucinated the entire answer multiple times. Some times it finds some, but it interlaces the duplicates with other hallucinated items. I wasn't expecting it to get it right, cause I think this task is challenging with a fixed amount of attention heads. However, the answer seems much worse than Claude Opus or GPT-4.

akomtu2y ago

Everyone is trying to use Language Models as Reasoning Models because the latter haven't been invented yet.

fnordpiglet2y ago

That’s not needle in a haystack.

I would note that LLMs handle this task better if you slice the two documents into smaller sections and iterate section by section. They aren’t able to reason and have no memory so can’t structurally analyze two blobs of text beyond relatively small pieces. But incrementally walking through in much smaller pieces that are themselves semantically contained and related works very well.

The assumption that they are magic machines is a flawed one. They have limits and capabilities and like any tool you need to understand what works and doesn’t work and it helps to understand why. I’m not sure why the bar for what is still a generally new advance for 99.9% of developers is effectively infinitely high while every other technology before LLMs seemed to have a pretty reasonable “ok let’s figure out how to use this properly.” Maybe because they talk to us in a way that appears like it could have capabilities it doesn’t? Maybe it’s close enough sounding to a human that we fault it for not being one? The hype is both overstated and understated simultaneously but there have been similar hype cycles in my life (even things like XML were going to end world hunger at one point).

HarHarVeryFunny2y ago

That's a different test than needle-in-a needlestack, although telling in how brittle these models are - competent in one area, and crushingly bad in others.

Needle-in-a-needlestack contrasts with needle-in-a-haystack by being about finding a piece of data among similar ones (e.g. one specific limeric among thousands of others), rather than among disimilar ones.

1970-01-012y ago

I've done the same experiment with local laws and caught GPT hallucinating fines and fees! The problem is real.

tmaly2y ago

Imagine if they started using LLMs to suggest prison sentences

Aerbil3132y ago

Interesting, because the (at least the official) context window of GPT-4o is 128k.

davedx2y ago

> Obviously this is a single sample but saying 90% seems unlikely.

This is such an anti-intellectual comment to make, can't you see that?

You mention "sample" so you understand what statistics is, then in the same sentence claim 90% seems unlikely with a sample size of 1.

The article has done substantial research

dkjaudyeqooe2y ago

That fact that it has some statistically significant performance is irrelevant and difficult to evaluate for most people.

He's a much simpler and correct description that almost everyone can understand: it fucks up constantly.

Getting something wrong even once can make it useless for most people. No amount of pedantry will change this reality.

davedx2y ago

What on earth? The experimental research demonstrates that it doesn't "fuck up constantly", you're just making things up. The various performance metrics people around the world to measure and compare model performance is not irrelevant because you, some random internet commenter, claim so without any evidence.

This isn't pedantry, it's science.

lopuhin2y ago

And also article is testing on a different task (Needle in a Needlestack which is kind of similar to Needle in a Haystack), compared to finding a difference between two documents. For sure it's useful to know that the model does ok in one and really bad in the other, does not mean that original test is flawed.

bckr2y ago

Yeah I asked for an estimate of the percentage of the US population that lives in the DMV area (DC, Maryland, Virginia) and it was off by 50% of the actual answer, which I only realized when I realized I shouldn’t trust its estimate for anything important

KeplerBoy2y ago

Those models still can't reliably do arithmetic, so how could it possibly know that number unless it's a commonly repeated fact?

Also: would you expect random people to fare any better?

bckr2y ago

It used web search (RAG over the entire web) and analysis (math tool) and still came up with the wrong answer.

It has done more complex things for me than this and, sometimes, gotten it right.

Yes, it’s supposed to be able to do this.

chrischen2y ago

Arithmetic just happens to be something we can easily and reliably verify, so it becomes painfully obvious when LLMs are just stringing together some words that sound like the right answer.

kylebenzle2y ago

What you are asking an llm to do here makes no sense.

potatoman222y ago

Why not? It seems like a natural language understanding task

marshray2y ago

You haven't seen the promotion of the use of LM AI for handling legal documents?

It's purported to be a major use case.

cmrdporcupine2y ago

You might be right but I've lost count of the number of startups I've heard of trying to do this for legal documents.

thorum2y ago

The needle in the haystack test gives a very limited view of the model’s actual long context capabilities. It’s mostly used because early models were terrible at it and it’s easy to test. In fact, most recent models now do pretty good at this one task, but in practice, their ability to do anything complex drops off hugely after 32K tokens.

RULER is a much better test:

https://github.com/hsiehjackson/RULER

> Despite achieving nearly perfect performance on the vanilla needle-in-a-haystack (NIAH) test, all models (except for Gemini-1.5-pro) exhibit large degradation on tasks in RULER as sequence length increases.

> While all models claim context size of 32k tokens or greater (except for Llama3), only half of them can effectively handle sequence length of 32K by exceeding a qualitative threshold, Llama2-7b performance at 4K (85.6%). The performance exceeding the threshold is underlined.

WhitneyLand2y ago

Maybe, but

1. The article is not about NIHS it’s their own variation so it could be more relevant.

2. The whole claim of the article is that Gpt4o does better, but the test your pointing to hasn’t benchmarked it.

sftombu2y ago

The models benchmarked by RULER do worse in needle in a needlestack. It will be interested to see how 4o does with RULER.

19h2y ago

I'd like to see this for Gemini Pro 1.5 -- I threw the entirety of Moby Dick at it last week, and at one point all books Byung Chul-Han has ever published, and it both cases it was able to return the single part of a sentence that mentioned or answered my question verbatim, every single time, without any hallucinations.

nsagent2y ago

A number of people in my lab do research into long context evaluation of LLMs for works of fiction. The likelihood is very high that Moby Dick is in the training data. Instead the people in my lab have explored recently published books to avoid these issues.

See BooookScore (https://openreview.net/forum?id=7Ttk3RzDeu) which was just presented at ICLR last week and FABLES (https://arxiv.org/abs/2404.01261) a recent preprint.

theptip2y ago

I suppose the question then is - if you finetune on your own data (eg internal wiki) does it then retain the near-perfect recall?

Could be a simpler setup than RAG for slow-changing documentation, especially for read-heavy cases.

k__2y ago

"if you finetune on your own data (eg internal wiki) does it then retain the near-perfect recall"

No, that's one of the primary reasons for RAG.

1 more reply

robbiep2y ago

I’m not involved in the space, but it seems to me that having a model, in particular a massive model, exposed to a corpus of text like a book in the training data would have very minimal impact. I’m aware that people have been able to return data ‘out of the shadows’ pf the training data but to my mind a model being mildly influenced by the weights between different words in this text hardly constitute hard recall, if anything it now ‘knows’ a little of the linguistic style of the authour.

How far off am I?

int_19h2y ago

It depends on how many times it had seen that text during training. For example, GPT-4 can reproduce ayats from the Quran word for word in both Arabic and English. It can also reproduce the Navy SEAL copypasta complete with all the typos.

2 more replies

Salgat2y ago

Remember, it's also trained on countless internet discussions and papers on the book.

westurner2y ago

HN post re: FABLES: https://news.ycombinator.com/item?id=39982362

FABLES/booklist.md: https://github.com/mungg/FABLES/blob/main/booklist.md

/gscholar_citations? BoookScore: https://scholar.google.com/scholar?cites=1796862036168524911...

...

From that one day awhile ago: https://news.ycombinator.com/item?id=38347868#38354679 :

> "LLMs cannot find reasoning errors, but can correct them" [ https://arxiv.org/abs/2311.08516 ] https://news.ycombinator.com/item?id=38353285

Fernicia2y ago

But this content is presumably in its training set, no? I'd be interested if you did the same task for a collection of books published more recently than the model's last release.

19h2y ago

To test this hypothesis, I just took the complete book "Advances in Green and Sustainable Nanomaterials" [0] and pasted it into the prompt, asking Gemini: "What absorbs thermal radiations and converts it into electrical signals?".

It replied: "The text indicates that graphene sheets present high optical transparency and are able to absorb thermal radiations with high efficacy. They can then convert these radiations into electrical signals efficiently.".

Screenshot of the PDF with the relevant sentence highlighted: https://i.imgur.com/G3FnYEn.png

[0] https://www.routledge.com/Advances-in-Green-and-Sustainable-...

jiggawatts2y ago

Ask it what material absorbs “infrared light” efficiently.

To me, that’s useful intelligence. I can already search text for verbatim matches, I want the AI to understand that “thermal radiations” and “infrared light” are the same thing.

2 more replies

kaibee2y ago

Honestly I think testing these on fiction books would be more impressive. The graphene thing I'm sure shows up in some research papers.

a_wild_dandan2y ago

Gemini works with brand new books too; I've seen multiple demonstrations of it. I'll try hunting one down. Side note: this experiment is still insightful even using model training material. Just compare its performance with the uploaded book(s) to without.

ben_w2y ago

I would hope that Byung-Chul Han would not be in the training set (at least not without his permission), given he's still alive and not only is the legal question still open but it's also definitely rude.

This doesn't mean you're wrong, though.

sebzim45002y ago

It's pretty easy to confirm that copywritten material is in the training data. See the NYT lawsuit against OpenAI for example.

1 more reply

DominikPeters2y ago

Just put the 2500 example linked on the article through Gemini 1.5 Flash and it answered correctly ("The tree has diseased leaves and its bark is peeling.") https://aistudio.google.com/

sftombu2y ago

Interesting!

parrtOP2y ago

Wow. Cool. I have access to that model and have also seen some impressive context extraction. It also gave a really good summary of a large code base that I dumped in. I saw somebody analyze a huge log file, but we really need something like this needle in a needlestack to help identify when models might be missing something. At the very least, this could give model developers something to analyze their proposed models.

19h2y ago

Funnily enough I ran a 980k token log dump against Gemini Pro 1.5 yesterday to investigate an error scenario and it found a single incident of a 429 error being returned by a third-party API provider while reasoning that "based on the file provided and the information that this log file is aggregated of all instances of the service in question, it seems unlikely that a rate limit would be triggered, and additional investigation may be appropriate", and it turned out the service had implemented a block against AWS IPs, breaking a system that loads press data from said API provider, leaving the customer who was affected by it without press data -- we didn't even notice or investigate that, and Gemini just randomly mentioned it without being prompted for that.

parrtOP2y ago

That definitely makes it seem like it's noticing a great deal of its context window. impressive.

causality02y ago

Man, we are like 2-5 years away from being able to feed in an ePub and get an accurate graphic novel version in minutes. I am so ready to look at four thousand paintings of Tolkien trees.

sftombu2y ago

If I had access to Gemini with a reasonable token rate limit, I would be happy to test Gemini. I have had good results with it in other situations.

cj2y ago

What version of Gemini is built into Google Workspace? (I just got the ability today to ask Gemini anything about emails in my work Gmail account, which seems like something that would require a large context window)

underlines2y ago

Such tasks don't need a large context window. Just good RAG.

youssefabdelm2y ago

Someone needs to come up with a "synthesis from haystack" test that tests not just retrieval but depth of understanding, connections, abstractions across diverse information.

When a person reads a book, they have an "overall intuition" about it. We need some way to quantify this. Needle in haystack tests feel like a simple test that doesn't go far enough.

jddj2y ago

An elaborate Agatha Christie style whodunit, with a series of plot-twists and alibis which can be chopped off the end of the piece to modify who is the most likely suspect

jddj2y ago

Or a spot the difference.

Generate 1000 generic facts about Alice and the same 1000 facts about Eve. Randomise the order and change one minor detail then ask how they differ.

youssefabdelm2y ago

That seems to go back in the direction of needle in the haystack again

pushedx2y ago

    sort alice.txt | diff - <(sort eve.txt)

That's not a task for an LLM

2 more replies

visarga2y ago

The needles form a graph and the prompt asks graph based tasks.

sftombu2y ago

That is an interesting idea

Eisenstein2y ago

My idea is to buy to a unpublished novel or screenplay with a detailed, internally consistent world built in to it and a cast of characters that have well crafted motivations and then ask it to continue writing from an arbitrary post-mid-point by creating a new plot line that combines two characters that haven't yet met in the story. If it understands the context it should be able to write a new part of the story and will be able to use a reader's intuitive sense of the character's motivations to move through their arc.

This whole thing would have to be kept under lock-and-key in order to be useful, so it would only serve as a kind of personal benchmark. Or it could possibly be a prestige award that is valued for its conclusions and not for its ability to use the methodology to create improvements in the field.

semi-extrinsic2y ago

Just use memes. People generate new high-quality niche memes so fast it's impossible for the LLMs to keep up.

visarga2y ago

You can only use it for a short while, they get a copy as well.

Eisenstein2y ago

I have been thinking about this for use in evaluating locally run models, so I didn't make that connection in this case. I guess it would have limited utility.

sftombu2y ago

I was thinking about something similar -- to make part of the question be sufficient information that the LLM can find the limerick. Then the 2nd part would ask something that would require a deeper understanding of the limerick (or other text).

borgdefense2y ago

There is no understanding, it can't do this.

GPT4o still can't do the intersection of two different ideas that are not in the training set. It can't even produce random variations on the intersection of two different ideas.

Further though, we shouldn't expect the model to do this. It is not fair to the model and its actual usefulness and how amazing what the models can do with zero understanding. To believe the model understands is to fool yourself.

nebula88042y ago

I wonder if there is some way to have an AI help humans improve their "reading comprehension" aka reasoning across a large body of text. As far as I can tell the only way to do this is to cut out mindless scrolling and force yourself to read a lot of books in the hopes that this skill might be improved.

I am many years out of my grade school years where I was required to read a multitude of novels every year and I guess years of mindless reddit scrolling + focusing on nothing but mathematics and the sciences in college have taken their toll: I read long articles or books but completely miss the deeper meaning.

As an example: my nerd like obsession with random topics of the decade before I was born (until I get bored) caused me to read numerous articles and all of Wikipedia + sources on the RBMK reactors and Chernobyl nuclear accident as well as the stories of the people involved.

But it wasn't until I sat down and watched that famous HBO mini seres that I finally connected the dots of how the lies and secretive nature of the soviet system led to the design flaws in the reactor, and the subsequent suicide of Valery Legasov helped finally expose them to the world where they could no longer be hidden.

Its like I knew of all these events and people separately but could not connect them together to form a deep realization and when I saw it acted out on screen it all finally hit me like a ton of bricks. How had I not seen it?

Hoping one day AI can just scan my existing brain structure and recommend activities to change the neuronal makeup to what I want it to be. Or even better since im a lazy developer, it should just do it for me.

adamgordonbell2y ago

I've been thinking about that as well.

It's hard, but if you have a piece of fiction or non-fiction it hasn't seen before, then a deep reading comprehension question can be a good indicator. But you need to be able to separate a true answer from BS.

"What does this work says about our culture? Support your answer with direct quotes."

I found both gpt-4 and haiku to do alright at this, but sometimes give answers that imply fixating on certain sections of a 20,000 k context. You could compare it against chunking the text, getting the answer for each chunk and combining them.

I suspect if you do that then the chunking would win for things that are found in many chunks, like the work is heavy handed on a theme, but the large context would be better for a sublter message, except sometimes it would miss it altogether and think a Fight Club screenplay was a dark comedy.

Interpretation is hard I guess.

1 more reply

segmondy2y ago

Why can't you be that someone?

gremlinsinc2y ago

lol, made me think of the euphemism: be the change you want to see.

yatz2y ago

Well, I can now use GPT to transform raw dynamic data into beautiful HTML layouts on the fly for low-traffic pages, such as change/audit logs, saving a ton of development time and keeping my HTML updated even when the data structure has changed. My last attempt did not consistently work because GPT4-Turbo sometimes ignored the context and instructions almost entirely.

ijidak2y ago

Do you have an example of this? I would love to learn more.

yatz2y ago

Here is the entire prompt. I used rules to ensure the formatting is consistent as otherwise sometimes it might format date one way and other times in an entirely different way.

Imagine, a truly dynamic and super personal site, where layout, navigation, styling and everything else gets generated on the fly using user's usage behavior and other preferences, etc. Man! ---------------------------------------------

{JSON} ------ You are an auditing assistant. Your job is to convert the ENTIRE JSON containing "Order Change History" into a human-readable Markdown format. Make sure to follow the rules given below by letter and spirit. PLEASE CONVERT THE ENTIRE JSON, regardless of how long it is. --------------------------------------------- RULES: - Provide markdown for the entire JSON. - Present changes in a table, grouped by date and time and the user, i.e., 2023/12/11 12:40 pm - User Name. - Hide seconds from the date and time and format using the 12-hour clock. - Do not use any currency symbols. - Format numbers using 1000 separator. - Do not provide any explanation, either before or after the content. - Do not show any currency amount if it is zero. - Do not show IDs. - Order by date and time, from newest to oldest. - Separate each change with a horizontal line.

balder19912y ago

I guess you just need to offer a template in the prompt? Then maybe some validation after.

yatz2y ago

No templates, just some rules and the model does the rest. It worked like a charm, even gave me ideas on how to layout and format the page to make it easy to read.

parrtOP2y ago

The article shows how much better GPT-4o is at paying attention across its input window compared to GPT-4 Turbo and Claude-3 Sonnet.

We've needed an upgrade to needle in a haystack for a while and this "Needle In A Needlestack" is a good next step! NIAN creates a prompt that includes thousands of limericks and the prompt asks a question about one limerick at a specific location.

mianos2y ago

I agree, I paid for Claude for a while. Even though they swear the context is huge and having a huge context uses up tokens like crack, it's near useless when source code in context just a few pages back. It was so frustrating as everything else was as good as anything and I liked the 'vibe'.

I used 4o last night and it was still perfectly aware of a C++ class I pasted 20 questions ago. I don't care about smart, I care about useful and this really contributes to the utility.

whimsicalism2y ago

Increasingly convinced that nobody on the public internet knows how to do actual LLM evaluations.

tedeh2y ago

I'm just glad that we are finally past the "Who was the 29th president of the United States" and "Draw something in the style of Van Gogh" LLM evaluation test everyone did in 2022-2023.

petulla2y ago

You need to know that this test set data wasn't included in the training data for this to be meaningful.

sftombu2y ago

If you ask the questions without providing the limerick first, it never gets the right answer. When the LLM gets the wrong answer, it is usually because it reverts to its training data and gives a generic answer that doesn't apply to the limerick.

trifurcate2y ago

Why are you ruling out the possibility that training on the material may confer an advantage when the data is presented, even if the advantage may not be strong enough to pass the test without the data present in the context window?

a_wild_dandan2y ago

No you don't. Compare the model's performance before and after uploading the material.

sftombu2y ago

Previous answer to this question:

https://news.ycombinator.com/item?id=40361419s

sumedh2y ago

No such item.

lmeyerov2y ago

I thought the test limericks were autogenerated?

sftombu2y ago

They come from a database of 98k limericks -- https://zenodo.org/records/5722527

personjerry2y ago

That's great to hear. My biggest issue with GPT-4.0 was that as the conversation got longer, the quality diminished (especially relevant for coding projects)

I wonder if it'll be better now. Will test today.

throwthrowuknow2y ago

That’s been my experience so far. My current conversations are crazy long compared to any of my gpt4 convos which I had to frequently copy context from and start over in a new chat

sftombu2y ago

I had the same experience. With a 16k prompt, Turbo was nearly flawless. But it wasn't very good at 32k and not usable at 100+. You have to repeat information to get good results with longer prompts

itissid2y ago

How Do we know that gpt-4o.has not been trained on this dataset?

sftombu2y ago

Previous answer to this question:

https://news.ycombinator.com/item?id=40361419

throwthrowuknow2y ago

This is a very promising development. It would be wise for everyone to go back and revise old experiments that failed now that this capability is unlocked. It should also make RAG even more powerful now that you can load a lot more information into the context and have it be useful.

demilich2y ago

Agreed

feverzsj2y ago

LLMs are still toys, no one should treat them seriously. Apparently, the bubble is too massive now.

infecto2y ago

We have businesses getting real value from these toys. Maybe you have not been in the right circles to experience this?

feverzsj2y ago

Of course you can get value from toy business, but toys are toys.

1 more reply

nopromisessir2y ago

Used toys to write a working machine vision project over last 2 days.

Key word: working

The bubble is real on both sides. Models have limitations... However, they are not toys. They are powerful tools. I used 3 different SotA models for that project. The time saved is hard to even measure. It's big.

SiempreViernes2y ago

> The time saved is hard to even measure. It's big.

You are aware that this is an obvious contradiction, right? Big times savings are not hard to measure.

nopromisessir2y ago

Right... With precision...

Furthermore... big mountains are easier to weigh v small individual atoms? I think it's a little more complicated than big is easy to measure...

I care little about the precision... I've got other priorities. It's the same as the time the internet saves me... Big. It's obvious.

I stand by my statement. It's hard to measure...

cdelsolar2y ago

Must be a pretty cool toy; it constantly 10X’s my productivity.

nopromisessir2y ago

You said it mate. I feel bad for folks who turn away from this technology. If they persist... They will be so confused why they get repeatedly lapped.

I wrote a working machine vision project in 2 days with these toys. Key word: working... Not hallucinated. Actually working. Very useful.

SubiculumCode2y ago

My daughter berated me for using AI (the sentiment among youth is pretty negative, and it is easy to understand why), but I simply responded, "if I don't my peers still will, then we'll be living on the street." And it's true, I've 10x'd my real productivity as a scientist (for example, using llms to help me code one off scripts for data munging, automating our new preprocessing pipelines, etc, quickly generating bullet points for slides).

The trick though is learning how to prompt, and developing the sense that the LLM is stuck with the current prompt and needs another perspective. Funnily enough, the least amount of luck I've had is getting the LLM to write precisely enough for science (yay I still have a job), even without the confabulation, the nuance is lacking...that it's almost always faster for me to write it myself.

1 more reply

davedx2y ago

I just don't understand why AI is so polarising on a technology website.

OpenAI have even added a feature to make the completions from GPT near-deterministic (by specifying a seed). It seems that no matter what AI companies do, there will be a vocal minority shouting that it's worthless.

1 more reply

SiempreViernes2y ago

Without details that's a meaningless stat, I remember some pytorch machine vision tutorials promising they'll only take like an hour, including training and also gives a working project at the end.

davedx2y ago

It's staggering to me that people on Hacker News are actually downvoting people saying how AI is boosting productivity or levering business or engineering or finance. The denial, cynicism and sheer wilful ignorance is actually depressing. I get that not everyone is working directly with AI/ML but I honestly expected better on a website about technology.

People are deliberately self selecting themselves out of the next industrial revolution. It's Darwin Awards for SWE careers. It's making me ranty.

sschueller2y ago

We are all so majorly f*d.

The general public does not know nor understand this limitation. At the same time OpenAI is selling this a a tutor for your kids. Next it will be used to test those same kids.

Who is going to prevent this from being used to pick military targets (EU law has an exemption for military of course) or make surgery decisions?

kromokromo2y ago

This is just doomerism. Even though this model is slightly better than the previous, using an LLM for high risk tasks like healthcare and picking targets in military operations still feels very far away. I work in healthcare tech in a European country and yes we use AI for image recognition on x-rays, retinas etc but these are fundamentally completely different models than a LLM.

Using LLMs for picking military targets is just absurd. In the future, someone might use some other variation of AI for this but LLMs are not very effective on this.

dbspin2y ago

AI is already being used for picking targets in warzones - https://theconversation.com/israel-accused-of-using-ai-to-ta....

LLM's will of course also be used, due to their convenience and superficial 'intelligence', and because of the layer of deniability creating a technical substrate between soldier and civilian victim provides - as has happened for two decades with drones.

throwthrowuknow2y ago

Why? There are many other types of AI or statistical methods that are easier, faster and cheaper to use not to mention better suited and far more accurate. Militaries have been employing statisticians since WWII to pick targets (and for all kinds of other things) this is just current-thing x2 so it’s being used to whip people into a frenzy.

2 more replies

mike_hearn2y ago

Note that the IDF explicitly denied that story:

https://www.idf.il/en/mini-sites/hamas-israel-war-24/all-art...

Probably this is due to confusion over what the term "AI" means. If you do some queries on a database, and call yourself a "data scientist", and other people who call themselves data scientists do some AI, does that mean you're doing AI? For left wing journalists who want to undermine the Israelis (the story originally appeared in the Guardian) it'd be easy to hear what you want to hear from your sources and conflate using data with using AI. This is the kind of blurring that happens all the time with apparently technical terms once they leave the tech world and especially once they enter journalism.

4 more replies

wolfd2y ago

It’s absurd but LLMs for military targets is absolutely something that some companies are trying to sell regardless of the many known failure modes.

https://www.bloomberg.com/news/newsletters/2023-07-05/the-us...

https://youtu.be/XEM5qz__HOU

goopthink2y ago

I also work in healthtech, and nearly every vendor we’ve evaluated in the last 12 months has tacked on ChatGPT onto their feature set as an “AI” improvement. Some of the newer startup vendors are entirely prompt engineering with a fancy UI. We’ve passed on most of these but not all. And these companies have clients, real world case studies. It’s not just not very far away, it is actively here.

lhoff2y ago

>Using LLMs for picking military targets is just absurd. In the future

I guess the future is now then: https://www.theguardian.com/world/2023/dec/01/the-gospel-how...

Excerpt:

>Aviv Kochavi, who served as the head of the IDF until January, has said the target division is “powered by AI capabilities” and includes hundreds of officers and soldiers.

>In an interview published before the war, he said it was “a machine that produces vast amounts of data more effectively than any human, and translates it into targets for attack”.

>According to Kochavi, “once this machine was activated” in Israel’s 11-day war with Hamas in May 2021 it generated 100 targets a day. “To put that into perspective, in the past we would produce 50 targets in Gaza per year. Now, this machine produces 100 targets a single day, with 50% of them being attacked.”

agos2y ago

nothing in this says they used an LLM

2 more replies

coldtea2y ago

>Using LLMs for picking military targets is just absurd

You'd be surprised.

Not to mention it's also used for military and intelligence "analysis".

>using an LLM for high risk tasks like healthcare and picking targets in military operations still feels very far away

When immaturity and unfitness for purpose has ever stopped companies selling crap?

exe342y ago

> picking targets in military operations

I'm 100% on the side of Israel having the right to defend itself, but as I understand it, they are already using "AI" to pick targets, and they adjust the threshold each day to meet quotas. I have no doubt that some day they'll run somebody's messages through chat gpt or similar and get the order: kill/do not kill.

mlnj2y ago

'Quotas each day to find targets to kill'.

That's a brilliant and sustainable strategy. /s

ExoticPearTree2y ago

I use ChatGPT in particular to narrow down options when I do research, and it is very good at this. It wouldn't be far-fetched to feed it a map, traffic patterns and ask it to do some analysis of "what is the most likeliest place to hit"? And then take it from there.

currymj2y ago

i don't know about European healthcare but in the US, there is this huge mess of unstructured text EMR and a lot of hope that LLMs can help 1) make it easier for doctors to enter data, 2) make some sense out of the giant blobs of noisy text.

people are trying to sell this right now. maybe it won't work and will just create more problems, errors, and work for medical professionals, but when did that ever stop hospital administrators from buying some shiny new technology without asking anyone.

CWuestefeld2y ago

I hear these complaints and can't see how this is worse than the pre-AI situation. How is an AI "hallucination" different from human-generated works that are just plain wrong, or otherwise misleading?

Humans make mistakes all the time. Teachers certainly did back when I was in school. There's no fundamental qualitative difference here. And I don't even see any evidence that there's any difference in degree, either.

UncleMeat2y ago

"Sorry, computer says no."

Humans can be wrong, but they aren't able to be wrong at as massive of a scale and they often have an override button where you can get them to look at something again.

When you have an AI deployed system and full automation you've got more opportunities for "I dunno, the AI says that you are unqualified for this job and there is no way around that."

We already see this with less novel forms of automation. There are great benefits here, but also the number of times people are just stymied completely by "computer says no" has exploded. Expect that to increase further.

skywhopper2y ago

Because people know they make mistakes, and aren’t always 100% certain and are capable of referring you to other people. Also because the mistakes LLMs make are entirely unlike mistakes humans make. Humans don’t generate fake URLs citing entirely fake references. Humans don’t apologize when corrected and then re-assert the same mistake. Also because we know that people aren’t perfect and we don’t expect them to be infallible, humans can break out of their script and work around the process that’s been encoded in their computers.

But most people do expect computers to be infallible, and the marketing hype for LLMs is that they are going to replace all human intellectual labor. Huge numbers of people actually believe that. And if you could convince an LLM it was wrong (you can’t, not reliably), it has no way around the system it’s baked into.

All of these things are really really dangerous, and just blithely dismissing it as “humans make mistakes, too, lol” is really naive. Humans can decide not to drop a bomb or shoot a gun if they see that their target isn’t what they expect. AIs never will.

CWuestefeld2y ago

Pretty much every element of the above statements is false. Heck, either your response to me, or this reply, seem to be examples showing that the first one is wrong.

Sophira2y ago

Society has spent literal decades being convinced to put their trust in everything computers do. We're now at the point that, in general, that trust is there and isn't misplaced.

However, now that computers can plausibly do certain tasks that they couldn't before via LLMs, society has to learn that this is an area of computing that can't be trusted. That might be easy for more advanced users who already don't trust what corporations are doing with technology[0], but for most people this is going to be a tall order.

[0] https://i.imgur.com/6wbgy2L.jpeg

lnxg33k12y ago

Probably the main difference is that humans fail at smaller scale, with smaller effects, and build a reputation, probably chatgpt hallucinations can potentially affect everyone

moralestapia2y ago

Humans know when they've made a mistake. So there's ways to deal with that.

Computers are final. You don't want things to be final when your life's on the line.

olddustytrail2y ago

> Humans know when they've made a mistake.

You'll never make senior management with that attitude. At worst, "mistakes were made" and look a bit sad.

unclebucknasty2y ago

>There's no fundamental qualitative difference here...degree either.

I've heard the same comparisons made with self-driving cars (i.e. that humans are fallible, and maybe even more error-prone).

But this misses the point. People trust the fallibility they know. That is, we largely understand human failure modes (errors in judgement, lapses in attention, etc) and feel like we are in control of them (and we are).

OTOH, when machines make mistakes, they are experienced as unpredictable and outside of our control. Additionally, our expectation of machines is that they are deterministic and not subject to mistakes. While we know bugs can exist, it's not the expectation. And, with the current generation of AI in particular, we are dealing with models that are generally probabilistic, which means there's not even the expectation that they are errorless.

And, I don't believe it's reasonable to expect people to give up control to AI of this quality, particularly in matters of safety or life and death; really anything that matters.

TLDR; Most people don't want to gamble their lives on a statistic, when the alternative is maintaining control.

chaorace2y ago

Expanding on this, human failures and machine failures are qualitatively different in ways that make our systems generally less resilient against the machine variety, even when dealing with a theoretically near-perfect implementation. Consider a bug in an otherwise perfect self-driving car routine that causes crashes under a highly specific scenario -- roads are essentially static structures, so you've effectively concentrated 100% of crashes into (for example) 1% of corridors. Practically speaking, those corridors would be forced into a state of perpetual closure.

This is all to say that randomly distributed failures are more tolerable than a relatively smaller number of concentrated failures. Human errors are rather nice by comparison because they're inconsistent in locality while still being otherwise predictable in macroscopic terms (e.g.: on any given day, there will always be far more rear-endings than head-on collisions). When it comes to machine networks, all it takes is one firmware update for both the type & locality of their failure modes to go into a wildly different direction.

tifik2y ago

What you say is true, and I agree, but that is the emotional human side of thinking. Purely logically, it would nake sense to compare the two systems of control and use the one with fewer human casualities. Not saying its gonna happen, just thinking that reason and logic should take precedent, no matter what side you are on.

1 more reply

DeathArrow2y ago

>I hear these complaints and can't see how this is worse than the pre-AI situation. How is an AI "hallucination" different from human-generated works that are just plain wrong, or otherwise misleading?

With humans there is a chance you get things right.

bananapub2y ago

> How is an AI "hallucination" different from human-generated works that are just plain wrong, or otherwise misleading?

yikes, mate, you've really misunderstood what's happening.

when a human fucks up, a human has fucked up. you can appeal to them, or to their boss, or to their CEO.

the way these crappy "AI" systems are being deployed, there is no one to appeal to and no process for unfucking things.

yes, this is not exactly caused by AI, it's caused by sociopaths operating businesses and governments, but the extent to which this enabled them and their terrible disdain for the world is horrifying.

this is already happening, of course - Cathy O'Neil wrote "Weapons Of Math Destruction" in 2016, about how unreviewable software systems were screwing people, from denying poor people loans to harsher sentencing for minority groups, but Sam Altman and the new generation of AI grifters now want this to apply to everything.

rolandog2y ago

> or make surgery decisions?

  Analyzing surgical field...
  Identified: open chest cavity, exposed internal organs
  Organs appear gooey, gelatinous, translucent pink
  Comparing to database of aquatic lifeforms...
  93% visual match found:
  Psychrolutes marcidus, common name "blobfish"
  Conclusion: Blobfish discovered inhabiting patient's thoracic cavity
  Recommended action: Attempt to safely extract blobfish without damaging organs

1 more reply

GuardianCaveman2y ago

I was in a counter-intelligence unit briefly and there was a mathemtician who spoke to us about the work they were doing to pick targets with the idea that if you can only out one person, who would be the most disruptive. You have all these interconnected but mostly isolated terrorist cells that don't know about each other except through a few people who may not be high up in the command but who are critical for the continuing cohesive existence of that terrorist group of cells and logistics etc.

So the military already was using math to pick targets, this is just the next logical step, albeit, scary as hell step.

jspank2y ago

In your scenario there were still individuals accountable for the decisions and their outcomes.

How are you supposed to say why a machine learning model produces different outputs from the same input? It's just a black box.

antihero2y ago

It is being used to pick military targets, with very little oversight.

https://www.972mag.com/lavender-ai-israeli-army-gaza/

Arn_Thor2y ago

If any regulator acts it will be the EU. The action, if it comes, will of course be very late, possibly years from now, when the horse has long left the stable.

sschueller2y ago

My only hope for the EU government is that they put and AI in charge and it accidentally becomes sentient...

HarHarVeryFunny2y ago

Israel is already doing exactly that... using AI to identify potential targets based on their network of connections, giving these potential targets a cursory human screening, then OK-ing the bombing of their entire family since they have put such faith (and/or just don't care) in this identification process that these are considered high-value targets where "collateral damage" is accepted.

1 more reply

Dumblydorr2y ago

Surgeons don’t need a text based LLM to make decisions. They have a job to do and a dozen years of training into how to do it. They have 8 years of schooling and 4-6 years internship and residency. The tech fantasy that everyone is using these for everything is a bubble thought. I agree with another comment, this is Doomerism.

CuriouslyC2y ago

Surgeons are using robots that are far beyond fly by wire though, to the point that you could argue they're instructing the robots rather than controlling them.

ComplexSystems2y ago

Why would the military use ChatGPT or depend on any way on Openai 's policy? Wouldn't they just roll their own?

fragmede2y ago

OpenAI is. Their TOS says don't use it for that kind of shit.

https://openai.com/policies/usage-policies/

tsimionescu2y ago

That's the license for the public service. Nothing prevents them from selling it as a separate package deal to an army.

hehdhdjehehegwv2y ago

Right now insurance companies make those decisions based on how your life affects the profit/loss statement at the end of the quarter. (In the USA).

So it can’t really be worse if there’s just a RNG in a box. It may be better.

ethbr12y ago

I get a good chuckle every morning when the "C3.ai" ad rolls on NPR.

"Hallucination-free," indeed.

Would love to know what actual, contractual guarantees they place around that.

histories2y ago

> OpenAI is selling this a a tutor for your kids.

The Diamond Age.

ipsin2y ago

That's what I find most offensive about the use of LLMs in education: it can readily produce something in the shape of a logical argument, without actually being correct.

I'm worried that a generation might learn that that's good enough.

Kostchei2y ago

a generation of consultants is already doing that- look at the ruckus around PWC etc in Australia. Hell, look at the folks supposedly doing diligence on Enron. This is not new. People lie, fib and prevaricate. The fact the machines trained on our actions do the same thing should not come as a shock. If anything it strikes me as the uncanny valley of truthiness.

bobosha2y ago

https://pessimistsarchive.org/

chazeon2y ago

It seems US and China are trying to reach an agreement to use AI to pick military targets these days.

farmdve2y ago

Next it's going to teach them the Earth is flat and there are aliens behind the moon.

DeathArrow2y ago

>Who is going to prevent this from being used to pick military targets

When AI is in charge of controlling weapons, you get this: https://www.accessnow.org/publication/artificial-genocidal-i...

gdubs2y ago

While this is clearly a problem and a challenge to address, the thing that never gets mentioned with this line of criticism is the obvious: a large number of real-life teachers make mistakes ALL the time. They harbor wrong / out-dated opinions, or they're just flat-out wrong about things.

nvarsj2y ago

I’ve had coworkers suggest a technical solution that was straight up fabricated by an LLM and made no sense. More competent people realise this limitation of the models and can use them wisely. Unfortunately I expect to see the former spread.

meindnoch2y ago

I've spent a few hours last week crafting a piece of code for my coworker, and then when I asked him to test it in the real environment, it turned out that the API he wanted to connect to the code I gave him was just a hallucination by ChatGPT.

denvrede2y ago

We had a manager joining last year which, on their first days, created MRs for existing code bases, wrote documents on new processes and gave advice on current problems we were facing. Everything was created by LLMs and plain bullshit. Fortunately were able to convince the higher ups that this person was an imposter and we got rid of them.

I really hope that these type of situations won't increase because the mental strain that put on some people in the org is not sustainable in the long run.

booleandilemma2y ago

People aren't dumb. They'll catch on pretty quick that this thing is BS'ing them.

causality02y ago

I don't understand OpenAI's pricing strategy. For free I can talk to GPT 3.5 on an unlimited basis, and a little to GPT 4o. If I pay $20 a month, I can talk to GPT 4o eighty times every three hours, or once every two and a half minutes. That's both way more than I need, and way less than I would expect for twenty dollars a month. I wish they had a $5 per month tier that included, say, eighty messages per 24-hours.

hackerlight2y ago

It'll make more sense when they deploy audio and image capability to paying users only, which they say they're going to do in a few weeks

causality02y ago

Yeah, but I want a tier where I have access to it in a pinch, but won't feel guilty for spending the money and then going a whole month without using it.

olddustytrail2y ago

Guilty? Over $20 a month? I spend more than that in an hour down the pub.

1 more reply

whereismyacc2y ago

I always thought it seemed likely that most needle in a haystack tests might run into the issue of the model just encoding some idea of 'out of place-ness' or 'significance' and querying on that, rather than actually saying something meaningful about generalized retrieval capabilities. Does that seem right? Is that the motivation for this test?

tartrate2y ago

Are there any prompts/tests about recalling multiple needles (spread out) at once?

For example, each needle could be a piece to a logic puzzle.

ammar_x2y ago

The article compares GPT-4o to Sonnet from Anthropic. I'm wondering how Opus would perform at this test?

throw73812y ago

Anyone has done any benchmarks for RAG yet?

ionwake2y ago

I am in England, do US users have access to memory features? ( Also do you ahve access to voice customisation yet?

Thanks

rob1372y ago

I am in England, on the 'Team Plan'* and got access to memory this week.

* https://openai.com/index/introducing-chatgpt-team/

ionwake2y ago

Thank you!

sumedh2y ago

memory features are available in Australia.

nickca2y ago

Would love to see Gemini there too!

cararemixed2y ago

What's the chance that these limericks are now in the training set? As others mention, it'd be interesting to come up with a way to synthesize something sufficiently interesting so it always evades training fit.

sftombu2y ago

Previous answer to this question:

https://news.ycombinator.com/item?id=40361419

causal2y ago

Your test is a good one but the point still stands that a novel dataset is the next step to being sure.

dontupvoteme2y ago

One could also programmatically (e.g. with nltk or spacy, replace nouns, named entities, etc) modify the dataset, even up to the point that every test run is unique.

You could also throw in vector similarity if you wanted to keep words as more synonyms or antonyms.

asadm2y ago

I have had good experience with Gemini 1M context model with this kind of tasks.

croes2y ago

>Needle in a Needlestack is a new benchmark to measure how well LLMs pay attention to the information in their context window

I asked GPT-4o for JavaScript code and got Python, so much for attention.

kolinko2y ago

What was your query?

rguptill2y ago

We also need a way to determine where a given response fits in the universe of responses - is it an “average” answer or a really good one

edmara2y ago

If you have an evaluation function which does this accurately and generalizes, you pretty much already have have AGI.

m3kw92y ago

One could have LLM to route it to a text search function and have the function report back to the LLM for secondary processing.

dmose22y ago

It's interesting (though perhaps not surprising) to see the variance in curve shape across models.

m3kw92y ago

I thought google Gemini had almost perfect needle in haystack performance inside 1 million tokens?

sftombu2y ago

The reason I made Needle in a needlestack is the LLMs are getting to good at needle in a haystack. Until GPT-4o, no model was good at the NIAN benchmark.

DeathArrow2y ago

I wonder how llama3 is doing.

pojzon2y ago

Meh still for a lot of stuff it simply lies.

Just today it lied to me about VRL language syntax, tryin to sell me some python stuff in there.

Senior ppl will often be able call out the bullshit, but I believe for junior ppl it will be very detrimental.

Nether the less amazing tool for d2d work if you can call out BS replies.

8thcross2y ago

These benchmarks are becoming like the top 10 lists you find on the internet. I agree that everything has a space, but frankly how many of us need a test that tells you that this is great at limericks?

EGreg2y ago

I think large language models can be used to classify people, lying, or saying, rehearsed, things or being disingenuous. Simply train them on a lot of audio of people talking, and they would become better than most polygraph machines. There’s something about how a person says something that quickly reveals that it was rehearsed earlier, or premeditated, and I’m sure when they’re lying there can be things like that too. the LLM can instantly pick up with some probability and classify it

I’ve seen claims during open AI demo that is there software can now pick up on extremely subtle emotional clues, how people speak. Then, it shouldn’t take much more to make it read between the lines and understand what people are intending to say, for example, by enumerating all possible interpretations and scoring them based on, many factors, including the current time, location, etc. In fact, by taking into account so much context in factors, the LLM‘s will be better than people the vast majority of the time understanding what a person meant, assuming they were genuinely trying to communicate something.

it will become very hard to lie because everyone’s personal LLM will pick up on it fairly quickly, and find tons of inconsistencies, which it will flag for you later. You will no longer be fooled so easily, and if it has the context of everything the person has said publicly, plus if the person gives permission for your LLM to scan everything they’ve said privately because you’re their Business partner or sexual partner, it can easily catch you in many lies and so on.

I predict that in the next 5 to 10 years, human society will completely change as people start to prefer machines to other people, because they understand them so well, and taken into account, the context of everything they’ve ever said. They will be thoughtful, remembering details about the person in many different dimensions, and use them to personalize everything. By contrast, the most thoughtful husband or boyfriend will seem like, a jerk seems now. Or a cat.

Humor and seductive conversation, will also be at a superhuman standards. People will obviously up their game too, just like when they do when playing the game go after Lee Sedol was totally destroyed by Alpha go, or when people start using Alpha Zarro to train for Chess. However, once the computers understand what triggers people to laugh or have sexual response, they will be able to trigger them a lot more predictively, they simply need more training data.

And bullshitting will be done on a completely different level. Just like people no longer walk to destinations but use cars to go thousands of miles a year, similarly people won’t interact with other people so much anymore. The LLM’s, trained to bullshit 1000 times better than any human, Will be undetectable and gradually shift public opinion as open source models will power swarms of accounts.

j / k navigate · click thread line to collapse

239 comments

irthomasthomas2y ago

This is based on a limericks dataset published in 2021. https://zenodo.org/records/5722527

I think it very likely that gpt-4o was trained on this. I mean, why would you not? Innnput, innnput, Johnny five need more tokens.

I wonder why the NIAN team don't generate their limericks using different models, and check to make sure they're not in the dataset? Then you'd know the models couldn't possibly be trained on them.

sftombu2y ago

cma2y ago

djsjajah2y ago

sftombu2y ago

These pursuits are shaped by the current political climate, global trends, and the specific priorities of the leaders in question. Would you like more detailed information on any of these areas?"

4 more replies

dontupvoteme2y ago

It would be interesting to know how it acts if you ask it about one that isn't present, or even lie to it (e.g. take a limerick that is present but change some words and ask it to complete it)

Maybe some models hallucinate or even ignore your mistake vs others correcting it (depending on the context ignoring or calling out the error might be the more 'correct' approach)

Using limericks is a very nifty idea!

neverokay2y ago

Why not just generate complete random stuff and ask it to find stuff in that?

Kostchei2y ago

dontupvoteme2y ago

NIAN is a very cool idea, but why not simply translate it into N different languages (you even can mix services, e.g. deepl/google translate/LLMs themselves) and ask about them that way?

internet1010102y ago

No disassemble!

bearjaws2y ago

I just used it to compare two smaller legal documents and it completely hallucinated that items were present in one and not the other. It did this on three discrete sections of the agreements.

Using ctrl-f I was able to see that they were identical in one another.

Obviously this is a single sample but saying 90% seems unlikely. They were around ~80k tokens total.

carlosbaraza2y ago

akomtu2y ago

Everyone is trying to use Language Models as Reasoning Models because the latter haven't been invented yet.

fnordpiglet2y ago

That’s not needle in a haystack.

HarHarVeryFunny2y ago

That's a different test than needle-in-a needlestack, although telling in how brittle these models are - competent in one area, and crushingly bad in others.

1970-01-012y ago

I've done the same experiment with local laws and caught GPT hallucinating fines and fees! The problem is real.

tmaly2y ago

Imagine if they started using LLMs to suggest prison sentences

Aerbil3132y ago

Interesting, because the (at least the official) context window of GPT-4o is 128k.

davedx2y ago

> Obviously this is a single sample but saying 90% seems unlikely.

This is such an anti-intellectual comment to make, can't you see that?

You mention "sample" so you understand what statistics is, then in the same sentence claim 90% seems unlikely with a sample size of 1.

The article has done substantial research

dkjaudyeqooe2y ago

That fact that it has some statistically significant performance is irrelevant and difficult to evaluate for most people.

He's a much simpler and correct description that almost everyone can understand: it fucks up constantly.

Getting something wrong even once can make it useless for most people. No amount of pedantry will change this reality.

davedx2y ago

This isn't pedantry, it's science.

lopuhin2y ago

bckr2y ago

KeplerBoy2y ago

Those models still can't reliably do arithmetic, so how could it possibly know that number unless it's a commonly repeated fact?

Also: would you expect random people to fare any better?

bckr2y ago

It used web search (RAG over the entire web) and analysis (math tool) and still came up with the wrong answer.

It has done more complex things for me than this and, sometimes, gotten it right.

Yes, it’s supposed to be able to do this.

chrischen2y ago

Arithmetic just happens to be something we can easily and reliably verify, so it becomes painfully obvious when LLMs are just stringing together some words that sound like the right answer.

kylebenzle2y ago

What you are asking an llm to do here makes no sense.

potatoman222y ago

Why not? It seems like a natural language understanding task

marshray2y ago

You haven't seen the promotion of the use of LM AI for handling legal documents?

It's purported to be a major use case.

cmrdporcupine2y ago

You might be right but I've lost count of the number of startups I've heard of trying to do this for legal documents.

thorum2y ago

RULER is a much better test:

https://github.com/hsiehjackson/RULER

WhitneyLand2y ago

Maybe, but

1. The article is not about NIHS it’s their own variation so it could be more relevant.

2. The whole claim of the article is that Gpt4o does better, but the test your pointing to hasn’t benchmarked it.

sftombu2y ago

The models benchmarked by RULER do worse in needle in a needlestack. It will be interested to see how 4o does with RULER.

19h2y ago

nsagent2y ago

See BooookScore (https://openreview.net/forum?id=7Ttk3RzDeu) which was just presented at ICLR last week and FABLES (https://arxiv.org/abs/2404.01261) a recent preprint.

theptip2y ago

I suppose the question then is - if you finetune on your own data (eg internal wiki) does it then retain the near-perfect recall?

Could be a simpler setup than RAG for slow-changing documentation, especially for read-heavy cases.

k__2y ago

"if you finetune on your own data (eg internal wiki) does it then retain the near-perfect recall"

No, that's one of the primary reasons for RAG.

1 more reply

robbiep2y ago

How far off am I?

int_19h2y ago

2 more replies

Salgat2y ago

Remember, it's also trained on countless internet discussions and papers on the book.

westurner2y ago

HN post re: FABLES: https://news.ycombinator.com/item?id=39982362

FABLES/booklist.md: https://github.com/mungg/FABLES/blob/main/booklist.md

/gscholar_citations? BoookScore: https://scholar.google.com/scholar?cites=1796862036168524911...

...

From that one day awhile ago: https://news.ycombinator.com/item?id=38347868#38354679 :

> "LLMs cannot find reasoning errors, but can correct them" [ https://arxiv.org/abs/2311.08516 ] https://news.ycombinator.com/item?id=38353285

Fernicia2y ago

But this content is presumably in its training set, no? I'd be interested if you did the same task for a collection of books published more recently than the model's last release.

19h2y ago

Screenshot of the PDF with the relevant sentence highlighted: https://i.imgur.com/G3FnYEn.png

[0] https://www.routledge.com/Advances-in-Green-and-Sustainable-...

jiggawatts2y ago

Ask it what material absorbs “infrared light” efficiently.

To me, that’s useful intelligence. I can already search text for verbatim matches, I want the AI to understand that “thermal radiations” and “infrared light” are the same thing.

2 more replies

kaibee2y ago

Honestly I think testing these on fiction books would be more impressive. The graphene thing I'm sure shows up in some research papers.

a_wild_dandan2y ago

ben_w2y ago

This doesn't mean you're wrong, though.

sebzim45002y ago

It's pretty easy to confirm that copywritten material is in the training data. See the NYT lawsuit against OpenAI for example.

1 more reply

DominikPeters2y ago

Just put the 2500 example linked on the article through Gemini 1.5 Flash and it answered correctly ("The tree has diseased leaves and its bark is peeling.") https://aistudio.google.com/

sftombu2y ago

Interesting!

parrtOP2y ago

19h2y ago

parrtOP2y ago

That definitely makes it seem like it's noticing a great deal of its context window. impressive.

causality02y ago

Man, we are like 2-5 years away from being able to feed in an ePub and get an accurate graphic novel version in minutes. I am so ready to look at four thousand paintings of Tolkien trees.

sftombu2y ago

If I had access to Gemini with a reasonable token rate limit, I would be happy to test Gemini. I have had good results with it in other situations.

cj2y ago

underlines2y ago

Such tasks don't need a large context window. Just good RAG.

youssefabdelm2y ago

Someone needs to come up with a "synthesis from haystack" test that tests not just retrieval but depth of understanding, connections, abstractions across diverse information.

When a person reads a book, they have an "overall intuition" about it. We need some way to quantify this. Needle in haystack tests feel like a simple test that doesn't go far enough.

jddj2y ago

An elaborate Agatha Christie style whodunit, with a series of plot-twists and alibis which can be chopped off the end of the piece to modify who is the most likely suspect

jddj2y ago

Or a spot the difference.

Generate 1000 generic facts about Alice and the same 1000 facts about Eve. Randomise the order and change one minor detail then ask how they differ.

youssefabdelm2y ago

That seems to go back in the direction of needle in the haystack again

pushedx2y ago

    sort alice.txt | diff - <(sort eve.txt)

That's not a task for an LLM

2 more replies

visarga2y ago

The needles form a graph and the prompt asks graph based tasks.

sftombu2y ago

That is an interesting idea

Eisenstein2y ago

semi-extrinsic2y ago

Just use memes. People generate new high-quality niche memes so fast it's impossible for the LLMs to keep up.

visarga2y ago

You can only use it for a short while, they get a copy as well.

Eisenstein2y ago

I have been thinking about this for use in evaluating locally run models, so I didn't make that connection in this case. I guess it would have limited utility.

sftombu2y ago

borgdefense2y ago

There is no understanding, it can't do this.

GPT4o still can't do the intersection of two different ideas that are not in the training set. It can't even produce random variations on the intersection of two different ideas.

nebula88042y ago

adamgordonbell2y ago

I've been thinking about that as well.

"What does this work says about our culture? Support your answer with direct quotes."

Interpretation is hard I guess.

1 more reply

segmondy2y ago

Why can't you be that someone?

gremlinsinc2y ago

lol, made me think of the euphemism: be the change you want to see.

yatz2y ago

ijidak2y ago

Do you have an example of this? I would love to learn more.

yatz2y ago

Here is the entire prompt. I used rules to ensure the formatting is consistent as otherwise sometimes it might format date one way and other times in an entirely different way.

balder19912y ago

I guess you just need to offer a template in the prompt? Then maybe some validation after.

yatz2y ago

No templates, just some rules and the model does the rest. It worked like a charm, even gave me ideas on how to layout and format the page to make it easy to read.

parrtOP2y ago

The article shows how much better GPT-4o is at paying attention across its input window compared to GPT-4 Turbo and Claude-3 Sonnet.

mianos2y ago

I used 4o last night and it was still perfectly aware of a C++ class I pasted 20 questions ago. I don't care about smart, I care about useful and this really contributes to the utility.

whimsicalism2y ago

Increasingly convinced that nobody on the public internet knows how to do actual LLM evaluations.

tedeh2y ago

I'm just glad that we are finally past the "Who was the 29th president of the United States" and "Draw something in the style of Van Gogh" LLM evaluation test everyone did in 2022-2023.

petulla2y ago

You need to know that this test set data wasn't included in the training data for this to be meaningful.

sftombu2y ago

trifurcate2y ago

a_wild_dandan2y ago

No you don't. Compare the model's performance before and after uploading the material.

sftombu2y ago

Previous answer to this question:

https://news.ycombinator.com/item?id=40361419s

sumedh2y ago

No such item.

lmeyerov2y ago

I thought the test limericks were autogenerated?

sftombu2y ago

They come from a database of 98k limericks -- https://zenodo.org/records/5722527

personjerry2y ago

That's great to hear. My biggest issue with GPT-4.0 was that as the conversation got longer, the quality diminished (especially relevant for coding projects)

I wonder if it'll be better now. Will test today.

throwthrowuknow2y ago

That’s been my experience so far. My current conversations are crazy long compared to any of my gpt4 convos which I had to frequently copy context from and start over in a new chat

sftombu2y ago

I had the same experience. With a 16k prompt, Turbo was nearly flawless. But it wasn't very good at 32k and not usable at 100+. You have to repeat information to get good results with longer prompts

itissid2y ago

How Do we know that gpt-4o.has not been trained on this dataset?

sftombu2y ago

Previous answer to this question:

https://news.ycombinator.com/item?id=40361419

throwthrowuknow2y ago

demilich2y ago

Agreed

feverzsj2y ago

LLMs are still toys, no one should treat them seriously. Apparently, the bubble is too massive now.

infecto2y ago

We have businesses getting real value from these toys. Maybe you have not been in the right circles to experience this?

feverzsj2y ago

Of course you can get value from toy business, but toys are toys.

1 more reply

nopromisessir2y ago

Used toys to write a working machine vision project over last 2 days.

Key word: working

SiempreViernes2y ago

> The time saved is hard to even measure. It's big.

You are aware that this is an obvious contradiction, right? Big times savings are not hard to measure.

nopromisessir2y ago

Right... With precision...

Furthermore... big mountains are easier to weigh v small individual atoms? I think it's a little more complicated than big is easy to measure...

I care little about the precision... I've got other priorities. It's the same as the time the internet saves me... Big. It's obvious.

I stand by my statement. It's hard to measure...

cdelsolar2y ago

Must be a pretty cool toy; it constantly 10X’s my productivity.

nopromisessir2y ago

You said it mate. I feel bad for folks who turn away from this technology. If they persist... They will be so confused why they get repeatedly lapped.

I wrote a working machine vision project in 2 days with these toys. Key word: working... Not hallucinated. Actually working. Very useful.

SubiculumCode2y ago

1 more reply

davedx2y ago

I just don't understand why AI is so polarising on a technology website.

1 more reply

SiempreViernes2y ago

Without details that's a meaningless stat, I remember some pytorch machine vision tutorials promising they'll only take like an hour, including training and also gives a working project at the end.

davedx2y ago

People are deliberately self selecting themselves out of the next industrial revolution. It's Darwin Awards for SWE careers. It's making me ranty.

sschueller2y ago

We are all so majorly f*d.

The general public does not know nor understand this limitation. At the same time OpenAI is selling this a a tutor for your kids. Next it will be used to test those same kids.

Who is going to prevent this from being used to pick military targets (EU law has an exemption for military of course) or make surgery decisions?

kromokromo2y ago

Using LLMs for picking military targets is just absurd. In the future, someone might use some other variation of AI for this but LLMs are not very effective on this.

dbspin2y ago

AI is already being used for picking targets in warzones - https://theconversation.com/israel-accused-of-using-ai-to-ta....

throwthrowuknow2y ago

2 more replies

mike_hearn2y ago

Note that the IDF explicitly denied that story:

https://www.idf.il/en/mini-sites/hamas-israel-war-24/all-art...

4 more replies

wolfd2y ago

It’s absurd but LLMs for military targets is absolutely something that some companies are trying to sell regardless of the many known failure modes.

https://www.bloomberg.com/news/newsletters/2023-07-05/the-us...

https://youtu.be/XEM5qz__HOU

goopthink2y ago

lhoff2y ago

>Using LLMs for picking military targets is just absurd. In the future

I guess the future is now then: https://www.theguardian.com/world/2023/dec/01/the-gospel-how...

Excerpt:

>Aviv Kochavi, who served as the head of the IDF until January, has said the target division is “powered by AI capabilities” and includes hundreds of officers and soldiers.

>In an interview published before the war, he said it was “a machine that produces vast amounts of data more effectively than any human, and translates it into targets for attack”.

agos2y ago

nothing in this says they used an LLM

2 more replies

coldtea2y ago

>Using LLMs for picking military targets is just absurd

You'd be surprised.

Not to mention it's also used for military and intelligence "analysis".

>using an LLM for high risk tasks like healthcare and picking targets in military operations still feels very far away

When immaturity and unfitness for purpose has ever stopped companies selling crap?

exe342y ago

> picking targets in military operations

mlnj2y ago

'Quotas each day to find targets to kill'.

That's a brilliant and sustainable strategy. /s

ExoticPearTree2y ago

currymj2y ago

CWuestefeld2y ago

UncleMeat2y ago

"Sorry, computer says no."

Humans can be wrong, but they aren't able to be wrong at as massive of a scale and they often have an override button where you can get them to look at something again.

When you have an AI deployed system and full automation you've got more opportunities for "I dunno, the AI says that you are unqualified for this job and there is no way around that."

skywhopper2y ago

CWuestefeld2y ago

Pretty much every element of the above statements is false. Heck, either your response to me, or this reply, seem to be examples showing that the first one is wrong.

Sophira2y ago

Society has spent literal decades being convinced to put their trust in everything computers do. We're now at the point that, in general, that trust is there and isn't misplaced.

[0] https://i.imgur.com/6wbgy2L.jpeg

lnxg33k12y ago

Probably the main difference is that humans fail at smaller scale, with smaller effects, and build a reputation, probably chatgpt hallucinations can potentially affect everyone

moralestapia2y ago

Humans know when they've made a mistake. So there's ways to deal with that.

Computers are final. You don't want things to be final when your life's on the line.

olddustytrail2y ago

> Humans know when they've made a mistake.

You'll never make senior management with that attitude. At worst, "mistakes were made" and look a bit sad.

unclebucknasty2y ago

>There's no fundamental qualitative difference here...degree either.

I've heard the same comparisons made with self-driving cars (i.e. that humans are fallible, and maybe even more error-prone).

And, I don't believe it's reasonable to expect people to give up control to AI of this quality, particularly in matters of safety or life and death; really anything that matters.

TLDR; Most people don't want to gamble their lives on a statistic, when the alternative is maintaining control.

chaorace2y ago

tifik2y ago

1 more reply

DeathArrow2y ago

With humans there is a chance you get things right.

bananapub2y ago

> How is an AI "hallucination" different from human-generated works that are just plain wrong, or otherwise misleading?

yikes, mate, you've really misunderstood what's happening.

when a human fucks up, a human has fucked up. you can appeal to them, or to their boss, or to their CEO.

the way these crappy "AI" systems are being deployed, there is no one to appeal to and no process for unfucking things.

rolandog2y ago

> or make surgery decisions?

  Analyzing surgical field...
  Identified: open chest cavity, exposed internal organs
  Organs appear gooey, gelatinous, translucent pink
  Comparing to database of aquatic lifeforms...
  93% visual match found:
  Psychrolutes marcidus, common name "blobfish"
  Conclusion: Blobfish discovered inhabiting patient's thoracic cavity
  Recommended action: Attempt to safely extract blobfish without damaging organs

1 more reply

GuardianCaveman2y ago

So the military already was using math to pick targets, this is just the next logical step, albeit, scary as hell step.

jspank2y ago

In your scenario there were still individuals accountable for the decisions and their outcomes.

How are you supposed to say why a machine learning model produces different outputs from the same input? It's just a black box.

antihero2y ago

It is being used to pick military targets, with very little oversight.

https://www.972mag.com/lavender-ai-israeli-army-gaza/

Arn_Thor2y ago

If any regulator acts it will be the EU. The action, if it comes, will of course be very late, possibly years from now, when the horse has long left the stable.

sschueller2y ago

My only hope for the EU government is that they put and AI in charge and it accidentally becomes sentient...

HarHarVeryFunny2y ago

1 more reply

Dumblydorr2y ago

CuriouslyC2y ago

Surgeons are using robots that are far beyond fly by wire though, to the point that you could argue they're instructing the robots rather than controlling them.

ComplexSystems2y ago

Why would the military use ChatGPT or depend on any way on Openai 's policy? Wouldn't they just roll their own?

fragmede2y ago

OpenAI is. Their TOS says don't use it for that kind of shit.

https://openai.com/policies/usage-policies/

tsimionescu2y ago

That's the license for the public service. Nothing prevents them from selling it as a separate package deal to an army.

hehdhdjehehegwv2y ago

Right now insurance companies make those decisions based on how your life affects the profit/loss statement at the end of the quarter. (In the USA).

So it can’t really be worse if there’s just a RNG in a box. It may be better.

ethbr12y ago

I get a good chuckle every morning when the "C3.ai" ad rolls on NPR.

"Hallucination-free," indeed.

Would love to know what actual, contractual guarantees they place around that.

histories2y ago

> OpenAI is selling this a a tutor for your kids.

The Diamond Age.

ipsin2y ago

That's what I find most offensive about the use of LLMs in education: it can readily produce something in the shape of a logical argument, without actually being correct.

I'm worried that a generation might learn that that's good enough.

Kostchei2y ago

bobosha2y ago

https://pessimistsarchive.org/

chazeon2y ago

It seems US and China are trying to reach an agreement to use AI to pick military targets these days.

farmdve2y ago

Next it's going to teach them the Earth is flat and there are aliens behind the moon.

DeathArrow2y ago

>Who is going to prevent this from being used to pick military targets

When AI is in charge of controlling weapons, you get this: https://www.accessnow.org/publication/artificial-genocidal-i...

gdubs2y ago

nvarsj2y ago

meindnoch2y ago

denvrede2y ago

I really hope that these type of situations won't increase because the mental strain that put on some people in the org is not sustainable in the long run.

booleandilemma2y ago

People aren't dumb. They'll catch on pretty quick that this thing is BS'ing them.

causality02y ago

hackerlight2y ago

It'll make more sense when they deploy audio and image capability to paying users only, which they say they're going to do in a few weeks

causality02y ago

Yeah, but I want a tier where I have access to it in a pinch, but won't feel guilty for spending the money and then going a whole month without using it.

olddustytrail2y ago

Guilty? Over $20 a month? I spend more than that in an hour down the pub.

1 more reply

whereismyacc2y ago

tartrate2y ago

Are there any prompts/tests about recalling multiple needles (spread out) at once?

For example, each needle could be a piece to a logic puzzle.

ammar_x2y ago

The article compares GPT-4o to Sonnet from Anthropic. I'm wondering how Opus would perform at this test?

throw73812y ago

Anyone has done any benchmarks for RAG yet?

ionwake2y ago

I am in England, do US users have access to memory features? ( Also do you ahve access to voice customisation yet?

Thanks

rob1372y ago

I am in England, on the 'Team Plan'* and got access to memory this week.

* https://openai.com/index/introducing-chatgpt-team/

ionwake2y ago

Thank you!

sumedh2y ago

memory features are available in Australia.

nickca2y ago

Would love to see Gemini there too!

cararemixed2y ago

sftombu2y ago

Previous answer to this question:

https://news.ycombinator.com/item?id=40361419

causal2y ago

Your test is a good one but the point still stands that a novel dataset is the next step to being sure.

dontupvoteme2y ago

One could also programmatically (e.g. with nltk or spacy, replace nouns, named entities, etc) modify the dataset, even up to the point that every test run is unique.

You could also throw in vector similarity if you wanted to keep words as more synonyms or antonyms.

asadm2y ago

I have had good experience with Gemini 1M context model with this kind of tasks.

croes2y ago

>Needle in a Needlestack is a new benchmark to measure how well LLMs pay attention to the information in their context window

I asked GPT-4o for JavaScript code and got Python, so much for attention.

kolinko2y ago

What was your query?

rguptill2y ago

We also need a way to determine where a given response fits in the universe of responses - is it an “average” answer or a really good one

edmara2y ago

If you have an evaluation function which does this accurately and generalizes, you pretty much already have have AGI.

m3kw92y ago

One could have LLM to route it to a text search function and have the function report back to the LLM for secondary processing.

dmose22y ago

It's interesting (though perhaps not surprising) to see the variance in curve shape across models.

m3kw92y ago

I thought google Gemini had almost perfect needle in haystack performance inside 1 million tokens?

sftombu2y ago

The reason I made Needle in a needlestack is the LLMs are getting to good at needle in a haystack. Until GPT-4o, no model was good at the NIAN benchmark.

DeathArrow2y ago

I wonder how llama3 is doing.

pojzon2y ago

Meh still for a lot of stuff it simply lies.

Just today it lied to me about VRL language syntax, tryin to sell me some python stuff in there.

Senior ppl will often be able call out the bullshit, but I believe for junior ppl it will be very detrimental.

Nether the less amazing tool for d2d work if you can call out BS replies.

8thcross2y ago

EGreg2y ago

j / k navigate · click thread line to collapse