Reducing problems to document ranking is effectively a type of test-time search - also very interesting!
I wonder if this approach could be combined with GRPO to create more efficient chain of thought search...
https://github.com/BishopFox/raink?tab=readme-ov-file#descri...
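To make the "ranking as search" framing concrete, here's a minimal sketch: reduce "find the best candidate" to sorting with a pairwise comparator, where each comparison would normally be one LLM call. The `llm_prefers` function below is a hypothetical stand-in (keyword overlap instead of a model) so the sketch runs offline; it is not raink's actual API.

```python
from functools import cmp_to_key

def llm_prefers(query, a, b):
    # Hypothetical stand-in for an LLM pairwise judgment. Here we just
    # score by naive keyword overlap with the query so the sketch runs
    # without any API. Returns negative if a should rank above b.
    def score(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return (score(b) > score(a)) - (score(b) < score(a))

def rank(query, docs):
    # The reduction: sorting with an LLM comparator turns ranking into
    # test-time search, one model call per comparison.
    return sorted(docs, key=cmp_to_key(lambda a, b: llm_prefers(query, a, b)))

docs = [
    "memory safety bug in the parser",
    "print the help text",
    "off-by-one in length check",
]
ranked = rank("memory safety bug", docs)
```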
The LLM companies work on the LLMs, while tens of thousands of startups and established companies work on applying what already exists.
It's not either/or.
I am trying to grok why we want to find the fix - is it to understand what was done so we can exploit unpatched instances in the wild?
Also, re "identifying candidate functions for fuzzing targets": if every function is a document, I get where the list of documents comes from, but what is the query? How do I say "find me the function most suitable for fuzzing"?
Apologies if that’s brusque - trying to fit new concepts in my brain :-)
Maybe you even use the LLM to find vulnerable snippets at the beginning, but a multi-class classifier or embedding model will be way better at runtime.
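A minimal sketch of that runtime path, assuming you've precomputed one embedding centroid per class from LLM-labeled examples (the 2-d vectors here are toy values, not real embeddings): classification becomes one cheap cosine comparison per class instead of a model call.

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def classify(embedding, centroids):
    # Nearest-centroid over precomputed class embeddings: no LLM call
    # at runtime, just one vector comparison per class.
    return max(centroids, key=lambda label: cosine(embedding, centroids[label]))

# Toy 2-d centroids standing in for averaged real embeddings.
centroids = {"vulnerable": [0.9, 0.1], "benign": [0.1, 0.9]}
label = classify([0.8, 0.2], centroids)
```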
awesome-generative-information-retrieval > Re-ranking: https://github.com/gabriben/awesome-generative-information-r...
How'd it perform compared to listwise?
Also curious whether you tried schema-based querying of the LLM (function calling / structured output). I recently tried to have a discussion about this exact topic with someone who posted about pairwise ranking with LLMs.
https://lobste.rs/s/yxlisx/llm_sort_sort_input_lines_semanti...
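For what I mean by schema-based querying, here's a sketch: constrain the pairwise verdict to a JSON schema so parsing is a `json.loads` plus a sanity check instead of regexing free text. The schema shape and field names below are illustrative assumptions, not any particular provider's format.

```python
import json

# Illustrative schema for a pairwise-ranking response: the model must
# pick a winner and may report a confidence in [0, 1].
PAIRWISE_SCHEMA = {
    "type": "object",
    "properties": {
        "winner": {"type": "string", "enum": ["A", "B"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["winner"],
}

def parse_pairwise(raw):
    # With structured output the model is constrained to this shape,
    # so a load plus one sanity check replaces brittle text parsing.
    out = json.loads(raw)
    if out.get("winner") not in ("A", "B"):
        raise ValueError("model returned an invalid winner")
    return out

result = parse_pairwise('{"winner": "A", "confidence": 0.8}')
```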
Should be "document ranking reduces to these hard problems".
I never knew why the convention is like that; it seems backwards to me as well, but that's how it is.
At least bother to read the discussion in the sibling comments.