Ask HN: How are you using LLMs for traversing decompiler output?

105 points | mjbale116 | 1y ago | 45 comments

I need to reverse a binary made years ago, and I have zero experience with cpp, so I think it would be a good experiment to get an LLM to help me in any way.


carom 1y ago

Binary Ninja has an AI integration called Sidekick. It has a free trial, but I'm not sure it can be used in the free web version. [1]

In my experience, off-the-shelf LLMs (e.g. ChatGPT) do a pretty poor job with assembly; they cannot reason well about the stack or stack frames.

I think your job will be the same with or without AI: figuring out the data structures and data types a function operates on, and naming variables.

What are you reverse engineering for? For example, getting a full compilable decompilation has different goals than finding vulnerabilities or patching a bug.

1. https://sidekick.binary.ninja/

aidanhs 1y ago

Out of curiosity, what would you say the current state of the art is for full compilable decompilation? This is something I have a vague interest in but I'm not involved enough in the space to be on top of the latest and greatest tooling.

feznyng 1y ago

Echoing IDA but its pricing is a huge PITA if you’re using it in a hobbyist capacity i.e. you don’t have an employer willing to pay for it. Could opt for the home version but that’s a yearly cost and you have to use their cloud decompiler. Ghidra’s your best bet if you want something FOSS and community-driven although not as great at decompilation.

mdaniel 1y ago

Not only the pricing by itself, every story that I've heard about normal people trying to actually give them money is that they actually don't want to sell it to anyone other than big players.

That said, depending on one's needs they do actually offer a slimmed down IDA Free: https://hex-rays.com/ida-free

I actually use AUR to more-or-less track its releases https://aur.archlinux.org/packages/ida-free


carom 1y ago

Most decompilers do not strive for recompilability. [1] I believe there are (or were) some academic projects that aimed for recompilation as a core feature, but it is a hard problem.

On the commercial side, IDA / HexRays [2] is very strong for C-like decompilation. If you're looking at Go, Rust, or even C++ it is going to be a little bit more messy. As other commenters have said, you'll work function-by-function and it is expensive, though the free version does have decompilation (F5) for x86 and x64 (IIRC).

Binary Ninja [3] (no affiliation) is the coolest IMO, they have multiple intermediate representations they lift the assembly through. So you get like assembly -> low level IL -> medium level IL -> high level IL. There are also SSA forms (static single assignment) that can aid in programmatic analyses. The high level IL is very readable but makes no effort to be compilable as a programming language. That being said, Binary Ninja has implemented different "views" on the HLIL so you can show it as pseudo-C, Rust, etc. There is a free online version and the commercial version is cheaper than IDA but still expensive. Good Python API, good UI.
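
For readers who have not met SSA form before, here is a minimal sketch in Python. It is not Binary Ninja's API or IL: the tuple-based toy IR and the `to_ssa` name are made up for illustration, and it only handles straight-line code, so no phi nodes are needed. It shows the core idea that every variable is assigned exactly once, which makes "where did this value come from?" a simple lookup.

```python
def to_ssa(instrs):
    """Rename straight-line (dest, op, args) instructions into SSA form.

    Hypothetical toy IR: each instruction is (dest, op, [args]).
    No control flow is handled, so no phi nodes are generated.
    """
    version = {}  # variable name -> latest version number
    out = []
    for dest, op, args in instrs:
        # Uses refer to the latest version of each variable; names that were
        # never assigned (inputs, constants) are left untouched.
        new_args = [f"{a}_{version[a]}" if a in version else a for a in args]
        version[dest] = version.get(dest, 0) + 1
        out.append((f"{dest}_{version[dest]}", op, new_args))
    return out


program = [
    ("x", "add", ["a", "b"]),
    ("x", "mul", ["x", "2"]),
    ("y", "sub", ["x", "a"]),
]
ssa = to_ssa(program)
for dest, op, args in ssa:
    print(dest, "=", op, ", ".join(args))
```

After renaming, the two assignments to `x` become `x_1` and `x_2`, so the use in the last instruction unambiguously refers to `x_2`.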

Ghidra [4] is the RE framework released by NSA. It is free and open source. It supports a ton of niche architectures. This is what most people use. I think the UI is awful, personally. It has a decompiler, the results are OK. They have an intermediate representation (P-Code) and plugins are in Java (since it is written in Java). I haven't worked much with it.

Most online decompilations you see for old games are likely using Ghidra, some might be using IDA. This is largely a manual process of doing a function at a time and building up the mental map of the program and how things interact.

Also worth mentioning are lifters. There were a few projects that aimed to lift assembly to LLVM IR (compiler framework's intermediate representation), with the idea being that then all your analyses could be written over LLVM IR as a lingua franca. Since it is in LLVM IR, it would be also recompilable and retargetable. [5][6]

1. https://reverseengineering.stackexchange.com/questions/2603/...

2. https://hex-rays.com/ida-free

3. https://binary.ninja/free/

4. https://ghidra-sre.org/

5. https://github.com/avast/retdec

6. https://github.com/lifting-bits/mcsema

r00t- 1y ago

Meta has a foundation model trained on LLVM IR: https://ai.meta.com/research/publications/meta-large-languag...


Retr0id 1y ago

Looking at an individual function, IDA hex-rays output is often recompilable as-is (or with minor modifications), but it won't necessarily be idiomatic, especially if you don't have symbol information.

th0ma5 1y ago

This is what I gather from reverse engineering material I've read and groups I've been around. Hidden state, hidden data structures, hidden automations all abound, and there simply isn't enough detail in the assembler itself to bridge the hardware's internal conceptualization and processes.

JosephRedfern 1y ago

These guys are building foundational models for this purpose: https://reveng.ai/. The results are quite compelling, and they have plugins for your favourite reverse engineering tools.

latexr 1y ago

The domain makes it look like “Revenge AI”. Terrible name. Not as risqué as some others¹ but not as fun or memorable either.

¹ https://www.snopes.com/fact-check/domain-thing/

olalonde 1y ago

I was about to comment that the domain was quite clever...

mxmilkiib 1y ago

oooh, it's Rev Eng *checks tfa*

avg_dev 1y ago

i don't think so (that it is a terrible name); it's a pretty common term https://www.urbandictionary.com/define.php?term=reveng


netsec_burn 1y ago

I made a site to use LLMs to help me with reverse engineering. The output is surprisingly readable, even with C++ classes. Let me know any feedback you might have: https://decompiler.zeroday.engineering/

readyplayernull 1y ago

This is great! With Ghidra I had to look for the corresponding libs of a very specific RISC-V vendor; your SRE did it by itself. You should have your own HN thread on the front page!

btown 1y ago

What kind of file should be uploaded?

netsec_burn 1y ago

The allowed types are a bit misleading. Any binary is accepted, any architecture. You can upload shared objects, ELF executables, PE binaries, etc.

I like to give it bomb executables (reverse engineering challenges) to test it.
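
As a rough illustration of what "what kind of binary is this?" means in practice, here is a small Python sketch that guesses the container format from magic bytes. It is not how the linked site works (that is not described in the thread); the `identify_binary` name is made up, and it deliberately covers only ELF and PE, the two formats named above.

```python
import struct


def identify_binary(data):
    """Best-effort container format guess from magic bytes (ELF and PE only)."""
    if data[:4] == b"\x7fELF":
        # EI_CLASS is the byte at offset 4: 1 = 32-bit, 2 = 64-bit.
        ei_class = data[4] if len(data) > 4 else 0
        bits = {1: "32-bit", 2: "64-bit"}.get(ei_class, "unknown class")
        return f"ELF ({bits})"
    if data[:2] == b"MZ" and len(data) >= 0x40:
        # The DOS header stores the offset of the PE signature at 0x3C.
        (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
        if data[pe_offset:pe_offset + 4] == b"PE\x00\x00":
            return "PE"
        return "MZ (DOS executable or truncated PE)"
    return "unknown"


# Synthetic headers, just enough bytes to exercise each branch.
elf_sample = b"\x7fELF\x02" + bytes(11)
pe_sample = b"MZ" + bytes(0x3A) + struct.pack("<I", 0x40) + b"PE\x00\x00"

print(identify_binary(elf_sample))
print(identify_binary(pe_sample))
print(identify_binary(b"hello"))
```

Anything beyond these two formats falls through to "unknown", which is the honest answer for a check this small.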

mdaniel 1y ago

> Any binary is accepted, any architecture.

One should be careful tossing around the word "any" in relation to executable formats, for there are seemingly an unbounded number of them: https://github.com/1Password/onepassword-sdk-go/blob/v0.1.5/...

Up to you, but currently your polling endpoint just has a boolean, which is likely super easy to cook on the server side but also leads the user left wondering "uh, is this thing on?" in ways that any kind of percentage might not. IOW, how long, exactly, should any sane person wait for it to be {"status":true}?

Also, you have your ELB misconfigured, because trying to upload a binary that takes more than 30 seconds to upload causes the actual POST to puke. I'm sure that's great for hello-world.exe but is absolutely hilarious for any real binary.
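
On the client side, one way to cope with a boolean-only status endpoint is to poll with backoff and a hard deadline, so the "how long should I wait?" question at least has a bounded answer. The sketch below is an assumption-laden illustration in Python: `poll_until_done` and its parameters are invented here, and `fetch_status` stands in for whatever call actually hits the endpoint (no real URL or API is assumed).

```python
import time


def poll_until_done(fetch_status, timeout_s=300.0, first_delay_s=1.0,
                    max_delay_s=30.0, sleep=time.sleep, clock=time.monotonic):
    """Poll a boolean-only status endpoint with backoff and a hard deadline.

    fetch_status is a caller-supplied function (hypothetical) returning a
    dict such as {"status": True}. Returns True when the job reports done,
    False if the deadline would pass before the next attempt.
    """
    deadline = clock() + timeout_s
    delay = first_delay_s
    while True:
        if fetch_status().get("status") is True:
            return True
        if clock() + delay > deadline:
            return False
        sleep(delay)
        delay = min(delay * 2, max_delay_s)  # exponential backoff, capped


# Demo with canned responses and a fake sleep that only records the waits.
responses = iter([{"status": False}, {"status": False}, {"status": True}])
waits = []
done = poll_until_done(lambda: next(responses), sleep=waits.append)
print(done, waits)
```

Injecting `sleep` and `clock` keeps the loop testable without real delays; a progress percentage from the server would still be the better fix.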

__alexander 1y ago

Do you have experience reverse engineering? If not, LLMs are not going to help much. LLMs are useful for aiding the analysis but they don’t do the analysis.

uncomplexity_ 1y ago

Yea this one. If you have solid fundamentals these LLMs are really handy in assisting and never leading.

For example I have a minified javascript file, way obfuscated. I can paste the code and make it break down the initial structure. And then I tell it which parts to focus on and which parts to dig in deeper.

lumb63 1y ago

It has nothing to do with LLMs, but Ghidra is a wonderful tool.

Dwedit 1y ago

Have you tried Ghidra yet? If you still have your debug symbols, then it can do a really good job.

flashgordon 1y ago

Interesting. Wouldn't this actually be a deterministic problem based on graph analysis? I'd have thought LLMs would be more effective taking the output of some graph recognizer and then identifying what those higher-level constructs map to.

warkdarrior 1y ago

Deterministic maybe, but surely undecidable in the general case since you need whole program analysis to understand, for example, the purpose of a memory location. ML may help approximate this undecidable problem.

rgovostes 1y ago

The LLM4Decompile project (https://github.com/albertan017/LLM4Decompile) provides some open models for binary to C decompilation and Ghidra pseudocode refinement, along with some training sets.

RevEng.ai, linked a few times already, discusses their approach here: https://blog.reveng.ai/training-an-llm-to-decompile-assembly...

mahaloz 1y ago

I like using it for library function comments, variable name recovery, and sometimes types. The comments are usually hit or miss, but I find the variable names to be a bit better than auto-generated ones. I implement most of this in my decompiler plugin: https://github.com/mahaloz/DAILA; check it out if you are interested :).
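
To make "variable name recovery" concrete, here is a minimal Python sketch of the last step: applying a set of suggested names to decompiler pseudocode. This is not DAILA's implementation; `apply_renames`, the sample pseudocode, and the rename map (which stands in for whatever an LLM suggested) are all made up for illustration.

```python
import re


def apply_renames(pseudocode, renames):
    """Replace whole identifiers only, so 'v1' does not clobber 'v10'."""
    if not renames:
        return pseudocode
    # One combined pattern and a single pass, so renames cannot chain.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, renames)) + r")\b")
    return pattern.sub(lambda m: renames[m.group(1)], pseudocode)


decompiled = (
    "int sub_401000(char *a1) { int v1 = 0; int v10 = strlen(a1); "
    "return v1 + v10; }"
)
# Hypothetical suggestions, e.g. parsed from a model's reply.
suggested = {
    "sub_401000": "count_bytes",
    "a1": "text",
    "v1": "total",
    "v10": "length",
}
renamed = apply_renames(decompiled, suggested)
print(renamed)
```

A text-level rename like this is only good for reading; inside a real tool you would rename through the decompiler's own API so types and cross-references stay consistent.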

stackghost 1y ago

The Advent of Cyber side quest this year needed some Ghidra and I found Pickman's Model was pretty good at helping me craft a heap exploit from a decompilation.

jkstill 1y ago

I've only played a bit with this, but it was impressive.

https://ghidra-sre.org/

userbinator 1y ago

Unfortunately LLMs are not good at precision and details, which is exactly what you need for the sort of analysis you're trying to do.

menaerus 1y ago

Right. Have a look at the paper above from Meta on how they fine-tuned the Code Llama with LLVM IR to beat the compiler in producing size-optimized binaries.

apatheticonion 1y ago

Inspired by the work out there that reverse engineers game engines, I've always wanted to try my hand at reverse engineering to contribute to the world of game preservation.

Is it actually legal to decompile a game engine from executables/dll files, write new sources by making sense of the output and rewriting it such that it can be compiled targeting modern APIs?

I feel like that must be illegal

feznyng 1y ago

You could use the LLM to help you write utility scripts for whatever disassembler you’re using e.g. python for IDA. That might work better than feeding it raw assembly.

Game RE communities also have all sorts of neat utilities for decompiling large cpp binaries. Skyrim’s community is pretty active with ghidra/ida.

Guessing you’re not lucky enough to have a PDB?

tonetegeatinst 1y ago

PDB?

feznyng 1y ago

Program database file - only relevant if the binary is Windows. But that makes decomp an order of magnitude easier. I’d be surprised if OP had one though.
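
A quick way to see whether a Windows binary was built with a PDB is to look for its CodeView record, which starts with the bytes `RSDS`, followed by a 16-byte GUID, a 4-byte age, and the null-terminated PDB path. The Python sketch below is a naive byte scan under that assumption, not a proper walk of the PE debug directory; `find_pdb_reference` and the sample blob are made up for illustration.

```python
import struct
import uuid


def find_pdb_reference(data):
    """Return (guid, age, pdb_path) from the first 'RSDS' record, or None.

    Naive scan: a stray 'RSDS' byte sequence elsewhere in the file could
    produce a false hit, so treat the result as a hint.
    """
    start = data.find(b"RSDS")
    if start < 0 or start + 24 > len(data):
        return None
    guid = uuid.UUID(bytes_le=data[start + 4:start + 20])
    (age,) = struct.unpack_from("<I", data, start + 20)
    end = data.find(b"\x00", start + 24)
    if end < 0:
        return None
    path = data[start + 24:end].decode("utf-8", errors="replace")
    return guid, age, path


# Synthetic blob standing in for a real executable.
fake_guid = uuid.UUID("12345678-1234-5678-1234-567812345678")
blob = (
    b"MZ" + b"\x00" * 64 + b"RSDS" + fake_guid.bytes_le
    + struct.pack("<I", 3) + b"C:\\build\\game.pdb\x00"
)
result = find_pdb_reference(blob)
print(result)
```

The record only names the PDB the build expected; you still need the matching file itself for the decompiler to load symbols from it.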

klmitchell2 1y ago

https://github.com/radareorg/r2ai

sitkack 1y ago

Do you know the compiler and what the source possibly looks like? I found LLMs are pretty good at recovering code from binaries, they need help though.

If you are able to run the program and collect traces, that will help a ton.

svilen_dobrev 1y ago

cpp? that's a preprocessor. u mean c++?

LLM won't help you much if u can't understand what it's talking about.

Manual way is, given ELF (linux executable format) somexe,

$ strings somexe

$ objdump -d somexe

$ objdump -s -j .rodata somexe

then look+ponder over the results.
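
If you want the `strings` step inside a script (for example to feed selected strings to an LLM along with the disassembly), a rough Python equivalent is sketched below. It only approximates the tool: `extract_strings` is a made-up helper that finds runs of printable ASCII of at least four bytes and ignores wide-character strings.

```python
import re


def extract_strings(data, min_len=4):
    """Yield runs of printable ASCII at least min_len bytes long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    for match in pattern.finditer(data):
        yield match.group().decode("ascii")


# Synthetic bytes standing in for the contents of somexe.
sample = (
    b"\x7fELF\x02\x01\x00\x00/lib64/ld-linux-x86-64.so.2\x00"
    b"\x01\x02abc\x00Enter password:\x00"
)
found = list(extract_strings(sample))
print(found)
```

On a real file you would pass `open("somexe", "rb").read()` instead of the sample bytes.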

and/or running ghidra (as mouse'd UI) over it.. which may help somewhat but not 100%

Keep in mind that on x86, objdump and ghidra show multi-operand instructions in opposite operand order for the same code: objdump defaults to AT&T syntax (mov src,dest), while ghidra uses Intel syntax (mov dest,src).

no idea on (recent) windoze front. IDA ?

u53rn4m3 1y ago

RevEng.AI have their own foundational AI models for decompilation with English language summaries.

seba_dos1 1y ago

Good luck. If that's how you're approaching it, you're going to need it.

2-3-7-43-1807 1y ago

op apparently never even heard about reddit

ianhawes 1y ago

Highly recommend it. I reversed an app with o1 Pro Mode and the analysis of the obfuscated C# code matched up accurately with what I eventually discovered by manually reversing.

chc4 1y ago

Reverse engineering C# is extremely different from C++ binaries.
