You know what I meant: How can we have confidence that this implementation of RAR is functionally identical to what it's based on? What would give me the confidence to use it in a critical piece of infrastructure?
Because it's a defined format, you can do binary-exact comparisons between the input and output files. We already have an oracle in the form of the proper RAR software, so if the results are identical, you don't need to look any further for that specific case.
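To make the oracle idea concrete, here is a minimal sketch of a byte-exact comparison. The `oracle` and `candidate` callables are hypothetical stand-ins for "run the reference RAR tool" and "run the reimplementation"; nothing here is taken from the project's actual test suite.

```python
from typing import Callable, Optional


def first_mismatch(a: bytes, b: bytes) -> Optional[int]:
    """Return the offset of the first differing byte, or None if identical."""
    if a == b:
        return None
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # No differing byte in the common part: one input is a prefix of the other.
    return min(len(a), len(b))


def check_against_oracle(
    data: bytes,
    oracle: Callable[[bytes], bytes],     # hypothetical: reference implementation
    candidate: Callable[[bytes], bytes],  # hypothetical: reimplementation under test
) -> Optional[int]:
    """Feed the same input to both sides and compare the outputs byte for byte."""
    return first_mismatch(oracle(data), candidate(data))
```

A `None` result only vouches for that one input, which is exactly the "for that specific case" caveat.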
You can see a version of this that I did quite similarly, for the PostgreSQL wire format, here: https://github.com/pgdogdev/pgdog/tree/main/integration/sql
It validates that SQL run with the same setup, test, and teardown produces exactly the same results against raw PostgreSQL as the control and against various configurations of PgDog, in both the text format and the binary format, so ultimately a 6-way multivariate test that should always produce binary-exact results.
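The shape of that kind of matrix test can be sketched roughly like this. It is not the PgDog suite: the `Runner` type, the target names, and the choice to compare results only within the same wire format are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical runner: takes (sql, wire_format) and returns the raw result bytes.
Runner = Callable[[str, str], bytes]


def run_matrix(
    sql: str,
    control: Runner,                  # e.g. plain PostgreSQL
    candidates: Dict[str, Runner],    # e.g. different proxy configurations
    formats: Sequence[str] = ("text", "binary"),
) -> List[Tuple[str, str]]:
    """Return every (candidate, format) pair whose bytes differ from the control."""
    failures = []
    for fmt in formats:
        expected = control(sql, fmt)  # the control is the oracle for this format
        for name, runner in candidates.items():
            if runner(sql, fmt) != expected:
                failures.append((name, fmt))
    return failures
```

An empty list means every configuration matched the control in every format for that one statement.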
You also know what I meant, since I spelled it out in more detail a comment later. But even though you're being facetious, yes, that really is the case. If it works it works. That's the bar for the vast, vast majority of software, and has been since forever. Demonstrated practical correctness. If you stumble into a bug, you log it as a defect and then either wait for a fix or fix it yourself, depending on the situation. That's all that regular people ever have. In the case of this project, this was achieved via fuzz testing.
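For anyone unfamiliar with what fuzzing against a reference looks like, here is a toy differential fuzzer. It is a sketch under assumptions, not the project's harness: `reference` and `candidate` are hypothetical callables, the mutation strategy is deliberately naive, and real fuzzers are coverage-guided.

```python
import random
from typing import Callable, Optional, Tuple


def mutate(seed: bytes, rng: random.Random, max_flips: int = 4) -> bytes:
    """Overwrite a few randomly chosen bytes of the seed with random values."""
    buf = bytearray(seed)
    for _ in range(rng.randint(1, max_flips)):
        if buf:
            buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)


def outcome(fn: Callable[[bytes], bytes], data: bytes) -> Tuple[str, Optional[bytes]]:
    """Normalise a call to ('ok', output) or ('error', None)."""
    try:
        return ("ok", fn(data))
    except Exception:
        return ("error", None)


def differential_fuzz(
    seed: bytes,
    reference: Callable[[bytes], bytes],  # hypothetical: trusted implementation
    candidate: Callable[[bytes], bytes],  # hypothetical: implementation under test
    iterations: int = 1000,
    rng_seed: int = 0,
) -> Optional[bytes]:
    """Return the first mutated input on which the two sides disagree, else None."""
    rng = random.Random(rng_seed)  # fixed seed keeps a failing run reproducible
    for _ in range(iterations):
        data = mutate(seed, rng)
        if outcome(reference, data) != outcome(candidate, data):
            return data
    return None
```

A `None` here is still only empirical evidence: it says no disagreement was found in that many tries, not that none exists.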
It's literally no different to e.g. validating the NTFS driver that ships in the Linux kernel, or validating any other (re)implementation of anything. You just do a bunch of empirical testing and hope for the best. It is also why reimplementations always lag behind, which I'm not saying isn't a real concern (nor that defects wouldn't be). It's just not a gotcha.
Hell, I'm 99% sure this is exactly what the actual vendor does too, or at least I sure hope they have tests. Cause they're sure as shit not using a formally verified compiler toolchain, meaning they definitely don't have a formal proof that even the official implementation is itself correct. Only empirical data at best too.
I get that this is often the case, but it does feel like we should be able to do better. At least when humans write this code you can have the expectation that there was real intent behind making sure the semantics of the code are aligned with the specification. At least with current language models, they tend to just brute-force test suite acceptance until everything passes, in a way no human developer has the capacity for. Of course this is often how it works with humans too (i.e. the classic Oracle story), but it does feel wrong.
Can we be sure that this method has produced a correct artefact without years of extensive usage? Probably not, hence my reluctance to rely on something like this, at least initially.
There's a lot of chatter lately e.g. about using TLA+ for formal modeling, so that anything downstream can be formally proven. That helps, but then the formal model still needs to be crafted somehow, which means a pass of semantic interpretation.
Going from binary to spec mechanistically via formal proofs would be possible, but only if a formal spec were available for both the binary structure and the ISA. In practice, however, both are just natural-language prose too, meaning another interpretation pass or two. The ISA specs also leave a lot implementation-defined or undefined afaik, for microarchitecture-level optimization freedom.
Netlists, PDKs, and the like might be public for some RISC-V designs these days, but getting the actual chip behavior typically requires EM simulation on a scale that is not possible for any chip performant enough to be of interest. And RISC-V is not a very broadly adopted platform for proprietary consumer software.
Having a human do the semantic mapping is expensive and legally fraught. Having an LLM do it is riskier, but way, way cheaper and currently legally grey. And both can and do make mistakes.
This is why I see this so bleakly. That said, I do also think formats like this are delicate enough that even rudimentary empirical testing should provide a surprisingly decent behavioral coverage. There's a reason that "I can't believe anything ever works at all" is such a common sentiment. Practical usage is a surprisingly powerful gate, and fuzzing in particular is basically that on steroids.
I do nevertheless still secretly get the heebie-jeebies from the Linux NTFS implementation though (me bringing that up was no coincidence).