Their API responses were in some absolutely insane markup language that I'd never seen before. I actually had to spend a good deal of time reading up on the history of markup languages, carefully going through each one to see if the syntax matched.
Eventually I gave up and just had to write a parser myself. The worst bit was that the attributes didn't use quotation marks around the values. So you'd literally have markup like:
<something name=Hello world />
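A tolerant parser for that kind of markup has to guess where an unquoted value ends. As a rough sketch (the regex and the `parse_tag` helper are invented for illustration, not the commenter's actual code), one lossy interpretation is to cut the value at the first whitespace, which turns the trailing word into a bare attribute:

```python
import re

# Matches a tag name followed by whitespace-separated attribute tokens,
# where each token is `key` or `key=value` with no quoting.
TAG = re.compile(r"<(\w+)((?:\s+[^\s=/>]+(?:=[^\s/>]*)?)*)\s*/?>")

def parse_tag(s):
    m = TAG.match(s)
    if not m:
        return None
    name, attr_blob = m.group(1), m.group(2)
    attrs = {}
    for tok in attr_blob.split():
        key, _, value = tok.partition("=")
        attrs[key] = value
    return name, attrs
```

Note the inherent ambiguity: `parse_tag('<something name=Hello world />')` yields `name="Hello"` plus a bare `world` attribute, when the author almost certainly meant `name="Hello world"`.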
It was...fun times.
// At this point, I'd like to take a moment to speak to you about the Adobe PSD format.
// PSD is not a good format. PSD is not even a bad format. Calling it such would be an
// insult to other bad formats, such as PCX or JPEG. No, PSD is an abysmal format. Having
// worked on this code for several weeks now, my hate for PSD has grown to a raging fire
// that burns with the fierce passion of a million suns.
// If there are two different ways of doing something, PSD will do both, in different
// places. It will then make up three more ways no sane human would think of, and do those
// too. PSD makes inconsistency an art form. Why, for instance, did it suddenly decide
// that *these* particular chunks should be aligned to four bytes, and that this alignment
// should *not* be included in the size? Other chunks in other places are either unaligned,
// or aligned with the alignment included in the size. Here, though, it is not included.
// Any one of these three behaviours would be fine. A sane format would pick one. PSD,
// of course, uses all three, and more.
// Trying to get data out of a PSD file is like trying to find something in the attic of
// your eccentric old uncle who died in a freak freshwater shark attack on his 58th
// birthday. That last detail may not be important for the purposes of the simile, but
// at this point I am spending a lot of time imagining amusing fates for the people
// responsible for this Rube Goldberg of a file format.
// Earlier, I tried to get a hold of the latest specs for the PSD file format. To do this,
// I had to apply to them for permission to apply to them to have them consider sending
// me this sacred tome. This would have involved faxing them a copy of some document or
// other, probably signed in blood. I can only imagine that they make this process so
// difficult because they are intensely ashamed of having created this abomination. I
// was naturally not gullible enough to go through with this procedure, but if I had done
// so, I would have printed out every single page of the spec, and set them all on fire.
// Were it within my power, I would gather every single copy of those specs, and launch
// them on a spaceship directly into the sun.
//
// PSD is not my favourite file format.

Probably because, like many other ancient document formats (e.g. MS Office), it was a straight dump of memory structures into a file [1]. Obviously a very bad idea in hindsight (especially given the truckload of deserialization vulns resulting from it), but computers of that age were so memory-constrained that anything else wouldn't cut it, and by the time computers got more powerful the old formats were hopelessly entrenched.
[1] https://www.joelonsoftware.com/2008/02/19/why-are-the-micros...
No updates other than straight-up SDK bumps and recompiles; broken loading of random images on recent macOS/Apple Silicon; they somehow managed to break cropping in one of the two or three updates they did ship; and it's still an Intel binary. Clearly they haven't tested it beyond checking that the app opens.
Really wish Dag had just open-sourced Xee3 instead; my opinion of MacPaw plummeted after seeing how they massacred my boy Xee.
The Archive Browser was equally neglected. At least The Unarchiver still works, which in retrospect was clearly the only app MacPaw wanted to take off Dag's hands.
They also don't want anybody building a dependency on that sh*t, which would prevent them from ever cleaning up the mess.
I speculate that one of the ways this happens is that someone decides or is told to use format Foo. Then they and possible collaborators implement both the writer and the reader for their idea of Foo from scratch, never testing with an off-the-shelf standard parser.
You'd think that doing XML like this is unlikely, given how easily available correct and validating parsers have been. But I've nevertheless seen this with XML too. I speculate that sometimes the programmer is on a platform that doesn't have an easily available off-the-shelf parser/writer, or they simply don't know about it.
I've also seen a variation of this, in half-butted "integrations", like to have a sales check-off feature of "we can generate X". These are sometimes tested only lightly, and sometimes not at all (such as when they don't have access to the tool that uses that format, and they were just working from poor documentation or an example). It's a thing.
I bet this sounds surreal to people visiting this site, but there really are corporations out there running on software written by people who have never heard of XML. Another example is a "database" implementation I have seen in a multi-billion dollar company, which relied on a hierarchy of directories containing JSON files mimicking the tables and rows of a relational DB.
The particular product in question had tens of millions of dollars yearly revenue.
Guilty.
Although in my defence it was during the early days of XML and the platform options had their own problems.
One morning I was working on their login flow - not doing anything crazy, mind you. Just a bit weird; logging in and out, watching the req/res cycles with Charles Proxy. All of a sudden my boss comes over and tells me to stop immediately. Apparently I set off so many alarm bells at the broker that the CTO was woken up (it was 2am where they were). That was a fun gig lol.
A few years ago I was using LiDAR scanners from a manufacturer that didn't provide a Linux driver, only Windows. The way it worked is that you programmed the firmware to fire UDP packets at a specified IP and port, and when the device powered up it would push a continuous stream of data at you. 300,000 points a second.
So I started capturing those UDP packets and decoding them with Python; eventually I had to write a plugin in C to do the high-performance parsing and bit unpacking. But nothing beats that feeling when you're stumped on what a bit of data means and then a eureka moment hits you in the shower, and the project advances!
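The capture-then-decode loop described above can be sketched in a few lines of stdlib Python. The packet layout here is entirely hypothetical (a u16 point count followed by x/y/z float32 triples); the real manufacturer-specific format is exactly what had to be reverse-engineered:

```python
import socket
import struct

def decode_packet(payload: bytes):
    """Hypothetical layout: little-endian u16 count, then `count` (x, y, z) float32 triples."""
    (count,) = struct.unpack_from("<H", payload, 0)
    return [struct.unpack_from("<fff", payload, 2 + 12 * i) for i in range(count)]

def capture(ip="0.0.0.0", port=2368):
    """Bind to the port the firmware was programmed to target and yield decoded packets."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((ip, port))
    while True:
        payload, _addr = sock.recvfrom(65535)
        yield decode_packet(payload)
```

At 300,000 points per second, pure Python like this eventually becomes the bottleneck, which is presumably why the commenter moved the hot parsing path into a C plugin.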
The catch is... I didn't have any Internet connection. I was going to an internet cafe, logging onto the chat server, and chatting, while recording the connection with Wireshark.
At home, I'd print the hex + ASCII connection dump on my dot matrix printer, and used a highlighter and ballpoint pen to mark the fields of the message packet.
Then I'd code something around it, plan new tests, compile a new version of the app, and... take it with me on a hard drive to the internet cafe to test the next day, or next weekend.
I think I was way smarter and goal oriented than I am today.
Only when you are doing it for yourself, or when it's a known undertaking. It can be very frustrating when you are integrating with some hardware, you've told everyone you are ready to ship, you are 99% complete, and the last 1% turns out to be a surprise protocol reverse-engineering project.
Here is an extractor I wrote for Westwood PAK and Lucasarts LFD files when I was 16: https://gist.github.com/ssg/e3e9654612be916336c01e104b10ddc7
I wound up basically re-implementing the software that was listening to create a test suite by the time it was all said and done.
- Kaitai [1], which takes as input a YAML file and can parse binary files based on it (and even generate parsers).
- ImHex [2], which has a pattern language [3] which allows parsing, and it seems more powerful than what Kaitai offers. I stumbled upon some limitations with it but it was still useful.
[1]: https://kaitai.io/
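To give a flavour of the Kaitai approach: you describe the binary layout declaratively in a `.ksy` YAML file and the compiler generates parser classes for Python, C++, and other targets. The format below is a toy archive invented for illustration, not any real spec:

```yaml
meta:
  id: toy_pak
  endian: le
seq:
  - id: magic
    contents: [0x50, 0x41, 0x4b, 0x00]   # "PAK\0"
  - id: num_entries
    type: u4
  - id: entries
    type: entry
    repeat: expr
    repeat-expr: num_entries
types:
  entry:
    seq:
      - id: offset
        type: u4
      - id: len
        type: u4
```

The declarative style pays off during reverse engineering: each new hypothesis about the format is one more field in the spec, and the generated parser immediately tells you whether real files still validate.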
It was a TON of fun doing that. And I learned a lot from that exercise.
Fast forward to a few weeks ago: I "wrote" a parser/serializer for handling knowledge graphs as input/output between my app and LLMs.
I used ChatGPT to walk me through the whole thing. It did a very good job of converting between mermaidjs and an object type I defined in TypeScript. It wrote the code, the unit tests - the whole thing.
I don't understand how it works. The code is great. But not as satisfying.
Good grief. It's like writing assembly: a good exercise, but only for trivial or particularly tricky parts of a program. For everything else, proper tooling (a compiler or a 3D modeler) is the way to go. I fully agree that the best learning experience is to build tools to automate away the annoying parts :-D
In my case, I am trying to unpack MIDI files that have been packaged in a proprietary format by a company called ToonTrack.
They have two product lines: expensive VST instruments and MIDI files designed to be played on those instruments. It's an open secret that you don't need the expensive VST instruments, if you're willing to navigate a somewhat tortured folder hierarchy.
Well, for this new instrument, they thought that they'd be clever and bundle their MIDI packs so that you have to buy the expensive instrument to play it. Also: no refunds.
You can see where this is going...
Where I throw stones at myself is that I have a "perfect is the enemy of good" problem, in that I don't publish writeups that are incomplete, because I keep hoping that if I just spend more holidays banging on it, I'll figure out why the parser only works most of the time.
So I started reverse engineering the Winamp skinning engine with the intention of making an engine that can run it on one of the Linux media apps. I did this by writing short programs and looking at the generated binary.
I had about 95% of it figured out when we had a robbery and the thief took the laptop I was using. By the time I got a new machine I had completely lost interest!
I wouldn't be surprised if it was using a well known VM that I just didn't know about at the time!
Parsing a binary file is tedious, but at least you can progress steadily, whereas with an unknown compression scheme you can never be sure you've even decompressed correctly before you've started decoding the format.
Fortunately this is mostly a theoretical problem. There are very few cases where a custom compression scheme would be more efficient than slapping .zip/.zstd/.tar on it if it ever gets too big.
This is the post: https://news.ycombinator.com/item?id=38772862
> but had no idea as to how the image was compressed. It clearly wasn't compressed with any common compression algorithm. Mercifully unlike the MIPS firmware, it had at least a few strings, which is how I was able to tell it was compressed; a hex dump showed chunks of human-readable text with garbage interrupting them.
> A hunch. After extensive amounts of time trying and failing to eyeball the compression algorithm from hexdumps of compressed code, and trying any decompression algorithm I could think of against it,
But they eventually could break through by reverse engineering the decompression code.
> Once I finally had a concise, sane description of the decompression algorithm in C, the algorithm turned out to be hilariously simple. I was also then able to figure out the origins of the compression algorithm; it's called LZSS
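The quoted write-up doesn't include code, but LZSS-family decompressors really are hilariously simple. Here is a minimal sketch of one common variant (flag bytes selecting literals vs. 12-bit-offset / 4-bit-length back-references); the `lzss_decompress` name and the exact bit layout are my own choices, not the scheme from that firmware:

```python
def lzss_decompress(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        flags = data[i]; i += 1
        for bit in range(8):                 # each flag byte governs 8 items
            if i >= len(data):
                break
            if flags & (1 << bit):           # flag set: literal byte
                out.append(data[i]); i += 1
            else:                            # flag clear: (offset, length) back-reference
                b1, b2 = data[i], data[i + 1]; i += 2
                offset = b1 | ((b2 & 0xF0) << 4)   # 12-bit distance back into output
                length = (b2 & 0x0F) + 3           # 4-bit length, minimum match of 3
                for _ in range(length):      # byte-by-byte copy so overlap works
                    out.append(out[-offset])
    return bytes(out)
```

For example, `bytes([0x07, 0x41, 0x42, 0x43, 0x03, 0x03])` encodes three literals `ABC` followed by a back-reference of offset 3 and length 6, decompressing to `ABCABCABC`. Real variants differ mainly in bit order, offset encoding (absolute window position vs. distance), and minimum match length, which is why eyeballing hexdumps rarely identifies them.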
I've collected a few links people have already posted to their own projects or write-ups here and elsewhere, but is there any single excellent resource for learning how to do this?
I've a number of dead and/or proprietary formats that I've always wanted to crack open, but I'm totally overwhelmed with where to start.
First, make sure that you know what the format is actually supposed to encode. For example, if some file weighs (say) 40 KB, it is unlikely to be an uncompressed raster image. The file name, if any, helps a lot to narrow the scope.
Second, you should have some understanding of similar file formats. I generally recommend studying PNG first, because it gives an example of both typical structured file formats and raster image formats. (Don't delve into the compression though---bitwise analysis is much harder.) This is also why you need to know what the format is for: many formats with the same goal tend to have similar structures.
Third, collect as many examples as possible. You can line them up to see commonalities and differences and spot patterns. Even better if you can actively generate different files. This is generally the last hope when you have run out of reasonable hypotheses.
Fourth, optimize the feedback loop. You will have to do a lot of hypothesizing, validation, and automation. You can't really optimize the number of iterations, but you can optimize the time for a single iteration. Use a comfortable scripting language with good binary-handling facilities. I tend to use vanilla Python with struct and build everything else on my own, but there are several libraries that help greatly if you don't feel like doing that.
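A first-pass probe in that vanilla-Python-plus-struct style might look like this. The `peek_header` helper and the 4-byte-magic-plus-three-u32 layout are guesses you would iterate on, not properties of any particular format:

```python
import struct

def peek_header(blob: bytes):
    """Split a candidate 16-byte header into a magic and three guessed u32 fields."""
    magic = blob[:4]
    # '<' means little-endian; flip to '>' when the resulting numbers look absurd
    fields = struct.unpack_from("<III", blob, 4)
    return magic, fields
```

Running something like this over every sample file at once, and eyeballing which "fields" stay constant, track the file size, or look like offsets, is exactly the fast hypothesize-validate loop being described.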
But first, it helps to have sample files to see recurring structures. Ideally, you also have access to software that generates these files. This allows you to deal with simpler files containing less information to reason about, make small changes within the program and compare the corresponding change(s) in the file.
I am trying to parse MS Access files using NodeJS/JavaScript. I last tried about 3 years ago and it was really tough going, with a lot of trial and error. I am able to parse some basic MS Access files, but need to figure out a way to get the whole database more reliably. My effort was here:
i've never attempted JS interop, so i don't know how much it sucks, but it definitely seems doable. it'll likely be an adventure on its own, but it's gotta be 1000x less fuckery than trying to reverse the binary format of Access
the saner option would probably just be a small .Net program that creates an endpoint for your JS
unless i'm assuming too much, and you're just doing it for sheer masochistic pleasure, and in that case: i salute you
I was wondering whether chatgpt can be used to read in the byte sequences and umm, do something.
It's a wonderful example of inductive reasoning, or generating general rules from a collection of specific examples.