> infgen is a deflate stream disassembler. It will read a gzip, zlib, or raw deflate stream, and output a readable description of the contents.
Thanks, this would have saved me a bunch of time and I wish I'd known this before writing this article. I'll add it in as a footnote later on.
For anyone who's curious, here's the output of the infgen program on the gzip file in the article (I remade the file so the timestamp is different)
$ cat test.out.gz | ./infgen -ddis
! infgen 2.6 output
!
time 1637813470 ! UTC Thu Nov 25 04:11:10 2021
name 'test.out
gzip
!
last ! 1
fixed ! 01
literal 'a ! 10010001
literal 'a ! 10010001
match 6 1 ! 00000 0000100
literal 10 ! 00111010
end ! 0000000
! stats literals 8.0 bits each (24/3)
! stats matches 66.7% (1 x 6.0)
! stats inout 5:6 (4) 9 0
! 00
! stats total inout 5:6 (4) 9
! stats total block average 9.0 uncompressed
! stats total block average 4.0 symbols
! stats total literals 8.0 bits each
! stats total matches 66.7% (1 x 6.0)
!
crc
length

That's why you should always use "gzip -n". Having the file name and timestamp in the compressed file header was IMO a mistake, and newer single-file compressors (bzip2, xz, zstd) AFAIK don't even have these fields in their native file formats.
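For anyone who wants to verify this locally, here's a quick comparison (file names are just examples; the header layout is from the gzip format, where the 4-byte little-endian mtime sits at offset 4):

```shell
printf 'hello\n' > test.txt
gzip -c  test.txt > with-meta.gz   # default: stores the input file's mtime (and name)
gzip -cn test.txt > no-meta.gz     # -n: omit the name, zero the mtime field
# mtime lives in bytes 4-7 of the gzip header (little-endian)
od -An -tx1 -j4 -N4 with-meta.gz
od -An -tx1 -j4 -N4 no-meta.gz     # all zeros with -n
```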
[0]: https://github.com/davecom/nflate [1]: https://commandlinefanatic.com/cgi-bin/showarticle.cgi?artic...
The main idea is that a stream is composed of blocks, and dynamic blocks can contain Huffman tables followed by an end-of-block symbol. Since there are no other symbols, such blocks won't produce any output when decompressed. So there are a few bytes in those tables that can be changed (e.g. to embed a little ASCII signature). However, this affects the codes present in the tables, which must pass some validations (so that decompressors can unambiguously match parsed bits to codes). I used an SMT solver to figure out values for the remaining bytes, so that the tables had valid code counts.
[0]: https://nevesnunes.github.io/blog/2021/09/29/Decompression-M...
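A simpler way to see the "block containing only an end-of-block symbol" idea, using fixed-Huffman and stored blocks instead of the dynamic blocks the post manipulates (those need no table construction, so no solver is required):

```python
import zlib

# The smallest raw deflate stream: BFINAL=1, BTYPE=01 (fixed Huffman codes),
# then the 7-bit end-of-block code 0000000. It decompresses to nothing.
assert zlib.decompress(b'\x03\x00', wbits=-15) == b''

# An empty *stored* block (BFINAL=0, BTYPE=00, LEN=0, NLEN=0xFFFF) ends
# byte-aligned, so it can be spliced in front of any raw deflate stream
# without shifting the bits that follow it.
raw = zlib.compress(b'hello')[2:-4]   # strip zlib header/trailer -> raw deflate
empty_stored = b'\x00\x00\x00\xff\xff'
assert zlib.decompress(empty_stored + raw, wbits=-15) == b'hello'
print('ok')
```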
I wish there were a widely accepted standard for describing arbitrary binary formats, so one could inspect/debug them, or even use it as a library to pack/unpack the data. I guess the 010 Editor comes close with its binary templates.
Don’t forget that the Huffman codes are packed LSB to MSB, but are to be interpreted as integers in big-endian format. Why this insanity?
Because that's what Phil Katz did. I vaguely remember reading about that years ago, and believe it had to do with how he implemented it; not as canonical table lookup, but as explicit tree-walking, since the former hadn't really been invented yet and the latter would fit within the memory constraints of a 16-bit system (DOS on a PC).
[1] https://fgiesen.wordpress.com/2018/02/19/reading-bits-in-far...
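To illustrate the two orders at once (a sketch, not any real decoder's code): header fields are read as plain LSB-first little-endian integers, while Huffman codes accumulate MSB-first, one bit at a time. Decoding the minimal two-byte stream b'\x03\x00' exercises both:

```python
import zlib

def bits_lsb_first(data):
    """Yield bits from each byte starting at the least significant bit,
    which is DEFLATE's packing order."""
    for byte in data:
        for i in range(8):
            yield (byte >> i) & 1

bits = bits_lsb_first(b'\x03\x00')
bfinal = next(bits)                     # header fields: LSB-first integers
btype = next(bits) | (next(bits) << 1)
code = 0
for _ in range(7):                      # Huffman codes: first bit read is the MSB
    code = (code << 1) | next(bits)
print(bfinal, btype, code)              # 1 1 0 -> final fixed-Huffman block, end-of-block
assert zlib.decompress(b'\x03\x00', wbits=-15) == b''
```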
( echo H4sIAAAAAAAAA5Pv5mCIeLzKnIGZgYHh/38GEYbX/+WxiDl1 ;
echo KB5hIIJ1TXfrBNeU3DznNGYG78SS1AKgEchsAEthdGVwAAAA ) |
base64 -d | gunzip > quine.gz
zcat quine.gz | diff - quine.gz

There’s some superfluous echo and subshell in there, incidentally; I like it better written this way:
base64 -d <<<H4sIAAAAAAAAA5Pv5mCIeLzKnIGZgYHh/38GEYbX/+WxiDl1KB5hIIJ1TXfrBNeU3DznNGYG78SS1AKgEchsAEthdGVwAAAA | gunzip > quine.gz
zcat quine.gz | diff - quine.gz

I personally think it's easiest if you write bits left to right, from LSB to MSB, which fits how humans write words, but I also get that when you're programming, you tend to think of the MSB being on the left and the LSB on the right (for bit-twiddling operations).
So I can see why people would write bits right to left too, to fit how you program, but frankly I don't think that's any easier to read on paper.