> infgen is a deflate stream disassembler. It will read a gzip, zlib, or raw deflate stream, and output a readable description of the contents.
Thanks, this would have saved me a bunch of time and I wish I'd known this before writing this article. I'll add it in as a footnote later on.
For anyone who's curious, here's the output of the infgen program on the gzip file in the article (I remade the file so the timestamp is different)
$ cat test.out.gz | ./infgen -ddis
! infgen 2.6 output
!
time 1637813470 ! UTC Thu Nov 25 04:11:10 2021
name 'test.out
gzip
!
last ! 1
fixed ! 01
literal 'a ! 10010001
literal 'a ! 10010001
match 6 1 ! 00000 0000100
literal 10 ! 00111010
end ! 0000000
! stats literals 8.0 bits each (24/3)
! stats matches 66.7% (1 x 6.0)
! stats inout 5:6 (4) 9 0
! 00
! stats total inout 5:6 (4) 9
! stats total block average 9.0 uncompressed
! stats total block average 4.0 symbols
! stats total literals 8.0 bits each
! stats total matches 66.7% (1 x 6.0)
!
crc
length

That's why you should always use "gzip -n". Having the file name and timestamp in the compressed file header was IMO a mistake, and newer single-file compressors (bzip2, xz, zstd) AFAIK don't even have these fields in their native file formats.
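For anyone who wants to verify this locally, here's a quick comparison (file names are just examples; the header layout is from the gzip format, where the 4-byte little-endian mtime sits at offset 4):

```shell
printf 'hello\n' > test.txt
gzip -c  test.txt > with-meta.gz   # default: stores the input file's mtime (and name)
gzip -cn test.txt > no-meta.gz     # -n: omit the name, zero the mtime field
# mtime lives in bytes 4-7 of the gzip header (little-endian)
od -An -tx1 -j4 -N4 with-meta.gz
od -An -tx1 -j4 -N4 no-meta.gz     # all zeros with -n
```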
[0]: https://github.com/davecom/nflate [1]: https://commandlinefanatic.com/cgi-bin/showarticle.cgi?artic...
The main idea is that a stream is composed of blocks, and dynamic blocks can contain Huffman tables followed by an end-of-block symbol. Since there are no other symbols, such blocks won't produce any output when decompressed. So there are a few bytes in those tables that can be changed (e.g. to embed a little ASCII signature). However, this affects the codes present in the tables, which must pass some validations (so that decompressors can unambiguously match parsed bits to codes). I used an SMT solver to figure out values for the remaining bytes, so that the tables had valid code counts.
[0]: https://nevesnunes.github.io/blog/2021/09/29/Decompression-M...
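A simpler way to see the "block containing only an end-of-block symbol" idea, using fixed-Huffman and stored blocks instead of the dynamic blocks the post manipulates (those need no table construction, so no solver is required):

```python
import zlib

# The smallest raw deflate stream: BFINAL=1, BTYPE=01 (fixed Huffman codes),
# then the 7-bit end-of-block code 0000000. It decompresses to nothing.
assert zlib.decompress(b'\x03\x00', wbits=-15) == b''

# An empty *stored* block (BFINAL=0, BTYPE=00, LEN=0, NLEN=0xFFFF) ends
# byte-aligned, so it can be spliced in front of any raw deflate stream
# without shifting the bits that follow it.
raw = zlib.compress(b'hello')[2:-4]   # strip zlib header/trailer -> raw deflate
empty_stored = b'\x00\x00\x00\xff\xff'
assert zlib.decompress(empty_stored + raw, wbits=-15) == b'hello'
print('ok')
```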
I wish there were a widely accepted standard for describing arbitrary binary formats, so one could inspect/debug them, or even use it as a library to pack/unpack the data. I guess the 010 Editor comes close with its binary templates.
Don’t forget that the Huffman codes are packed LSB to MSB, but are to be interpreted as integers in big-endian format. Why this insanity?
Because that's what Phil Katz did. I vaguely remember reading about that years ago, and believe it had to do with how he implemented it; not as canonical table lookup, but as explicit tree-walking, since the former hadn't really been invented yet and the latter would fit within the memory constraints of a 16-bit system (DOS on a PC).
[1] https://fgiesen.wordpress.com/2018/02/19/reading-bits-in-far...
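To illustrate the two orders at once (a sketch, not any real decoder's code): header fields are read as plain LSB-first little-endian integers, while Huffman codes accumulate MSB-first, one bit at a time. Decoding the minimal two-byte stream b'\x03\x00' exercises both:

```python
import zlib

def bits_lsb_first(data):
    """Yield bits from each byte starting at the least significant bit,
    which is DEFLATE's packing order."""
    for byte in data:
        for i in range(8):
            yield (byte >> i) & 1

bits = bits_lsb_first(b'\x03\x00')
bfinal = next(bits)                     # header fields: LSB-first integers
btype = next(bits) | (next(bits) << 1)
code = 0
for _ in range(7):                      # Huffman codes: first bit read is the MSB
    code = (code << 1) | next(bits)
print(bfinal, btype, code)              # 1 1 0 -> final fixed-Huffman block, end-of-block
assert zlib.decompress(b'\x03\x00', wbits=-15) == b''
```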
( echo H4sIAAAAAAAAA5Pv5mCIeLzKnIGZgYHh/38GEYbX/+WxiDl1 ;
echo KB5hIIJ1TXfrBNeU3DznNGYG78SS1AKgEchsAEthdGVwAAAA ) |
base64 -d | gunzip > quine.gz
zcat quine.gz | diff - quine.gz

There’s some superfluous echo and subshell in there, incidentally; I like it better written this way:
base64 -d <<<H4sIAAAAAAAAA5Pv5mCIeLzKnIGZgYHh/38GEYbX/+WxiDl1KB5hIIJ1TXfrBNeU3DznNGYG78SS1AKgEchsAEthdGVwAAAA | gunzip > quine.gz
zcat quine.gz | diff - quine.gz

I personally think it's easiest if you write bits left to right, from LSB to MSB, which fits how humans write words, but I also get that when you're programming, you tend to think of the MSB being on the left and the LSB on the right (for bit-twiddling operations).
So I can see why people would write bits right to left too, to fit how you program, but frankly I don't think that's any easier to read on paper.