I agree that it is quite easy to grasp the format in terms of implementation.
It seems basically like writing a image VM that accepts byte code. I think that could really be a way to specify many file formats more concicesly. If e.g. you chose the correct automata/transducer class one can easily e.g. specify some hedge grammar based XML file format and get a binary representation. Starting from grammars as a spec it is typically more difficult if you want to derive an implementation.
However I e.g. wonder from reading the concrete spec why you e.g. cannot differentially change the alpha channel leading me to the question what happens if images have different alpha levels.