Not a single byte at a time, but individual fields (float32, int32, string, etc.). Yes, I expected a much more significant speed-up as well. It's probably because a lot of that code was driven by reflection-type techniques.
Curiously, IronPython did better than anything else (though it was still slow). I haven't tried Jython.
Compiling the whole thing with Cython was less effective than PyPy.
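
To make "individual fields" concrete, here is a rough sketch of that kind of reading. It is not the actual code under discussion: the record layout (a float32, an int32, then a length-prefixed UTF-8 string) and all function names are made up for illustration. The second reader unpacks the fixed-size part in one call with a precompiled `struct.Struct`, which may cut per-call overhead, though I haven't measured it here.

```python
import io
import struct

# Hypothetical layout, little-endian, no padding:
#   float32 value | int32 count | uint16 name length | name bytes (UTF-8)
HEADER = struct.Struct("<fiH")


def write_record(value, count, name):
    """Serialize one record in the hypothetical layout above."""
    raw = name.encode("utf-8")
    return HEADER.pack(value, count, len(raw)) + raw


def read_record_field_by_field(stream):
    """One read and one unpack call per field."""
    (value,) = struct.unpack("<f", stream.read(4))
    (count,) = struct.unpack("<i", stream.read(4))
    (length,) = struct.unpack("<H", stream.read(2))
    name = stream.read(length).decode("utf-8")
    return value, count, name


def read_record_batched(stream):
    """Unpack the whole fixed-size header in a single call."""
    value, count, length = HEADER.unpack(stream.read(HEADER.size))
    name = stream.read(length).decode("utf-8")
    return value, count, name


if __name__ == "__main__":
    data = write_record(1.5, 7, "abc")
    print(read_record_field_by_field(io.BytesIO(data)))
    print(read_record_batched(io.BytesIO(data)))
```

Both readers return the same tuple for the same bytes; the only difference is how many calls are made per record.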