A 30% saving is roughly what I would expect.
Commonly JPEG separates the image into 1 channel of luma and 2 channels of chroma. It then downsamples each chroma channel to half the luma resolution in each dimension (4:2:0), so each chroma channel carries a quarter as many samples as the luma, meaning you have twice as much raw luma data as chroma data overall.
It then goes on to do a whole load of fun stuff (discrete cosine transforms, quantization and Huffman coding), but to a first approximation I'd expect luma and chroma to compress roughly similarly on average.
Since you've got twice as much luma data as chroma data, dropping the chroma only saves you about a third of the raw data, hence the ~30%.
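The sample-count arithmetic is easy to sanity-check. Here's a quick sketch assuming 4:2:0 subsampling (a 4:2:2 scheme would give a different split):

```python
def sample_counts(width, height):
    """Return (luma_samples, chroma_samples) for a 4:2:0 image."""
    luma = width * height                      # Y at full resolution
    chroma = 2 * (width // 2) * (height // 2)  # Cb + Cr, half resolution each way
    return luma, chroma

luma, chroma = sample_counts(1920, 1080)
total = luma + chroma
print(luma / total)    # luma's share of the raw samples: 2/3
print(chroma / total)  # dropping the chroma saves this fraction: 1/3
```

So for a 1080p frame, luma is two thirds of the raw samples and chroma the remaining third, which is where the ~30% ceiling comes from.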
Since the predictions from the neural network are not always accurate, you could then encode the prediction error in the compressed file and add it back on decompression. The error will generally be close to zero, so it should compress well.
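The predict-and-store-the-error idea is just residual coding. A minimal sketch, where `predict()` is a trivial "previous sample" stand-in for the neural network (an assumption for illustration, not the actual predictor):

```python
def predict(prev):
    # Hypothetical predictor: assume the previous sample repeats.
    # In the scheme above this would be the neural network's guess.
    return prev

def encode(samples):
    residuals, prev = [], 0
    for s in samples:
        residuals.append(s - predict(prev))  # store only the prediction error
        prev = s
    return residuals

def decode(residuals):
    samples, prev = [], 0
    for r in residuals:
        s = predict(prev) + r  # rebuild the exact original sample
        samples.append(s)
        prev = s
    return samples

data = [10, 11, 11, 12, 50, 51]
res = encode(data)       # [10, 1, 0, 1, 38, 1] - mostly near zero
assert decode(res) == data  # lossless round trip
```

Because the residuals cluster around zero, they have low entropy and compress far better than the raw samples would, while the round trip stays exactly lossless.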
At the end of the day, 30% better compression would be great... as long as the processing overhead isn't too high. I think that's the key factor.
(Most of this is from memory based on some hacking I did to be able to transcode videos to play on a cheap Chinese iPod Nano clone, which used an undocumented variant of MJPEG. The default quantization tables for luma and chroma are different from each other. The iPod Nano clone was using the standard quantization tables but the other way round (so using the luma table for chroma and the chroma table for luma). I can only imagine this was a bug in their code, as it was bound to reduce their compression ratio/image fidelity.)