No, it doesn't turn into this. Those two leftover bytes plus a flag are kept inside the stream generator that transforms bytes into code points. Every time you pull from it, they are used as the initial accumulator of the fold that takes a chunk of bytes and yields a chunk of code points along with the updated accumulator. You don't need to inline it all into one giant transform.
Come on, this is how parser combinators work (in mature libraries, at least). The only slightly tricky part here is detecting leftover data in the pipeline.
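
To make that concrete, here is a rough sketch of the shape I mean, assuming the input is UTF-8 (so the leftover can be up to three bytes rather than two) and leaving out the flag. The names `decode_chunk` and `code_points` are made up for illustration, and the decoder skips overlong and surrogate checks, so treat it as a sketch and not a validating implementation.

```python
from typing import Iterable, Iterator, List, Tuple


def _seq_len(lead: int) -> int:
    """Length of the UTF-8 sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError(f"invalid lead byte 0x{lead:02x}")


def decode_chunk(leftover: bytes, chunk: bytes) -> Tuple[List[int], bytes]:
    """One fold step: (accumulator, chunk) -> (code points, new accumulator)."""
    data = leftover + chunk
    out: List[int] = []
    i = 0
    while i < len(data):
        n = _seq_len(data[i])
        if i + n > len(data):
            break  # incomplete sequence: carry it over to the next pull
        # Keep only the payload bits of the lead byte.
        cp = data[i] if n == 1 else data[i] & (0xFF >> (n + 1))
        for b in data[i + 1:i + n]:
            if b >> 6 != 0b10:
                raise ValueError(f"invalid continuation byte 0x{b:02x}")
            cp = (cp << 6) | (b & 0x3F)
        out.append(cp)
        i += n
    return out, data[i:]


def code_points(chunks: Iterable[bytes]) -> Iterator[List[int]]:
    """Stream stage: the leftover bytes live here, not in the caller."""
    leftover = b""
    for chunk in chunks:
        cps, leftover = decode_chunk(leftover, chunk)
        yield cps
    if leftover:
        # The "slightly tricky part": noticing data left in the pipeline.
        raise ValueError("input ended in the middle of a code point")
```

Usage: `list(code_points([b"h\xe2", b"\x82", b"\xac!"]))` splits the euro sign across three chunks, and the consumer never sees the partial bytes. The only place the leftover surfaces is the end-of-stream check.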