Something to keep an eye on is that Whisper is strongly bound to processing a 30-second window at a time. So if you send it 30 seconds of audio, and it decodes it, then you send it another one second of audio, the only sensible way it can work is to have it reprocess seconds 2s-30s in addition to the new data at 31s. If there was a way to have it just process the update, then there's every possibility it could avoid a lot of work.
I suspect that's what people are getting at by saying it's "not streaming": it's built as a batch process but, under some circumstances, you can run it fast enough to get away with pretending that it isn't.