The user-side code basically describes what the UI should look like in the current frame. Those "instructions" are recorded, and this recording is reasonably cheap.
The UI backend can then figure out how to "diff" the new instruction stream against the current internal state and render the result with the fewest changes to the screen.
However, some immediate mode UI systems came to the conclusion that it might actually be cheaper to just render most things from scratch instead of spending lots of processing resources to figure out what needs to be updated.
In conclusion: "Immediate Mode UI" doesn't say anything about how the UI is actually rendered or how the internals are implemented in general; it only describes how the public API works.
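To make the record-then-diff idea concrete, here is a minimal sketch in Python. It is not taken from any real immediate mode library: `Frame`, `Label`, `diff_frames` and `build_ui` are all made-up names, and a real backend would track far more than label text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    """One recorded UI instruction (hypothetical; only labels exist here)."""
    widget_id: str
    text: str


class Frame:
    """Records what the user code declares during a single frame."""

    def __init__(self):
        self.commands = []

    def label(self, widget_id, text):
        # Recording is just an append; nothing is rendered at this point.
        self.commands.append(Label(widget_id, text))


def diff_frames(previous, current):
    """Return (added, changed, removed) widget ids between two recordings."""
    old = {c.widget_id: c for c in previous}
    new = {c.widget_id: c for c in current}
    added = [i for i in new if i not in old]
    changed = [i for i in new if i in old and new[i] != old[i]]
    removed = [i for i in old if i not in new]
    return added, changed, removed


def build_ui(frame, clicks):
    # User-side code: describes the whole UI again on every frame.
    frame.label("title", "Demo")
    frame.label("count", f"Clicks: {clicks}")


frame1 = Frame()
build_ui(frame1, 0)
frame2 = Frame()
build_ui(frame2, 1)

# Only the "count" label differs, so only it would need repainting.
added, changed, removed = diff_frames(frame1.commands, frame2.commands)
```

A backend that prefers to redraw everything from scratch would simply skip `diff_frames` and paint `frame2.commands` directly; the user-side `build_ui` looks the same either way.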
If the public API requires you to issue a new paint command on every frame (every time the scrollbar is dragged), then regardless of whether the underlying rendering engine actually performs each of these paint commands, you still have to run through every item of the list (and so does the diffing code), making this an O(N) operation on every frame.
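A rough Python sketch of that cost, under the assumption that the API wants one call per list item per frame. `build_list`, `run_frame` and the step counter are hypothetical and only there to count work, not to model any particular library.

```python
def build_list(items, emit):
    # User-side code: one call per item, every frame.
    for item_id, text in enumerate(items):
        emit(item_id, text)


def run_frame(items, previous):
    """Record one frame and diff it against the previous recording.

    Returns (current_state, changed_ids, steps), where steps counts
    every item visit made by the user code and by the diffing code.
    """
    current = {}
    steps = 0

    def emit(item_id, text):
        nonlocal steps
        steps += 1
        current[item_id] = text

    build_list(items, emit)

    changed = []
    for item_id, text in current.items():
        steps += 1  # the diff also has to look at every item
        if previous.get(item_id) != text:
            changed.append(item_id)
    return current, changed, steps


items = [f"row {i}" for i in range(1000)]
state, first_changed, first_steps = run_frame(items, {})
# Second frame: nothing changed, so nothing needs repainting...
state, second_changed, second_steps = run_frame(items, state)
# ...but the work done is still proportional to the list length.
```

On the second frame `second_changed` is empty, yet `second_steps` is still twice the item count, which is the O(N) per frame being described.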
But I guess different UI frameworks have different solutions for this. Creating and updating a 1 million item list wouldn't be a cheap operation in a traditional UI system either.
IDK, works pretty well with Qt's item model system
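For anyone who hasn't used a pull-style model: the rough idea, as I understand it, is that the view asks the model only for the rows it is about to show. The sketch below is plain Python and deliberately not Qt's API (Qt's `QAbstractItemModel` is much richer); `RowModel`, `visible_rows` and the fixed row height are assumptions made for illustration.

```python
class RowModel:
    """Hypothetical pull-style model: rows are produced only on request."""

    def __init__(self, count):
        self._count = count
        self.data_calls = 0  # counts how many rows the view actually asked for

    def row_count(self):
        return self._count

    def data(self, row):
        self.data_calls += 1
        return f"row {row}"


def visible_rows(model, scroll_px, viewport_px, row_px):
    """Fetch only the rows that intersect the viewport.

    Assumes a fixed row height and scroll_px >= 0.
    """
    first = scroll_px // row_px
    # Round up so a partially visible last row is included.
    last = min(model.row_count(), (scroll_px + viewport_px + row_px - 1) // row_px)
    return [model.data(row) for row in range(first, last)]


model = RowModel(1_000_000)
rows = visible_rows(model, scroll_px=200_000, viewport_px=600, row_px=20)
```

With these numbers the view touches 30 rows out of a million, so the per-scroll cost depends on the viewport size rather than on the list length.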