To me this reads a bit confused, but perhaps I'm misreading it? In X11 terminology the server is the thing sitting in front of you (the one that draws to the screen), so no, you don't need the remote host to be running an X11 server.
You do need the program that draws to the screen, but I think it's fair to say the remote host is headless if it has neither a GPU nor any program to interface with one. All the remote host needs is code to talk to such a server over TCP or Unix domain sockets. And that code is tiny; even small computers without enough memory for a frame buffer can do it.
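To give a feel for how little the client side needs, here is a rough sketch of the very first step: finding the server's socket and building the connection setup request. The function names (`setup_request`, `display_address`, `connect`) are my own, the DISPLAY parsing ignores the less common forms, and I send no authorization data, so a real server may well refuse the connection.

```python
import socket
import struct


def pad4(data: bytes) -> bytes:
    """Pad to a multiple of 4 bytes, as the X11 wire format requires."""
    return data + b"\x00" * (-len(data) % 4)


def setup_request(auth_name: bytes = b"", auth_data: bytes = b"") -> bytes:
    """Build the initial X11 connection setup request (little-endian)."""
    header = struct.pack(
        "<BxHHHHxx",
        ord("l"),        # byte order: 'l' = least significant byte first
        11, 0,           # protocol major / minor version
        len(auth_name),  # length of authorization protocol name
        len(auth_data),  # length of authorization protocol data
    )
    return header + pad4(auth_name) + pad4(auth_data)


def display_address(display: str):
    """Map a DISPLAY string like ':0' or 'host:1.0' to a socket address."""
    host, _, rest = display.rpartition(":")
    number = int(rest.split(".")[0])  # drop the optional '.screen' suffix
    if host in ("", "unix"):
        return ("unix", f"/tmp/.X11-unix/X{number}")
    return ("tcp", (host, 6000 + number))


def connect(display: str) -> socket.socket:
    """Open the socket and send the setup request (no auth: an assumption)."""
    kind, address = display_address(display)
    if kind == "unix":
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(address)
    else:
        sock = socket.create_connection(address)
    sock.sendall(setup_request())
    return sock
```

Everything after that is more of the same: pack a few bytes, write them to the socket, occasionally read a reply. No frame buffer is involved on the client.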
> So I don't see how X11 is any better - it's just worse due to having abysmal performance. X11 was never designed for real world remote desktop usage - it just happens to have network transparency. So it's X11 that's a kludge for such scenario if anything.
I think X11 was actually pretty great at the time it was created. For example, clients can create ids themselves and use them in their requests (no round-trip to the server), and the server can hold large client bitmaps that the client can operate on. Poor client coding can still kill the performance over the network, though. As the worst offender, I once noticed VirtualBox doing a looooot of synchronous property requests during its startup instead of issuing them concurrently, stretching the startup time from seconds to a minute or more. (Whether it truly needed those properties in the first place is another question.)
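A rough sketch of both points, with made-up numbers. The server hands each client an id base and an id mask at connection setup, and the client picks ids inside that range on its own. The second function is just a back-of-the-envelope latency model that ignores transmit and processing time; `make_id_allocator` and `startup_ms` are names I invented, and the base, mask, request count and RTT are purely illustrative, not measurements of VirtualBox.

```python
def make_id_allocator(base: int, mask: int):
    """Return a function that hands out resource ids without asking the server.

    base and mask stand in for the resource-id-base and resource-id-mask
    values a client receives in the setup reply (values below are hypothetical).
    """
    step = mask & -mask  # lowest set bit of the mask
    next_offset = 0

    def alloc() -> int:
        nonlocal next_offset
        if next_offset > mask:
            raise RuntimeError("id range exhausted")
        xid = base | next_offset
        next_offset += step
        return xid

    return alloc


def startup_ms(n_requests: int, rtt_ms: int, pipelined: bool) -> int:
    """Crude model: synchronous requests pay one round trip each,
    pipelined requests overlap and pay roughly one round trip in total."""
    return rtt_ms if pipelined else n_requests * rtt_ms


alloc = make_id_allocator(0x04000000, 0x001FFFFF)
window_id = alloc()  # usable in a request right away, no round trip needed

print(startup_ms(2000, 30, pipelined=False))  # 60000 ms, i.e. a minute
print(startup_ms(2000, 30, pipelined=True))   # 30 ms
```

So with a hypothetical 2000 property requests over a 30 ms link, the difference between waiting for each reply and firing them all off first is the difference between a minute and nothing you'd notice.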
Sending the complete interaction as a video stream? That's what I'd call a hack. That said, X11 should be modernized in various respects, for example to support more advanced encodings for media, controlled by the client.
In some sense the web is the direction I would have liked to see X11 go: still controlled by the client, but with some light server-side code (server in the X11 sense, i.e. on the display side) to render and interact with the widgets. This way clicks would get an immediate reaction, but you would still be interacting with the actual service running on the remote host, not just a local program.
(Another reason why I consider X11 better is the separation of the server and the compositor.)