IMO, the "real" answer - the viable one :) - is a redesigned console model that supersets/encapsulates what already exists, so that things remain backward compatible while incorporating features that allow for progressive enhancement. The question is how to actually build that out: what to start with, what to do when and where, etc.
Probably the most important thing I'd start with is having each console be like an independent terminal server: make it so anything can watch all the stdio streams of everything attached to the PTY, make it possible to introspect the resulting terminal stream, etc. Then the TTY itself could be queried for the character cell grid, for individual characters, etc. It should also be possible to change arbitrary PTY+TTY settings [out from under whatever's using a given console].
By "anything can watch" I mean that there would be an actual "terminal server" somewhere, likely in a process that owned a bunch of ptys, and this would have an IPC API to do monitoring and so forth. Obviously security and permissions would need to be factored in.
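To make the "terminal server" idea a bit more concrete, here's a rough sketch of what one console inside such a server might track. Everything here (the `Console` class, `watch`, `feed`, `cell`, `screen_text`) is a made-up name for illustration, not an existing API. It assumes the server feeds in bytes it has read from the pty master (e.g. via `os.read` on an fd from `pty.openpty()`), and it deliberately ignores escape sequences, permissions and the IPC layer itself.

```python
from typing import Callable, List


class Console:
    """Hypothetical model of one console owned by a terminal server."""

    def __init__(self, rows: int = 24, cols: int = 80) -> None:
        self.rows, self.cols = rows, cols
        self.grid: List[List[str]] = [[" "] * cols for _ in range(rows)]
        self.row = 0
        self.col = 0
        self.watchers: List[Callable[[bytes], None]] = []

    def watch(self, callback: Callable[[bytes], None]) -> None:
        """Register an observer that receives every raw chunk of output."""
        self.watchers.append(callback)

    def feed(self, data: bytes) -> None:
        """Called by the (assumed) server loop with bytes read from the pty."""
        for callback in self.watchers:
            callback(data)
        # Simplification: decodes each chunk on its own, so a multibyte
        # character split across two chunks would be mangled.
        for ch in data.decode("utf-8", errors="replace"):
            self._put(ch)

    def _put(self, ch: str) -> None:
        if ch == "\r":
            self.col = 0
        elif ch == "\n":
            self._linefeed()
        elif ch.isprintable():
            if self.col >= self.cols:  # wrap at the right margin
                self.col = 0
                self._linefeed()
            self.grid[self.row][self.col] = ch
            self.col += 1
        # Other control characters are dropped; no escape parsing here.

    def _linefeed(self) -> None:
        if self.row == self.rows - 1:  # scroll: drop top line, add blank
            self.grid.pop(0)
            self.grid.append([" "] * self.cols)
        else:
            self.row += 1

    def cell(self, row: int, col: int) -> str:
        """Query a single character cell."""
        return self.grid[row][col]

    def screen_text(self) -> List[str]:
        """Read back what's on the screen, one string per row."""
        return ["".join(line).rstrip() for line in self.grid]
```

A watcher could be as simple as `console.watch(log.append)`, and "reading what's on the screen" becomes `console.screen_text()`. A real version would need an actual terminal emulator state machine behind `feed`, which is where most of the work would be.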
But this would roughly take the best from the UNIX side (line disciplines are kind of cool, having the PTY architecture Just Work with RS232 is... I understand the history, but it makes for an interesting current-day status quo, IMO), and then combine this with the best of the Win32 side (reading what's on the screen!!!!! Yes please!!).
I'm not sure how to build something sane that could incorporate graphics though. ReGIS and Sixel are... no. 8-bit cleanness is an unfortunately-probable requirement for portability (at least UTF-8 can be shooed away in broken environments with LC_ALL=C), but base64 encoding is equally a no. Referencing files (what Enlightenment's Terminology does) is a nononono. w3m-image's approach of taking over the X11 drawable associated with the terminal window is awesome in its hilarious terribleness. The best I can think of is a library that all image/UI operations would be delegated to, which would do some escape-sequence dances with the terminal (and whatever proxies were in the way?) to detect capabilities, and then either use out-of-channel communications (???) to send the image data over rapidly, or alternatively 8-bit or 7-bit encode the image data into the TTY stream (worst case scenario).
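As a very rough sketch of the fallback logic such a library might use: pick the best transport the terminal claims to support, and only fall back to 7-bit encoding in the worst case. All names here (`Transport`, `choose_transport`, `frame_inline`, the `img;...` header) are hypothetical. The capability set is assumed to have already been obtained by some query/response exchange that isn't shown, the DCS ... ST framing is borrowed only as a plausible wrapper, and base64 stands in for "some 7-bit encoding" purely because it's in the standard library.

```python
import base64
from enum import Enum
from typing import Set


class Transport(Enum):
    OUT_OF_BAND = "oob"   # side channel, mechanism left undefined here
    EIGHT_BIT = "8bit"    # raw bytes inline in the TTY stream
    SEVEN_BIT = "7bit"    # worst case: 7-bit-safe encoding inline


DCS = b"\x1bP"   # Device Control String introducer (ESC P)
ST = b"\x1b\\"   # String Terminator (ESC \)


def choose_transport(caps: Set[str]) -> Transport:
    """Pick the best transport from an already-detected capability set."""
    if "oob" in caps:
        return Transport.OUT_OF_BAND
    if "8bit" in caps:
        return Transport.EIGHT_BIT
    return Transport.SEVEN_BIT


def frame_inline(image: bytes, transport: Transport) -> bytes:
    """Wrap image bytes for sending inside the TTY stream.

    The header carries the body length, so a reader does not have to
    scan raw 8-bit data for the terminator.
    """
    if transport is Transport.OUT_OF_BAND:
        raise ValueError("out-of-band data is not framed into the stream")
    if transport is Transport.EIGHT_BIT:
        body = image
    else:
        body = base64.b64encode(image)  # stand-in 7-bit encoding
    header = b"img;%s;%d;" % (transport.value.encode("ascii"), len(body))
    return DCS + header + body + ST
```

With an empty capability set, `choose_transport(set())` lands on the 7-bit path and `frame_inline` produces pure ASCII. How proxies in the middle would be detected, and what the out-of-channel path would actually look like, is exactly the part that's still unresolved.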
This is a bit hazy/sketchily laid out, but it's something I've been thinking about for several years. When I started out pondering all of this stuff circa 2006 I was most definitely all over the place :) I'm a bit better now but I still have a lot of unresolved ideas/things. I'm trying to build a from-scratch UX that provides a more flexible model for using terminals and browsing the web, but in a way that's backward-compatible and not "different to the point of being boring".
I've (very slowly...) come to understand that slow and progressive enhancement is the only viable path forward (that people will adopt), so I'm trying to understand the best way to do that.