The main difference between gemini (the protocol) and http isn't that a gemini page must contain only text and be deprived of all other kinds of content. The gemini protocol does not stop you from interacting with specific filetypes or other protocols, any more than http does. It's a protocol. It's up to the people implementing gemini clients to decide how to handle such externalities.
E.g. the Lagrange browser downloads images and displays them inline in the page, which I like. I haven't tried audio files, but I imagine they could work similarly if necessary.
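To make that concrete, here is a rough sketch of the kind of decision a client gets to make. A gemini response starts with a header line of the form "<STATUS> <META>", and for a success status (2x) the META part is the MIME type of what follows. Everything else below is my own illustration, not how Lagrange or any other client actually does it: the names parse_header, decide and INLINE_TYPES are hypothetical, and the policy (show text and images inline, hand the rest to an external program) is just one choice a user or client author might make.

```python
# Hypothetical user policy: MIME families this client renders inline.
# A different user (or client) could pick a different set.
INLINE_TYPES = {"text", "image"}


def parse_header(line: str) -> tuple[int, str]:
    """Split a gemini response header '<STATUS> <META>' into its two parts."""
    status, _, meta = line.rstrip("\r\n").partition(" ")
    return int(status), meta


def decide(header: str) -> str:
    """Return what this (hypothetical) client would do with the response."""
    status, meta = parse_header(header)
    if not 20 <= status <= 29:
        # Only 2x responses carry a body with a MIME type in META.
        return "not-success"
    # Drop parameters such as '; charset=utf-8' and keep the bare MIME type.
    mime = meta.split(";")[0].strip().lower()
    family = mime.split("/")[0]
    return "inline" if family in INLINE_TYPES else "external"


if __name__ == "__main__":
    for header in ("20 text/gemini; charset=utf-8\r\n",
                   "20 image/png\r\n",
                   "20 audio/ogg\r\n"):
        print(header.strip(), "->", decide(header))
```

The protocol itself only says "here is a success status and a MIME type". Whether an image shows up inline, gets saved to disk, or opens in another program is entirely down to the table at the top, which lives on the user's side.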
Along similar lines, the gemtext specification is not intended as some sort of viable html alternative. It has completely different design goals.
So no, the point isn't about limiting the kind of content you can interact with. The point is that enforcing a degree of separation between what is "content" (i.e. text) and what is "externalities" makes simplicity of text (both in terms of presentation and production) the main driver, and leaves externalities up to implementations (which may be as simple as handing them off to an actual external program, such as an http browser for http content). More importantly, this puts the user in control of how these potentially abusable externalities are handled, rather than ceding that control to the webmaster, whose agenda may not match yours as a user.
And some (including myself) actually like this separation.