Well, then I guess it's not clear enough for you. For me it is clear as day. The Actor model is inherently focused on concurrently processing messages acting on objects, and the very section you refer to, where it talks of objects, shows an example mentioning actors and is followed by a section on Actors.
The inbox/outbox terminology is further inherently tied to async message passing with a message pump/queue, and so on.
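To make that concrete, here is a rough sketch of what I mean by an inbox drained by a message pump. This is not from the spec: the Actor class, its send/stop methods and the message dicts are names I've made up for illustration, and a real implementation would deliver over HTTP rather than through an in-process queue.

```python
import queue
import threading


class Actor:
    """Hypothetical minimal actor: private state plus an inbox drained by a pump thread."""

    def __init__(self, name):
        self.name = name
        self.inbox = queue.Queue()  # messages delivered to this actor
        self.received = []          # private state, only touched by the pump thread
        self._pump = threading.Thread(target=self._run, daemon=True)
        self._pump.start()

    def send(self, message):
        # Asynchronous: the sender enqueues the message and returns without waiting.
        self.inbox.put(message)

    def _run(self):
        # Message pump: handle one message at a time, in arrival order.
        while True:
            message = self.inbox.get()
            if message is None:     # sentinel asking the pump to stop
                break
            self.received.append(message)

    def stop(self):
        # Ask the pump to finish, then wait until it has drained the inbox.
        self.inbox.put(None)
        self._pump.join()


if __name__ == "__main__":
    alice = Actor("alice")
    alice.send({"type": "Note", "content": "hello"})
    alice.send({"type": "Like", "object": "note-1"})
    alice.stop()
    print([m["type"] for m in alice.received])
```

The point is only that senders never touch the actor's state directly. They drop messages in the inbox and the actor processes them in its own time, which is the model the inbox/outbox vocabulary suggests.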
You may not be sure it was conceptualised this way, but that is irrelevant. The spec as written describes an architecture that someone with experience in the field would most reasonably interpret that way, because it is inherent to the vocabulary used, and interpreting it that way yields a significantly more cohesive architecture. You're free to interpret it otherwise, but it has now been pointed out to you that treating it this way makes it simpler to understand. If you choose to ignore that advice, then so be it.
I might be sympathetic to the argument that it may not be spelled out clearly enough for someone without experience in this area or unfamiliar with the actor model, but there's a question of to what extent a W3C working group should feel a need to assume readers are not familiar with these kinds of models.