From the introduction, it seems like their main concern is that "smart speakers record audio from their environment and potentially share this data with other parties over the Internet—even when they should not". They provide two examples of how that happens:
1. "Smart speaker vendors or third-parties may infer users’ sensitive physical (e.g., age, health) and psychological (e.g., mood, confidence) traits from their voice."
2. "The set of questions and commands issued to a smart speaker can reveal sensitive information about users’ states of mind, interests, and concerns."
They also mention that "smart speaker platforms host malicious third-party apps" and "record users’ private conversations without their knowledge", but those are cited as examples of prior research, so they seem to serve more as background than as claims this paper is trying to prove.
Point 2 is the one you're focusing on, and yeah, that's not surprising. You'd expect Amazon to build a profile on you based on the stuff you ask Echo to do (though the ethics of this certainly warrants discussion).
Point 1 would be the surprising one: that smart speakers infer information about people from their voice itself, rather than from the commands they issue.
Their methodology seems to be to create multiple personas and compare the sorts of ads they get. To prove that information is inferred from traits of the voice rather than from the words in the commands, they would need two personas that issue identical commands but in different voices (female vs. male, healthy vs. smoker, something like that). From skimming section 3, it doesn't seem like they did that, so I'm forced to conclude that what this paper actually proves (if their statistical methods are valid) is that Amazon builds an advertising profile based on your interests as expressed through which commands you send the device.
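To make the controlled comparison I'm describing concrete: with two personas that send identical commands in different voices, you could tally the ads each one receives by category and test whether the two distributions differ. Here's a minimal sketch using a hand-rolled Pearson chi-square test; the persona names, categories, and counts are entirely made up for illustration, and this is not the paper's actual analysis.

```python
def chi_square(counts_a, counts_b):
    """Pearson chi-square statistic for two observed count vectors,
    categories aligned by index (assumes every category appears at
    least once across the two personas, so no division by zero)."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    grand = total_a + total_b
    stat = 0.0
    for a, b in zip(counts_a, counts_b):
        col = a + b
        # Expected counts if ad category were independent of persona
        exp_a = total_a * col / grand
        exp_b = total_b * col / grand
        stat += (a - exp_a) ** 2 / exp_a + (b - exp_b) ** 2 / exp_b
    return stat

# Made-up ad counts per category (health, finance, retail) for two
# personas that issued the same commands in different voices.
persona_voice_1 = [30, 10, 60]
persona_voice_2 = [12, 28, 60]

stat = chi_square(persona_voice_1, persona_voice_2)
# df = (2 - 1) * (3 - 1) = 2; the 0.05 critical value is about 5.99,
# so a statistic above that would suggest voice-dependent ad targeting.
print(stat > 5.99)
```

The point of the design is that, because the commands are held constant, any significant difference in ad distributions could only be attributed to the voice, which is exactly the isolation the paper's setup (as far as I can tell from section 3) doesn't achieve.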