My conclusion was that "Word2Vec sucks". Probably a lot of people tried the same thing and either came to that conclusion or thought they had done something wrong. People don't usually publish negative results, so I've never read about anybody doing it. It takes bravery. Great work!
In my mind, the diagrams on this page are a disgrace:
https://nlp.stanford.edu/projects/glove/
What it comes down to is that they are projecting down from an N=50 space to an N=2 space. You have a lot of dimensions to play with, so if you have, say, 20 points, you can find some projection where those points land wherever you want, even if they started out as a random point cloud.
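A minimal sketch of that claim (the point counts and dimensions are my assumptions, matching the numbers above): take 20 random points in 50 dimensions and solve for a linear projection to 2-D that places them at arbitrary chosen targets. Because there are more dimensions than points, an exact solution generically exists.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, n_dims = 20, 50

X = rng.normal(size=(n_points, n_dims))  # a random point cloud in 50-D
Y = rng.normal(size=(n_points, 2))       # any 2-D layout we want to "find"

# Solve X @ W = Y for a 50x2 projection W. With only 20 points in 50
# dimensions the system is underdetermined, so least squares finds an
# exact fit: the random cloud projects onto whatever picture we chose.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.allclose(X @ W, Y))  # True: the projection hits the targets exactly
```

With 100 points in 50 dimensions the same trick would no longer work exactly, which is why the cherry-picked small examples in such diagrams look so much cleaner than a real evaluation.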
It's really a lie, because if they tried to map 100 cities to their ZIP codes it wouldn't work at all; that's what I found when trying to build classifiers.
To be fair to word2vec (or rather, to word embeddings in general), I think both require a fair amount of sentence context.
On a semi-related note, one of the reasons I haven't tackled smells yet is that so much of what's written about smell is perfume/cologne marketing speak. Asking gpt-4o for lists of smells yields things like "the smell of jasmine and tuberose [...] evokes the mystery and elegance of a moonlit garden". I'd hope modern models would recognize that this is nonsense, but I can imagine a word2vec model ending up with bizarre associations.
At most, with current models, you can average embeddings together.
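For concreteness, a sketch of what averaging embeddings means mechanically, using made-up toy vectors standing in for learned word2vec/GloVe vectors (the words and values are illustrative assumptions, not real embeddings):

```python
import numpy as np

# Toy stand-ins for learned word vectors; real ones would be looked up
# from a trained word2vec/GloVe model and have 50+ dimensions.
emb = {
    "royal":      np.array([0.9, 0.1, 0.3]),
    "adult_male": np.array([0.2, 0.8, 0.4]),
}

# Element-wise mean of the two vectors: one candidate for a combined
# concept like "king" under the averaging interpretation.
combined = np.mean([emb["royal"], emb["adult_male"]], axis=0)
print(combined)  # [0.55 0.45 0.35]
```

In practice one would then look for the nearest existing word vector to `combined` (by cosine similarity) to see whether the average lands anywhere meaningful.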
When you refer to averaging embeddings together, do you mean averaging a bunch of sentences/words for "male" to get a general concept vector or do you mean averaging two different words, like "royal" and "adult male", to get to the combined concept, say "king"?