Those are just image classifiers, though. They associate a given pixel pattern with the character string "table". They might be able to recognise this image as relating to both table and person, but most likely as competing possible classifications. They generally don't have a concept of relatedness between classifications. They have no spatial conceptual model relating concepts like 'above', 'beside', 'below', distance, etc.
They also have no idea what a table is, no general concept of furniture, or even of a table as a physical, three-dimensional object. They certainly have no idea how it relates to the human form or what "sitting" on something means. It's literally just pixel pattern => string("table"). That's it.
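To make that concrete, here is a rough sketch of what the last step of such a classifier amounts to, assuming a typical single-label model with a softmax output. The label list and the scores are made up for illustration; a real network would produce the scores from the pixels.

```python
import math

# Hypothetical label set; a real classifier would have many more entries.
LABELS = ["table", "person", "dog"]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits):
    """Return the winning label string and the full probability list."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs

# Made-up scores standing in for the network's output on a photo
# of a person sitting at a table.
label, probs = classify([2.0, 1.5, -1.0])
print(label, [round(p, 3) for p in probs])
```

In this sketch "table" and "person" compete for the same probability mass, and the result is one string plus a list of numbers. Nothing in that output says the person is at, on, or beside the table.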
There may be some experimental models that attempt such things, but the image classifiers that get all the press these days are very good at classification but nothing else.