It didn't get "swimming to a dollar" right either. I think it doesn't "understand" spatial descriptions unless it happens to find an image that matches the description.
It definitely struggles with relationships between objects and often mixes them up (e.g. in this case, printing the baby on the dollar bill instead of showing it swimming toward the bill).