According to the latest ImageNet standings [2], ViT appears to have slipped to second place in top-1 accuracy. CoAtNet-7 is the new leader, but only by a slight margin, and seemingly at the cost of a significantly larger model.
[1] Scaling Vision Transformers https://paperswithcode.com/paper/scaling-vision-transformers
[2] https://paperswithcode.com/sota/image-classification-on-imag...