I'd recommend the Spotlight paper by Google[1]. They created some very interesting datasets for this purpose, and they mention an in-house screen-action-screen dataset that they don't appear to be releasing. Maybe owning Android has its advantages.
There's also a recent release by Hugging Face called IDEFICS[2] that claims to be an open-source reproduction of Flamingo (an earlier DeepMind model for few-shot multimodal task understanding), and I think this space will be heating up soon.
[1] https://research.google/pubs/pub52171/
[2] https://huggingface.co/blog/idefics