What kind of approach did you take? I was thinking of something like rewind.ai, or any program that automatically screenshots your screen at a set interval (or that starts from a recorded video and splits it into individual frames later), and then having a vision-capable model (ideally one specialized in UIs) describe this set of images in order to build a dataset of image-tags-description entries and the like.
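To make the idea a bit more concrete, here is a rough sketch of the pipeline I have in mind, not a real implementation. Everything in it is an assumption for illustration: `sample_every`, `build_dataset`, `write_jsonl`, and the `Record` fields are hypothetical names, and `fake_describe` is only a stand-in for whatever vision-capable model would actually produce the tags and description. Capturing the screenshots is left out; the sketch starts from a list of frame paths.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


@dataclass
class Record:
    """One dataset entry: image path, tags, and a free-text description."""
    image: str
    tags: List[str]
    description: str


def sample_every(frames: Iterable[str], step: int) -> List[str]:
    """Keep every `step`-th frame, mimicking screenshots at a set interval."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return list(frames)[::step]


def build_dataset(
    frames: Iterable[str],
    describe: Callable[[str], Tuple[List[str], str]],
    step: int = 1,
) -> List[Record]:
    """Run `describe` on each sampled frame and collect the results."""
    records = []
    for path in sample_every(frames, step):
        tags, description = describe(path)
        records.append(Record(image=path, tags=tags, description=description))
    return records


def write_jsonl(records: Iterable[Record], out_path: str) -> None:
    """Write one JSON object per line so the dataset is easy to stream."""
    with open(out_path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(asdict(record)) + "\n")


def fake_describe(path: str) -> Tuple[List[str], str]:
    """Placeholder (assumption): a real version would call a vision model."""
    name = Path(path).stem
    return ["ui", name], f"Placeholder description of {name}"
```

Usage would be something like `write_jsonl(build_dataset(frame_paths, fake_describe, step=2), "dataset.jsonl")`, swapping `fake_describe` for a real model call. Keeping the describer as a plain callback means the sampling and the dataset format do not depend on which model ends up being used.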