I am quite confident that it wouldn't be too hard to build an ad-detection ML model with near-perfect accuracy. That said, an approach based on algorithmically detecting repeated segments of lengths consistent with ad spots would work just as well, if not better.
P.S. One thing I thought was really interesting was that the classifier, which was only ever shown a binary label (ad/not-an-ad), learnt an embedding that grouped together entire categories of content across TV networks and geographies (studio news, weather, traffic reports, etc.).
I like the idea of looking at pixels, just because that's the sort of info that gets sent down the HDMI cable and will always be available.
To your question on segment lengths: ad spots have specific, predefined durations. In the US these are typically 15s, 30s and 60s (sometimes 45s). This property could be exploited to detect ads. Consider, for example, a video segment that's exactly 30s long and is repeated many times across multiple TV channels. It is very likely to be an ad.
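Here's a rough sketch of that heuristic, assuming the stream has already been cut into segments and each one fingerprinted somehow. The `Segment` fields, the 0.5s tolerance and the repeat/channel thresholds are all made up for illustration, not something I've tuned:

```python
from collections import defaultdict
from dataclasses import dataclass

# Typical US ad spot lengths in seconds (15s, 30s, 60s, sometimes 45s).
AD_SPOT_LENGTHS_S = (15.0, 30.0, 45.0, 60.0)


@dataclass(frozen=True)
class Segment:
    fingerprint: str   # hypothetical content hash identifying the segment
    channel: str       # channel the segment was observed on
    duration_s: float  # measured duration in seconds


def matches_spot_length(duration_s, tolerance_s=0.5):
    """True if the duration is within tolerance of a standard spot length."""
    return any(abs(duration_s - spot) <= tolerance_s for spot in AD_SPOT_LENGTHS_S)


def likely_ads(segments, min_repeats=3, min_channels=2, tolerance_s=0.5):
    """Return fingerprints that have an ad-like length and repeat across channels."""
    airings = defaultdict(list)
    for seg in segments:
        # Only segments with an ad-like duration are counted at all.
        if matches_spot_length(seg.duration_s, tolerance_s):
            airings[seg.fingerprint].append(seg.channel)
    return {
        fp
        for fp, channels in airings.items()
        if len(channels) >= min_repeats and len(set(channels)) >= min_channels
    }


observed = [
    Segment("spot-a", "ch1", 30.0),
    Segment("spot-a", "ch2", 30.1),
    Segment("spot-a", "ch1", 29.9),
    Segment("news-intro", "ch1", 22.0),  # repeats, but not an ad-like length
    Segment("news-intro", "ch1", 22.0),
    Segment("news-intro", "ch1", 22.0),
    Segment("promo-b", "ch3", 15.0),     # ad-like length, but seen only once
]
print(likely_ads(observed))  # {'spot-a'}
```

The hard part in practice is the segmentation and fingerprinting, which this skips entirely; the length-plus-repetition check itself is just a group-by.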