> This probably means my test is a little too niche.
> my python one needs to be down weighted or supplanted.
To me, this just proves your original statement. You can't know from benchmarks whether an AI can do your specific task. For that purpose they are relatively meaningless. You just have to try it.
I have AI fail spectacularly, and often, because I'm in a niche field. To me, in the context of AI, "niche" means "most of the code for this is proprietary or not in public repos, so it is statistically sparse".