I checked the code and found the issue. It's a result of Gemini's larger context window.
Basically, the Foxtrot scraping library sends the page in chunks, and the chunk size is capped at each model's maximum context length — for Gemini (the Lite variant) that's 1,000,000 input tokens, versus 128,000 for GPT-4o-mini.
Typically you won't need every token on the page, and sending a million tokens when 100,000 would do is wasteful in both cost and runtime, and can also hurt accuracy.
I'm going to re-run the benchmarks with a cap on prompt size for large-context models like Gemini.
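Roughly, the fix looks like taking the smaller of the model's context window and a benchmark-wide cap when sizing chunks. This is just a sketch, not Foxtrot's actual internals — the `MAX_PROMPT_TOKENS` value and the chars-per-token heuristic are assumptions for illustration:

```python
# Hypothetical sketch: cap chunk size below a model's full context window.
# MAX_PROMPT_TOKENS and the 4-chars-per-token estimate are illustrative
# assumptions, not values from the Foxtrot library.

MODEL_CONTEXT_TOKENS = {
    "gemini-lite": 1_000_000,
    "gpt-4o-mini": 128_000,
}

MAX_PROMPT_TOKENS = 100_000  # benchmark-wide cap (assumed value)

def chunk_size_tokens(model: str) -> int:
    """Use the smaller of the model's context window and the cap."""
    return min(MODEL_CONTEXT_TOKENS[model], MAX_PROMPT_TOKENS)

def chunk_page(text: str, model: str, chars_per_token: int = 4) -> list[str]:
    """Split page text into chunks, sizing them by a rough
    chars-per-token estimate instead of a real tokenizer."""
    max_chars = chunk_size_tokens(model) * chars_per_token
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

With a cap like this, Gemini and GPT-4o-mini end up seeing comparably sized prompts, so the benchmark measures the models rather than their context windows.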