This benchmark has done a wonderful job of marketing itself by picking a great name. It's largely irrelevant for LLMs, even though it's difficult.
Consider how much of the model is just noise for a task like this: each token carries very little information, while LLMs use very high-dimensional embeddings.
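
To put rough numbers on that mismatch, here is a back-of-envelope sketch. The alphabet size, embedding width, and 16-bit precision are all hypothetical values I picked for illustration; they are not taken from the benchmark or from any particular model. Raw storage is also only a crude stand-in for what a model can actually use, so treat the ratio as an order-of-magnitude intuition, not a measurement.

```python
import math


def bits_per_token(num_symbols: int) -> float:
    """Upper bound on information per token (bits), assuming a uniform alphabet."""
    return math.log2(num_symbols)


def embedding_storage_bits(embed_dim: int, bits_per_value: int = 16) -> int:
    """Raw storage of one embedding vector (bits); a crude proxy, not true capacity."""
    return embed_dim * bits_per_value


# Hypothetical values, for illustration only.
num_symbols = 16   # assumed size of the task's token alphabet
embed_dim = 4096   # assumed embedding width of the model

info = bits_per_token(num_symbols)
storage = embedding_storage_bits(embed_dim)
ratio = storage / info

print(f"information per token: at most {info:.1f} bits")
print(f"storage per embedding: {storage} bits")
print(f"storage / information: {ratio:.0f}x")
```

Under these assumed numbers, each token's vector has thousands of times more room than the token needs, which is the sense in which most of the representation is noise for a task like this.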