What they seem to be measuring is whether an LLM can be built that reliably retrieves a known correct answer on request. That's an information retrieval problem, and in fact they solve it by adding "Memory Experts", which are basically data storage.
It's not clear that this helps either with replies that require synthesizing disparate information or with detecting that the training data does not contain the information needed to construct a reply.