They are unreliable at that. An LLM can't reliably judge another LLM's outputs without access to the environment where those actions are executed, and without enough time for the outcomes that actually provide the feedback signal to materialize.
For example, I was working on evaluation for an AI agent. The agent was about 80% correct, and the LLM judge was only about 80% accurate in assessing the agent. How can we have self-correcting AI when the judgment it corrects against is itself unreliable? Hence my idea: only environment outcomes over a sufficient time span can truly validate the work. But that is also expensive and risky.
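To see how those two error rates compound, here is a rough back-of-the-envelope sketch (numbers and symmetric-error assumption are mine, based on the ~80%/~80% figures above, not measurements from that project):

```python
# Assumed illustrative numbers: agent ~80% correct, judge ~80% accurate,
# with errors assumed symmetric across "correct" and "wrong" outputs.
p_agent_correct = 0.80   # agent produces a correct output
p_judge_correct = 0.80   # judge's verdict matches ground truth

# Joint probabilities of the four outcomes
tp = p_agent_correct * p_judge_correct                # correct output, judged correct
fn = p_agent_correct * (1 - p_judge_correct)          # correct output, judged wrong
fp = (1 - p_agent_correct) * (1 - p_judge_correct)    # wrong output, judged correct
tn = (1 - p_agent_correct) * p_judge_correct          # wrong output, judged wrong

# How trustworthy is each verdict?
print(f"P(actually correct | judged correct) = {tp / (tp + fp):.2f}")  # ~0.94
print(f"P(actually correct | judged wrong)   = {fn / (fn + tn):.2f}")  # 0.50
```

Under those assumptions, a "pass" verdict is fairly trustworthy, but a "fail" verdict is a coin flip, and the "fail" verdicts are exactly the signal the agent would need in order to self-correct.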