> My question is why can't LLMs include a sub-routine to check itself before answering.
Because an LLM run on its own doesn't work in a way that makes that possible.
Here is the debug output from my local instance of Mixtral 8x7B Instruct. My prompt was 'What is poop spelled backwards?' and it answered 'puoP'. Let's see how it got there, starting with how it split my prompt into tokens:
'What (3195)', ' is (349)', ' po (1627)', 'op (410)', ' sp (668)', 'elled (6099)', ' backwards (24324)', '? (28804)', '\n (13)', '### (27332)', ' Response (12107)', ': (28747)', '\n (13)',
It tokenized 'poop' as two tokens: 'po', number 1627, and 'op', number 410.
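To make that concrete, here is a small Python sketch built only from the pieces and IDs in the debug output above. It does not call a real tokenizer, and the name `prompt_tokens` is just for illustration. The point is that the model receives a list of integers, not letters:

```python
# (piece, id) pairs copied from the debug output above; no real tokenizer is called.
prompt_tokens = [
    ("What", 3195), (" is", 349), (" po", 1627), ("op", 410),
    (" sp", 668), ("elled", 6099), (" backwards", 24324), ("?", 28804),
]

# The model's input is this sequence of integers.
token_ids = [tok_id for _, tok_id in prompt_tokens]
print(token_ids)

# Joining the pieces recovers the text, but 'poop' only ever exists as two chunks.
text = "".join(piece for piece, _ in prompt_tokens)
print(text)
```

Nothing in that list of IDs exposes the individual letters p, o, o, p.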
Next it comes up with its response:
Generating (1 / 512 tokens) [(pu 4.43%) (The 66.62%) (po 11.96%) (p 4.99%)]
Generating (2 / 512 tokens) [(o 89.90%) (op 10.10%)]
Generating (3 / 512 tokens) [(P 100.00%)]
Generating (4 / 512 tokens) [( 100.00%)]
It picked 'pu' even though the model gave that token only about a 4% probability (the most likely candidate was 'The' at about 67%). For the second token it picked 'o' (about 90%) over 'op' (about 10%). The last letter, 'P', had a 100% probability.
Output: puoP
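How can a ~4% token get picked at all? Here is a rough Python sketch of the first step in the log, assuming the sampler simply draws in proportion to the listed probabilities. That is a simplification: real samplers also apply settings such as temperature and top-k/top-p, which are ignored here, and the four listed candidates only add up to about 88%, so the weights get renormalized.

```python
import random
from collections import Counter

# Candidates and probabilities for the first generated token, from the log above.
# They sum to ~88%; random.choices renormalizes the weights (a simplification).
candidates = ["pu", "The", "po", "p"]
weights = [4.43, 66.62, 11.96, 4.99]

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = rng.choices(candidates, weights=weights, k=10_000)
counts = Counter(draws)

for token in candidates:
    print(f"{token!r}: picked {counts[token]} times out of {len(draws)}")
```

'The' wins most of the time, but 'pu' still comes up in a noticeable share of draws. Once it is picked, the model just continues from there.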
At no point did it handle 'puoP' as a complete word. It emitted 'pu', 'o', and 'P' one after another, and it does not know what 'puoP' is. It has no way of evaluating whether that is the right answer. You would need a separate process to do that.
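For this one prompt, a process like that could be ordinary code running outside the model. A minimal sketch, where `is_reversed` is a hypothetical helper I made up for illustration and not part of any LLM toolkit:

```python
def is_reversed(word: str, answer: str) -> bool:
    """Return True if answer is word spelled backwards (case-sensitive)."""
    return answer == word[::-1]


print(is_reversed("poop", "puoP"))  # False: the model's answer fails the check
print(is_reversed("poop", "poop"))  # True: 'poop' reads the same backwards
```

This only works because the task has a mechanical check. The check lives outside the LLM, which is the point.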