Yes, the 'KV cache' (imo not a real novelty; everyone was doing this before someone came up with a term to make it sound cool) is an optimization so that you don't have to
recompute what the model was thinking (the keys and values for all the prior words) every time you decode a new word.
But that's exactly what I'm saying: the model has access to what it was thinking when it generated the previous words; it does not start from scratch. If you don't have the KV cache, you still have to regenerate what it was thinking for the previous words, so that when it generates the next word it can look back at them. Either way the model sees the same thing; the cache only saves you the recomputation. Does that make sense? I'm not great at talking about this stuff in words.
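Since code may say it better than I can, here is a minimal sketch of the point, assuming a single toy attention head in plain Python. Everything in it (`D`, `matvec`, `attend`, `decode_without_cache`, `decode_with_cache`, the random weights and embeddings) is made up for illustration and is not any library's API. Real models stack many layers, but causal masking means earlier positions never depend on later ones, so I believe the same argument carries over.

```python
import math
import random

D = 4  # toy hidden size (hypothetical)

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Softmax attention of one query over all keys/values seen so far."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(q))]

random.seed(0)
# Random stand-ins for learned projection weights and token embeddings.
Wq, Wk, Wv = ([[random.gauss(0, 1) for _ in range(D)] for _ in range(D)]
              for _ in range(3))
tokens = [[random.gauss(0, 1) for _ in range(D)] for _ in range(5)]

def decode_without_cache(tokens):
    outputs, kv_projections = [], 0
    for t in range(len(tokens)):
        # Recompute keys and values for every token seen so far.
        keys = [matvec(Wk, x) for x in tokens[:t + 1]]
        values = [matvec(Wv, x) for x in tokens[:t + 1]]
        kv_projections += t + 1
        outputs.append(attend(matvec(Wq, tokens[t]), keys, values))
    return outputs, kv_projections

def decode_with_cache(tokens):
    outputs, kv_projections = [], 0
    k_cache, v_cache = [], []
    for x in tokens:
        # Only the newest token is projected; older entries are reused.
        k_cache.append(matvec(Wk, x))
        v_cache.append(matvec(Wv, x))
        kv_projections += 1
        outputs.append(attend(matvec(Wq, x), k_cache, v_cache))
    return outputs, kv_projections

slow, slow_work = decode_without_cache(tokens)
fast, fast_work = decode_with_cache(tokens)
same = all(math.isclose(a, b) for s, f in zip(slow, fast) for a, b in zip(s, f))
print(same, slow_work, fast_work)  # prints: True 15 5
```

Both loops produce the same outputs at every step; the only difference is that the uncached version redoes 15 key/value projections for 5 tokens where the cached one does 5. In both cases the new word attends over the same "what it was thinking" for the earlier words.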