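From a mathematical point of view, this is the distinction the literature draws between a "filtering" distribution, which conditions only on the past, and a "smoothing" distribution, which conditions on both the past and the future. The smoothing distribution is at least as powerful, and usually strictly more so.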
In theory, the intuition is that the smoothing distribution has access to all the information the filtering distribution has, plus some additional information, so its minimum achievable loss is no higher than that of the filtering distribution.
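As a toy illustration of that inequality, here is a minimal sketch that computes both optima exactly for a three-token sequence. The 2-state Markov chain, its probabilities, and the names `joint` and `conditional_entropy` are assumptions made up for this example, not anything from the literature. The "loss" is the conditional entropy of the middle token, i.e. the best cross-entropy any predictor could reach given its context.

```python
import itertools
import math

# Hypothetical 2-state Markov chain, chosen only for illustration.
INIT = [0.5, 0.5]
TRANS = [[0.9, 0.1],
         [0.2, 0.8]]

def joint(x1, x2, x3):
    """Probability of the length-3 sequence (x1, x2, x3)."""
    return INIT[x1] * TRANS[x1][x2] * TRANS[x2][x3]

def conditional_entropy(context_positions):
    """H(x2 | context) in bits; context is a list of positions from {0, 2}."""
    p_ctx_x2 = {}  # p(context, x2)
    p_ctx = {}     # p(context)
    for seq in itertools.product([0, 1], repeat=3):
        p = joint(*seq)
        ctx = tuple(seq[i] for i in context_positions)
        p_ctx_x2[(ctx, seq[1])] = p_ctx_x2.get((ctx, seq[1]), 0.0) + p
        p_ctx[ctx] = p_ctx.get(ctx, 0.0) + p
    h = 0.0
    for (ctx, _), p in p_ctx_x2.items():
        if p > 0:
            h -= p * math.log2(p / p_ctx[ctx])
    return h

# Filtering-style context: only the past token x1.
filtering_loss = conditional_entropy([0])
# Smoothing-style context: the past token x1 and the future token x3.
smoothing_loss = conditional_entropy([0, 2])

print(f"filtering optimum: {filtering_loss:.4f} bits")
print(f"smoothing optimum: {smoothing_loss:.4f} bits")
```

This only shows where the optimum sits with unlimited capacity; it says nothing about how hard that optimum is to reach with a fixed parameter budget.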
In practice, because the smoothing input space is much bigger, a model with the same number of parameters may not reach a better score: with diffusion we are tackling a much harder problem (the whole problem), whereas with autoregressive models we are taking a shortcut, and probably one that humans are biased toward too (communication evolved so that it can be serialized and exchanged orally).