This is true but not relevant to the discussion: whatever side channel problems you'd have in pure assembly, you're practically certain to have more of them if you implement crypto in a high-level language with an aggressive optimizer.
Is there any way to write programs that bypass the processor's front end for most the processors in use? If not, it what sense is it true that "you can go further than assembler"?
This is often, though not always, the case with deeply embedded processors of the kind BearSSL seems to be designed for. There's generally some way to get predictable cycle-exact performance on them because it matters for some embedded applications.