Not to mention VLIW wastes CPU instruction cache on instructions that are never run.
It is no accident that CPUs and compilers gravitated towards RISC.
VLIW also doesn't really waste instruction cache if your compiler is smart and aligns branch targets to an instruction word. You still blow the pipeline on a mispredicted branch, but at least in Microsoft's case there is a way for the compiler to encode a prediction, which it is arguably in a better position to make. This goes double if you use profile-guided optimization. If the claims of the Mill guys are true, then even the "wasted CPU instruction cache" doesn't hurt performance.
CPUs and compilers are gravitating towards lots of things. x86 and ARM aren't the only instruction sets. VLIW is alive and healthy on a lot of DSPs. Russian CPUs that use VLIW are in active use. AMD GPUs used VLIW for a while (and some variants still do). You can even get VLIW-based microcontrollers for cheap.
IMO compilers and CPUs may gravitate towards RISC in the short term, as it is closer to CISC in complexity. VLIW needs compilers to be smart, and languages to be smart too, for optimal use. Rust, for example, would be capable of really taking advantage of VLIW, but LLVM doesn't support that complexity (yet, though there is some work).
In the long term, my prediction is that VLIW will dominate by virtue of being simpler, faster, and more efficient.
Example: the best order to run a sequence of instructions could depend on which inputs happen to be in the L1 cache at the time. This could differ from one execution to the next. There's no way for a static compiler to get this right.
On a VLIW, far more microarchitectural features would necessarily be exposed, and the compiler would have to take advantage of them.
Your example can be handled by a compiler optimizing for cache locality, something compilers already do. It simply means that if your code accesses memory address X in two places, the compiler will try to keep those two accesses close together.
Making a simple prediction about cache contents is tractable for compilers and, as mentioned, already happens. You can build a dependency graph over memory accesses and then reduce the distance between connected nodes along the execution path. And since this is VLIW and we may be able to tell the CPU which branch is likely, we can even skip this in favor of optimizing the happy path harder.
A modern optimizer is a very complex beast; it can certainly know some things about the state of the program at runtime and will make assumptions about it (enable -O3 if you want to see this). It can almost certainly optimize your example at least minimally on more aggressive settings.
To my knowledge, the CPU pipeline does not schedule based on L1 cache contents: checking what's in L1 is still relatively expensive, and the lookahead in the instruction queue is usually limited to a few hundred instructions. Hitting L1 is still an order of magnitude slower than hitting a register, and far too expensive to do for every memory-access instruction. The pipeline instead favors branch predictors and register dependencies, which are simpler and faster, along with some history of previously executed code.
Nah. You can have variable-length VLIW instructions.
And you can design an out-of-order VLIW that still retains the significant advantage of decoding many operations at once.