Was there ever a point in the Itanium's history where there were Itaniums that ran mainstream software with better performance than equivalently priced x64 processors?
But I guess you said mainstream. So unless you count database engines, I suppose the answer is "No."
Today you can get the same vector performance using SSE4 and AVX. Almost all of Itanium's good stuff has been rolled into Xeon.
I am not an expert on computer history, but my feelings on the matter are as follows:
It's hard for certain domains, like handling millions of web requests. For most computational stuff where you're just blowing through regularly-shaped numerical computation (like for example ML, or signal processing), it's not that hard, but arguably the compilers of the time were still not quite up to it (there's a lot of neat stuff that's getting worked into the LLVM pluggable architecture these days). Of course ML wasn't really a thing back then, and intel didn't seem interested in putting itaniums into cell towers.
One way to think of the OOO and branch predict processing that current x86 (and arm) do is that they are doing on-the-fly re-JITing of the code. There is a lot of silicon dedicated to doing the right thing and avoiding highly costly branch mispredicts, etc. During itanium's heyday, there was a premium of performance over efficiency. Now everyone wants power efficiency (since that is now often a cost bottleneck). Besides which, for other reasons Itanium wasn't as power efficient as (ideally) the chosen architecture could have achieved.
So don't do it at compile time? That's really a very weak argument against the Itanium ISA, and honestly more of an argument against the AOT complication model. Take a runtime with a great JIT, like the JVM or V8, and teach it to emit instructions for the Itanium ISA. (As an added advantage these runtimes are extremely portable and can be run, with less optimizations, on other ISAs without issue.)
The problem, as always, is that nobody with money to spend ever wants to part with their existing software. (Likely written in C.) In 2001 Clang/LLVM didn't even exist, and I'm not familiar with any C compilers of the era that had so much as a rudimentary JIT.
The flaws of OOO and SpecEx are evident with the overhead required to secure a system (spectre, meltdown) in a nondeterministic computational environment, and there is certainly a power cost to effectively JITting your code on every clock cycle.
As the definition of performance is changing due to the topping out of moore's law and shifting paralellism from amdahl to gustafson, I think there is a real opportunity for non ooo, non specex in th future.
Well JITs do actually outperform AoT compiled code today. Java is faster than C in many workloads. Especially large scale server workloads with huge heaps.
Java can allocate/deallocate memory faster than C, and it can compact the heap in the process which improves locality.
Seems to me like if, in practice, JIT provided better performance then by now people would be rewriting their C/C++ code in Java and C# for speed.
It's a little bit faster, not faster by enough to matter. If you're going to rewrite C/C++ code for speed you'd go to Fortran or assembler, and even then you're unlikely to get enough of a speedup to be worth a rewrite.
New projects do use Java or C# rather than C/C++ though.
Not that it can't be done so much as getting programmers to accept it is can't be done.