I was asking about indirect branches and the Branch Target Buffer. There are two main cases where you make indirect jumps or indirect function calls in ELF C/C++ executables. The first is shared library function calls: an IP-relative call to a Procedure Linkage Table (PLT) entry, which contains an indirect jump through a Global Offset Table (GOT) entry. The other common case is C++ virtual function calls, where you load the vtable entry into a register and immediately make an indirect function call through that register. You'd like to be able to start speculatively executing the target of that jump (library call) or virtual function call while the fetch of the GOT/vtable entry is still in progress.
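To make the two cases concrete, here is a minimal C++ sketch. The names `fake_got`, `real_impl`, `call_via_plt`, and `call_virtual` are all made up for illustration; on a real system the PLT stub and GOT slot are generated by the toolchain and filled in by the dynamic linker, and an optimizer may inline or devirtualize calls this simple.

```cpp
#include <cassert>

// Hypothetical stand-in for a GOT: a table of function pointers that the
// dynamic linker would normally fill in at load time or on first call.
using fn_t = int (*)(int);

int real_impl(int x) { return x + 1; }  // plays the role of the library function

fn_t fake_got[1] = { real_impl };

// Emulates what a PLT stub does: load the target address from the GOT slot,
// then transfer control through it (an indirect branch).
int call_via_plt(int x) {
    assert(fake_got[0] != nullptr);  // a real slot is always populated before use
    return fake_got[0](x);
}

struct Base {
    virtual ~Base() = default;
    virtual int do_my_thing(int x) const { return x; }
};

struct Derived : Base {
    int do_my_thing(int x) const override { return x * 2; }
};

// Virtual call: load the vtable pointer from the object, load the slot for
// do_my_thing, then make an indirect call through that loaded address.
int call_virtual(const Base& b, int x) { return b.do_my_thing(x); }
```

In both functions the branch target is only known after a memory load completes, which is exactly the window where the front end would like a prediction.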
Using only information resident in the architectural registers and the decoded instruction stream, there's no heuristic that can accurately tell you where printf (or this->do_my_thing()) lives in the address space. If the address for printf isn't in the BTB, do you just stall the pipeline while loading the proper GOT entry? I thought that, in general, designs just hashed the instruction address to pick a BTB entry and blindly assumed that whatever the entry held was the target address (and, of course, issued the GOT read and checked the assumption as soon as the real GOT entry arrived). The alternative would be to stall the whole pipeline until the GOT could be read anyway, so you might as well do something, unless you're really concerned about power consumption. Even in extremely power-limited situations, assuming hash collisions in the BTB are rare might be less power-hungry than checking the source address in the BTB entry on each and every indirect branch.
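Here is a toy model of the "hash the instruction address and blindly trust the entry" scheme I have in mind. It is only a sketch: the table size, the index function, and the names `UntaggedBtb`, `predict`, and `resolve` are my own assumptions, not a description of any real core.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy direct-mapped BTB that stores only target addresses (no source tags).
struct UntaggedBtb {
    static constexpr std::size_t kEntries = 16;    // hypothetical size
    static_assert((kEntries & (kEntries - 1)) == 0, "size must be a power of two");

    std::array<std::uint64_t, kEntries> target{};  // 0 means "never trained"

    // Assumed hash: drop the low two bits, keep the next log2(kEntries) bits.
    static std::size_t index(std::uint64_t pc) { return (pc >> 2) & (kEntries - 1); }

    // Blindly return whatever the hashed slot holds, even if another branch wrote it.
    std::uint64_t predict(std::uint64_t pc) const { return target[index(pc)]; }

    // Called once the real target (e.g. the GOT entry) has been loaded.
    // Returns true if the earlier speculation was correct, then retrains the slot.
    bool resolve(std::uint64_t pc, std::uint64_t actual) {
        assert(actual != 0);  // 0 is reserved as the "untrained" marker in this model
        const bool correct = (predict(pc) == actual);
        target[index(pc)] = actual;
        return correct;
    }
};
```

With this index function, branches at 0x1000 and 0x1040 land in the same slot, so each one evicts the other's target and the next prediction for the evicted branch is silently wrong until `resolve` catches it.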
Storing both the source and destination addresses in the BTB entry means twice as much space used, and extra latency and power usage in checking the source address. Does anyone here have good knowledge of when it's worth it to store both the source and destination address in the branch target buffer, instead of just storing the target address and presuming hash collisions are rare?
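For comparison, a sketch of the variant that stores the source address alongside the target. Again, the sizes and the names `TaggedBtb`, `predict`, and `update` are assumptions for illustration. I store the full source address to match the "twice as much space" framing above; a design could presumably keep only part of it, which I have not modeled.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy direct-mapped BTB whose entries remember which branch trained them.
struct TaggedBtb {
    static constexpr std::size_t kEntries = 16;  // hypothetical size

    struct Entry {
        std::uint64_t source;  // address of the branch that wrote this entry
        std::uint64_t target;  // where that branch went last time
        bool valid;
    };

    std::array<Entry, kEntries> entries{};       // value-initialized: all invalid

    static std::size_t index(std::uint64_t pc) { return (pc >> 2) & (kEntries - 1); }

    // Returns true and fills `out` only when the slot was trained by this exact
    // branch; an aliasing branch gets "no prediction" instead of a wrong target.
    bool predict(std::uint64_t pc, std::uint64_t& out) const {
        const Entry& e = entries[index(pc)];
        if (e.valid && e.source == pc) {
            out = e.target;
            return true;
        }
        return false;
    }

    // Record the resolved target for this branch, evicting any previous occupant.
    void update(std::uint64_t pc, std::uint64_t actual) {
        Entry& e = entries[index(pc)];
        e.source = pc;
        e.target = actual;
        e.valid = true;
    }
};

// Each entry is at least twice the size of a bare target address.
static_assert(sizeof(TaggedBtb::Entry) >= 2 * sizeof(std::uint64_t),
              "tagged entry costs at least double the storage");
```

The trade is visible here: the compare in `predict` runs on every lookup and the entry is at least twice as large, but a collision becomes a detectable miss rather than a speculative jump to some other branch's target.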
On another side note: has anyone heard of an architecture that supports IP-relative indirect function calls? Such an addressing mode would allow making a function call through the GOT entry without having to load the PLT entry into the instruction cache. I'm guessing that since there's a fair overhead for shared library calls, the most performance-sensitive code already avoids dynamic library calls, so there's not much to be gained in exchange for the added complexity of the extra call addressing mode.