1) And the atime updates? I bring this up because Apache does make sure it updates them, which costs it at least 1 ms per file it serves. Yes, those accesses get accumulated, but they still impose limits and cause slowness. Caching the file will, incidentally, make watch descriptors stop working for other apps.
2) So it's identical, meaning it's not an advantage of static files.
3) Actually, the CPU data cache is going to be the bottleneck when you have 2x10G going into a single server (because you're using 92%+ of your data bandwidth just pumping data into the network cards). Not reading from or writing to main memory when you don't absolutely have to is what's going to make the difference in speed. If you get to the point where you can expand templates and compress them without hitting main memory (not that hard on chips with 16 MB of third-level cache), it will make a huge difference.
And of course this is assuming it actually fits in physical memory. If templates fit, but expanded they don't, templates win hands down (so expanding per-user stats pages into static files, for example, is guaranteed to cost you more than it buys you).
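To put rough numbers on the templates-versus-expanded point, here is a minimal Python sketch. The page markup, the field names, and the 1000-user count are all made up for illustration; it only compares how much text you have to hold in each case.

```python
from string import Template

# Hypothetical per-user stats page; markup and field names are invented.
PAGE = Template(
    "<html><body><h1>Stats for $user</h1>"
    + "<p>Lots of shared boilerplate markup.</p>" * 50
    + "<p>Visits: $visits</p></body></html>"
)

def template_footprint(users):
    """Characters held when keeping one template plus the raw per-user values."""
    data = sum(len(name) + len(str(visits)) for name, visits in users)
    return len(PAGE.template) + data

def expanded_footprint(users):
    """Characters held when every user's page is expanded into a static file."""
    return sum(
        len(PAGE.substitute(user=name, visits=visits)) for name, visits in users
    )

# Assumed workload: 1000 users with one counter each.
users = [(f"user{i}", i * 7) for i in range(1000)]
print(template_footprint(users), expanded_footprint(users))
```

The shared boilerplate is stored once in the template case and once per user in the expanded case, so the gap grows linearly with the number of users.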
Elephants.
1) True, but it's going to cost you in roundtrips.
2) Yes, they are. The optimization is to have an "initialize" call that fetches all the data you need to render the user's full page. The dynamic backend calls that function, stores the result at the bottom of the GWT serving page, and boom, you go from n roundtrips to 1. Since a roundtrip is going to cost you at least 80-90 ms in any case except a local LAN, this is one of those optimizations without which you can forget about having your page displayed in less than 400-500 ms.
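The "initialize" trick above can be sketched roughly like this in Python. Everything named here is hypothetical: `load_initial_data`, the `initialData` variable, the `app.nocache.js` script name, and the 85 ms figure (picked as the midpoint of the 80-90 ms range). It is not GWT's own API, just the shape of the idea.

```python
import json

ROUNDTRIP_MS = 85  # assumed: midpoint of the 80-90 ms range quoted above

def load_initial_data(user_id):
    """Hypothetical 'initialize' call: gather everything the page needs at once."""
    return {"user": user_id, "inbox": ["hello"], "settings": {"theme": "dark"}}

def render_host_page(user_id):
    """Serve the host page with the initial data already embedded at the bottom."""
    # Escape "</" so the JSON payload cannot close the script tag early.
    payload = json.dumps(load_initial_data(user_id)).replace("</", "<\\/")
    return (
        "<html><body><div id='app'></div>"
        f"<script>var initialData = {payload};</script>"
        "<script src='app.nocache.js'></script>"
        "</body></html>"
    )

def network_latency_ms(roundtrips):
    """Latency spent purely on waiting for the network, ignoring server time."""
    return roundtrips * ROUNDTRIP_MS

print(network_latency_ms(5), "ms with five separate calls")
print(network_latency_ms(1), "ms with the data embedded in the page")
```

With, say, five separate data calls, that is 5 x 85 = 425 ms of pure network waiting before anything renders, against 85 ms when the client reads `initialData` straight out of the page it already has.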
Dynamic serving gets a bad rap because people run 10 MySQL queries against a database server (meaning a different physical machine) with caching explicitly disabled, and then complain that it actually takes time. It's ridiculous. Looking at some PHP code and the speed complaints on the forums that accompany it, it feels like putting your grandmother's dead grandmother behind the wheel of a Ferrari and complaining it doesn't move fast.
Having small programs generate data on cue is always going to be faster than reading the actual data. I don't get why this isn't obvious to everyone. Hell, it's how (and why) compression works.