To put that another way, in the current marketplace, what kinds of programs are so worthy of optimization that it's economically sensible to have a human spend several days hand-tuning machine language to squeeze out every CPU cycle?
In addition, look at how popular netbooks are becoming. The Intel Atom is an in-order CPU. Imagine a hyperthreaded, 1.6 GHz 486...
On the iPhone it's even worse. It's got a decent vector unit, but the CPU is very slow. You'll see great wins by doing your 3D math yourself.
As we continue to become multicore, I could imagine somebody shaving a couple cycles out of the core message passing routines, though you're almost certainly bus bound in those situations...
Computers are getting smaller and people want more out of them; assembly language is back in style!
I'm guessing that most languages with built-in foreign function interfaces (like Python's ctypes) have similar thunking layers.
https://bugzilla.mozilla.org/buglist.cgi?query_format=advanc...
When I worked on it, our simulator was an order of magnitude faster than commercially available simulators (Synopsys VCS and Cadence NC-Verilog), which cost between $1k and $10k per license per year. I worked for a tiny hardware startup; established hardware companies use a few orders of magnitude more compute power than we did, so the equation is probably at least four orders of magnitude further in favor of doing assembly optimization in a commercial simulator.
You can get 5x, 10x, 20x, or more performance just by using the vector instructions the CPU gives you. Until a magic compiler appears that can make proper use of them (read: never), hand-coded assembly will remain essential for almost any performance-critical application, especially multimedia processing.
And even then you'll often end up significantly worse off than if you wrote the assembly by hand.
A run of Intel's compiler on the C versions of our DSP functions resulted in a grand total of one vectorization, which was done terribly, too.
I am told that Fortran does better than C here (its stricter aliasing rules give the compiler more freedom to vectorize); there is a reason it is still widely used in the scientific computing community, after all.
This is also part of the reasoning behind C99's new restrict keyword.
If you don't write the compiler, you have to make assumptions about when and how it can/will use those instructions, and often you assume wrong, particularly across compiler upgrades.
The first reason is the whole category of optimizations that the compiler is worse than a human at (like register allocation) or cannot do effectively at all (messing with calling conventions, computed jumps).
The second reason is more subtle: whenever you abstract yourself away from some part of a problem, you inherently create a less efficient solution.
For example, intrinsics mean that you don't have to manually allocate registers. But this also means that if your algorithm uses too many registers and it would be more efficient to modify it to require fewer (and thus not need spills), you will have no way of knowing such a thing. By insulating yourself from that layer of complexity, you've also limited your ability to make higher-level optimizations that improve lower-level performance.
This applies on practically every level possible: any method of abstraction, no matter how well designed, will always in some fashion reduce the maximum performance you can achieve. Of course, this doesn't mean abstraction is bad--it provides an often-useful tradeoff between developer time and performance.
"But for the absolute core of the system—the inner loops of the index servers, for instance—very small gains in performance are worth an awful lot. When you have that many machines running the same piece of code, if you can make it even a few percent faster, then you’ve done something that has real benefits, financially and environmentally. So there is some code that you want to write in assembly language."
When publishers/developers don't give a bleep, the fans take up the task of fixing the bugs themselves. I happen to run one such project in my spare time (for C&C: Red Alert 2), and it's amazing how much stuff is broken. It's not as "serious" as other projects mentioned here, but still a reason to know ASM. (And a good way to see bad programming practices in action :) )
I'm sure people still play Master of Magic, a strategy game from 1994. I've been playing it on and off since it came out, and I began to think - is there something wrong with me that I like this old game so much? I mean, surely there must be newer games that are better. I showed it to my teenage brother in the mid 2000's and he loved it too. I had my non-computer-gaming friends blown away by the original Heroes of Might and Magic (1995).
I think the world needs better means for preservation of old computer games.
(I love what I do, but my twelve year old self would be disgusted that I'm not writing games.)
Since a lot of the bugs therein may depend on a certain sequence of instructions, doing it in a high-level language doesn't make any sense.
For that matter, CUDA (and ATI's Bare-Metal Interface, which is similar) is more assembly-like than C-like in many ways. So even using the higher-level available language is still pretty much like assembly.
You tend to only write these things when you're going to be running a lot of elements through, so almost everything you do in these platforms is inner-loop, or you'd be using a different tool. So even small speed-ups tend to matter.
In all of xnu, not counting AES, there are ~17kloc in x86 assembly, most of it in osfmk/i386 --- where no normal developer is ever going to go. There are over 730kloc in C.
Others have covered the optimization side of things well so I won't repeat it, but there are tiny fragments of assembly all over the place -- they hold your system together.
One place I did this was various RSA Challenge attack clients.
I am now assistant-teaching a college course in low-level computer programming. It's an excellent course: the students reprogram a children's toy robot that uses the ARM processor. http://www.amazon.com/Little-Tikes-Giggles-Remote-Control/dp... They're getting up to speed very quickly on how to get hardware to actually do stuff.
Yes, I actually left Silicon Valley to do grad school. I haven't given up the principle of "do real stuff, see real results", though. I'm looking to design a couple fairly small homework assignments consisting of optimizing some ARM code. I want the examples to be real. Now mulling over which to do...
Additionally, low-level hardware interfacing is often done with hand-coded assembly, because it is easier to "get right" than C on some of the crappy compiler toolchains you face.
Tangentially related but not quite the same, I work at a company that makes barcode recognition software and some of our most performance-sensitive areas use assembly. It is mostly C, though.
If you want simplicity, you look at lisps; homoiconicity is perhaps the most elegant, simple concept known in computing. It may be more complex in practice (many more layers above the bare metal), but in concept it's simply beautiful.