In general, if you're using any other programming paradigm (i.e., 90+% of all software), you don't need tons of cheap threads; your application probably needs at most one per core. In that case, you aren't constantly context switching (except to legitimately multitask with other programs), and you aren't using much kernel memory to hold thread state (because you only have a few threads, not a ton). So you should just use the kernel as it was intended.
My assumption in reading the article is that the author was very much talking about the 90+% case (he didn't really say what he was talking about in the article).
There actually is another case though, which I think is what the author is really getting at. What if you have a small number of threads that need to be extremely high performance, and they have extremely short critical sections (which is not necessarily the common case across most applications)? Then you would not want your threads to suspend in the kernel every time there is a tiny bit of contention. You'd rather have them spin for a few cycles until they can actually get the lock... or use hardware features (like the compare-and-exchange instructions) to sidestep the lock entirely.
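That kind of spin-wait can be sketched with C11 atomics (the type and function names here are mine, purely for illustration; real implementations add backoff, and spinning is only sensible when critical sections are tiny):

```c
#include <stdatomic.h>

/* A minimal spinlock sketch: threads busy-wait in user space
 * instead of suspending in the kernel on contention. */
typedef struct { atomic_flag held; } spinlock_t;

static void spin_lock(spinlock_t *l) {
    /* test-and-set returns the previous value; loop until we saw "clear" */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;  /* busy-wait: burns CPU, but never enters the kernel */
}

static void spin_unlock(spinlock_t *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

The acquire/release ordering is what makes the critical section actually act as one: writes made while holding the lock are visible to the next thread that acquires it.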
To do either of those, you pretty much need to code it yourself or (better) use a third-party library. That said, pthreads does actually have a spinlock option that never suspends: pthread_spin_lock (unlike a futex-backed mutex, which can put the thread to sleep), though not every platform implements it.
In fact, formally speaking, there are lock-free algorithms, which have no lock but can require unbounded retries before a given thread gets to access the data (the guarantee is only that some thread is always making progress). And then you have wait-free algorithms, which guarantee every thread gets access within a bounded number of steps.
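The distinction shows up even in something as trivial as incrementing a counter (a toy example, sketched with C11 atomics): a compare-and-exchange retry loop is lock-free, since a thread can lose the race arbitrarily many times, while a single fetch-and-add is wait-free on hardware that supports it, since each thread finishes in one step.

```c
#include <stdatomic.h>

/* Lock-free: a CAS retry loop. This thread may retry unboundedly
 * under contention, but every failed CAS means some other thread
 * succeeded, so the system as a whole makes progress. */
static void increment_lock_free(atomic_int *n) {
    int old = atomic_load(n);
    while (!atomic_compare_exchange_weak(n, &old, old + 1))
        ;  /* on failure, `old` is refreshed with the current value */
}

/* Wait-free: bounded steps for every thread. On common hardware
 * this compiles down to a single atomic add instruction. */
static void increment_wait_free(atomic_int *n) {
    atomic_fetch_add(n, 1);
}
```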