It uses 100% CPU, true but when the duration of the lock is extremely small (i.e. nanoseconds->microseconds) the total CPU usage is less than arranging for an OS level context-switch.
In other words, you use it when synchronising with hardware or when implementing test-and-set primitives for higher level mechanisms. Crucially, the time that the lock is held for must be very short.
Given those restrictions and use cases you get a very efficient low latency locking mechanism.