Stencil check rejections based on overdraw don't hurt as much as you'd think. Think of each one as a clip() (or discard), which is a single instruction. The GPU pipeline is optimized for this.
The approach you linked to is very well thought out, but each font still does per-pixel processing for a Bézier curve, which is orders of magnitude more expensive than a clip(). Never mind the added dependent read via the LUT and the tracing step.
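To make the difference in per-pixel work concrete, here is a rough CPU-side sketch in Python. It is not the linked technique's shader. `clip_like` and `quad_bezier_crossings` are hypothetical names of mine, and the crossing test is just one common way to evaluate a quadratic Bézier per pixel. The LUT fetch and tracing step are not modelled at all.

```python
import math

def clip_like(value):
    """Emulates the test behind HLSL clip(x): the pixel is dropped when x < 0.

    Returns True when the pixel survives. This is one comparison.
    """
    return value >= 0.0

def quad_bezier_crossings(p0, p1, p2, px, py):
    """Count crossings of a ray from (px, py) toward +x with a quadratic Bezier.

    Hypothetical stand-in for the kind of per-pixel curve work a
    curve-based glyph shader has to do; p0, p1, p2 are (x, y) tuples.
    """
    # y(t) = a*t^2 + b*t + y0; solve y(t) = py for t.
    a = p0[1] - 2.0 * p1[1] + p2[1]
    b = 2.0 * (p1[1] - p0[1])
    c = p0[1] - py

    roots = []
    if abs(a) < 1e-12:
        # Degenerate case: the curve is linear in y.
        if abs(b) > 1e-12:
            roots.append(-c / b)
    else:
        disc = b * b - 4.0 * a * c
        if disc >= 0.0:
            s = math.sqrt(disc)
            roots.append((-b - s) / (2.0 * a))
            roots.append((-b + s) / (2.0 * a))

    count = 0
    for t in roots:
        if 0.0 <= t < 1.0:
            # Evaluate x(t) and keep only crossings to the right of the pixel.
            u = 1.0 - t
            x = u * u * p0[0] + 2.0 * u * t * p1[0] + t * t * p2[0]
            if x > px:
                count += 1
    return count

# One arch-shaped curve from (0, 0) to (2, 0) with its control point at (1, 2).
curve = ((0.0, 0.0), (1.0, 2.0), (2.0, 0.0))
survives = clip_like(0.3)
crossings = quad_bezier_crossings(*curve, 1.0, 0.5)
```

Even in this stripped-down form, the curve path needs a square root, a couple of divides and a polynomial evaluation for every curve a pixel is tested against, while the clip-style rejection is a single compare.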