I learned relatively recently that trig functions on the GPU are effectively free if you don’t use too many of them: there’s a separate hardware pipe, so they can execute in parallel with float adds and muls. There’s still extra latency, but it’ll hide if there’s enough other work in the vicinity.
Yep, these intrinsics are what I was referring to. And yes, the software versions won’t use the hardware trig unit; I would assume they’re written using an approximating spline and/or Newton’s method, probably mostly with adds and multiplies. Note that the loss of precision with these fast-math intrinsics isn’t very much; it’s usually 1 or 2 bits at most.
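To make the "mostly adds and multiplies" point concrete, here’s a minimal sketch in Python. It is not what any real math library does: I’m swapping in a plain degree-9 Taylor polynomial evaluated with Horner’s scheme, and assuming the input has already been reduced to [-pi/2, pi/2]. Real implementations would more likely use minimax coefficients plus a proper argument-reduction step. The names `sin_poly` and `max_abs_error` are made up for illustration.

```python
import math

# Taylor coefficients for sin(x): x - x^3/3! + x^5/5! - x^7/7! + x^9/9!
# (illustrative choice; real libraries typically use tuned minimax coefficients)
C3 = -1.0 / 6.0
C5 = 1.0 / 120.0
C7 = -1.0 / 5040.0
C9 = 1.0 / 362880.0


def sin_poly(x: float) -> float:
    """Approximate sin(x) using only multiplies and adds.

    Hypothetical helper: assumes x is already in [-pi/2, pi/2].
    """
    x2 = x * x
    # Horner's scheme on the polynomial in x^2, highest power first.
    p = C9
    p = p * x2 + C7
    p = p * x2 + C5
    p = p * x2 + C3
    p = p * x2 + 1.0
    return p * x


def max_abs_error(samples: int = 10001) -> float:
    """Largest |sin_poly(x) - math.sin(x)| over evenly spaced x in [-pi/2, pi/2]."""
    worst = 0.0
    for i in range(samples):
        x = -math.pi / 2 + math.pi * i / (samples - 1)
        worst = max(worst, abs(sin_poly(x) - math.sin(x)))
    return worst


if __name__ == "__main__":
    print("max abs error on [-pi/2, pi/2]:", max_abs_error())
```

The whole evaluation is five multiply-adds plus two multiplies, which is the kind of instruction mix that runs on the regular float pipe rather than the special-function unit.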
I’m not totally sure, but I think fast math usually comes with loss of support for denormals, which costs you a little representable range. Note that even if they had denormals, the absolute error listed in the chart is much bigger than the biggest denorm. So you don’t lose range out at the large end, but you might for very small numbers. That shouldn’t be a problem for sin/cos, since the result is never large, but maybe it could be an issue for other ops.
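For a sense of scale, here’s a rough sketch of what flushing denormals to zero does to float32 values, again in Python. `to_f32` and `flush_to_zero` are hypothetical helpers that only mimic the behavior on the CPU, and `ASSUMED_ABS_ERROR` is a placeholder I picked for illustration, not a number taken from the chart; substitute whatever the chart actually lists.

```python
import math
import struct

FLT_MIN_NORMAL = 2.0 ** -126                   # smallest positive normal float32
FLT_MIN_DENORM = 2.0 ** -149                   # smallest positive denormal float32
FLT_MAX_DENORM = FLT_MIN_NORMAL - FLT_MIN_DENORM  # largest denormal float32

# Placeholder absolute error for a fast intrinsic (assumption, not from the chart).
ASSUMED_ABS_ERROR = 2.0 ** -21


def to_f32(x: float) -> float:
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def flush_to_zero(x: float) -> float:
    """Mimic flush-to-zero: float32 denormals become a signed zero."""
    y = to_f32(x)
    if y != 0.0 and abs(y) < FLT_MIN_NORMAL:
        return math.copysign(0.0, y)
    return y


if __name__ == "__main__":
    print("1e-40 as float32:       ", to_f32(1e-40))
    print("1e-40 with flush-to-zero:", flush_to_zero(1e-40))
    print("assumed error / largest denormal:", ASSUMED_ABS_ERROR / FLT_MAX_DENORM)
```

The only values that change are the ones between `FLT_MIN_DENORM` and `FLT_MAX_DENORM`; everything from `FLT_MIN_NORMAL` upward passes through untouched, so the loss is entirely at the tiny end of the range.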