If you're interested, you can peruse the C code that was used to generate the tables. Here's the file for sine/cosine:
https://github.com/RandalLinden/DOOM-FX/blob/master/source/m...
Pentium was fast enough that it didn't matter as much.
Just a few years later, reading from a precomputed trig table was actually slower than computing the value directly.
The approach is older than that. I remember my grandfather's engineering books from the 1950s – nearly every one of them had a large addendum with said pre-calculated sine, cosine, tangent and logarithm lookup tables. And there was at least one book that contained only such tables and nothing else.
That is how engineers used to calculate before the advent of computers.
It is interesting that you can get very decent results even with low quality tables. Of course there will be artifacts but due to the randomness of a path tracer this is not always very noticeable.
Is this really the order of events? I imagine the pre-calculated route is what you'd try first, and only go for extra hardware if that failed somehow.
E.g. I would assume that Math.sin(x) returns the same thing in NodeJS on Windows and Mac/M1, but it turns out it is not necessarily so. https://stackoverflow.com/questions/74074312/standard-math-f...
https://en.wikipedia.org/wiki/Rounding#Table-maker's_dilemma
True in general, but for the basic datatypes sent through hardware registers your processor architecture has fixed precision, so the time and space complexity is O(1).
We implemented a deterministic approximation, and moved on. But I learned something important about trig functions that day.
So while what you say is true about the x87 implementation of those functions, for anything targeting a machine built in the last 20 years it's likely the code will run consistently regardless of the architecture (barring architecture floating point bugs, which aren't terribly uncommon in the less significant bits and when overclocking comes into play).
x86 compilers won't use x87 instructions when SSE2 and later are available. x87 is just a really weird and funky instruction set that's best left in the gutter of history.
Unfortunately, floating point results will probably continue to differ across platforms for the foreseeable future.
Hmmm, can you use long doubles in SSE or AVX? They are glorious, and as far as I can see from playing with Godbolt, they still require dirtying your hands with the x87 stack.
I would use the word "consistent."
Non-determinism implies randomness.
Plan 9: https://9p.io/sources/plan9/sys/src/libc/port/sin.c
Freebsd: https://cgit.freebsd.org/src/tree/lib/msun/src/k_sin.c
and the directory name is literally: `msun`
"Finding the BEST sine function for Nintendo 64"
The degree-9 polynomial, said to be a thousand times better than the original Taylor approximation in maximum error, also appears to be very close to the Taylor series in the first place.
Writing out the Taylor coefficients:
1/3! = 0.166667
1/5! = 0.008333
1/7! = 0.000198
1/9! = 0.0000027(56)
The first two are exact to the digits shown, the third matches the article's to 5 decimal places only (the article has 0.000190), and the fourth differs already in its second significant digit (the article has 0.0000026019).
The delta in the 9th-order coefficient is expected if you truncate the Taylor series from the 11th order onwards (− x^11/11! + x^13/13! − ...): folding that slightly negative tail into the x^9 term lowers it a bit.
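For reference, the exact leading digits of those coefficients (a quick Python check):

```python
import math

# Exact Taylor coefficients of sin(x) for the odd powers discussed above.
for n in (3, 5, 7, 9):
    print(f"1/{n}! = {1 / math.factorial(n):.10f}")
# 1/3! = 0.1666666667
# 1/5! = 0.0083333333
# 1/7! = 0.0001984127
# 1/9! = 0.0000027557
```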
https://www.wolframalpha.com/input?i=plot+sin%28x%29+-+P%28x....
You'll note that the error is small near zero and large near pi/4, which is characteristic of a Taylor series around zero, rather than the level equioscillation typical of Remez approximations[1]. Note also that the polynomial only includes odd terms, which is not something I would expect from Remez unless it was run on the symmetric interval [-pi/4, pi/4].
I ran Remez for the problem and after 4 iterations obtained a degree 8 polynomial with error less than 1e-9, but it didn't look anything like the polynomial given in the article.
1.7209863008943345e-05 x^8 - 0.00024575124459624625 x^7 + 7.194649190849227e-05 x^6 + 0.008268794893899754 x^5 + 3.425379759410762e-05 x^4 - 0.16667692317020713 x^3 + 1.5422400957890642e-06 x^2 + 0.9999999106213322 x + 8.526477071446453e-10
Although of course the first few digits of the low-order matching terms will be very similar - any polynomial approximation method will agree there, because they are fitting the same function, after all. But by the time we reach x^5 or x^7 the agreement is very loose; really only the first couple of digits are the same.

Yes, I too was expecting oscillations (since I saw Chebyshev polynomials mentioned on the wiki page). I'm beginning to think that the word "might" in their statement "One way to calculate is Remez’s algorithm. The results might look like this:" is doing a lot of heavy lifting.
A couple of notes about your results:
- The interval only needs to be [0, pi/16] and not [0, pi/4]; adjusting that in your wolframalpha query yields a similar bound to what the author lists: 3.5e-9. (Though the same observation you made about the behavior near the bounds applies.)
- The polynomial you obtained actually looks pretty close to the Taylor one: the even coefficients are small, and rounding to 5 or 6 digits might yield a close one, maybe if we look for a degree 9 polynomial instead of degree 8. (I don't have an implementation of Remez handy right now to try it myself).
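A quick stand-in, though: interpolating sin at Chebyshev nodes is near-minimax (not true Remez, but close and easy with NumPy), and on a symmetric interval it reproduces the odd-only structure:

```python
import numpy as np

# Interpolate sin at Chebyshev nodes on [-pi/4, pi/4]: a cheap near-minimax
# stand-in for Remez. On a symmetric interval the even-power coefficients
# come out ~0 (up to rounding), matching the observation above.
deg = 9
k = np.arange(deg + 1)
nodes = np.cos((2 * k + 1) * np.pi / (2 * (deg + 1)))   # Chebyshev nodes in [-1, 1]
x = nodes * (np.pi / 4)                                 # scaled to [-pi/4, pi/4]
coeffs = np.polyfit(x, np.sin(x), deg)                  # highest power first

xs = np.linspace(-np.pi / 4, np.pi / 4, 10001)
max_err = np.max(np.abs(np.polyval(coeffs, xs) - np.sin(xs)))
print(max_err)       # ~1e-10 or better
print(coeffs[1::2])  # even-power coefficients (x^8, x^6, x^4, x^2, x^0), all ~0
```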
> Although of course the first few digits of the low-order matching terms will be very similar
Makes sense.
It’s a very simple iterative algorithm, essentially the dumbest thing that could possibly work (like most good algorithms). It fails to converge for functions that have poles nearby unless you have a very good initial guess (the Chebyshev or Carathéodory-Fejér approximants are ~always good starting points and easily computed). In practice you want to optimize a weighted L-inf norm rather than absolute, because floating-point errors are measured in a relative norm.
https://basesandframes.files.wordpress.com/2016/05/rgreenfas...
-(-(sin(x))), or for style points just -cos'
If you use cos(x) = 1 then you are still accurate to within 1%
[edit]
I think I also found an error in TFA? It seems like picking the "best" N would allow r to be in the range -pi/32 < r < pi/32, which makes the 3rd order Taylor series have an error of 4e-12, significantly better than the error range for 0 < r < pi/16
(Also has a nice property, apparently, that CORDIC-like routines exist for a bunch of special functions and they're very similar to each other. Does anyone have a good resource for learning the details of those algorithms? They sound elegant).
There are a lot of references given in the book including more details on CORDIC.
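For a taste of the idea, here's a minimal rotation-mode CORDIC in floating-point Python. Real hardware uses fixed point and shifts; the iteration count and names here are just illustrative:

```python
import math

N = 40                                    # iterations ~ bits of accuracy
ANGLES = [math.atan(2.0 ** -i) for i in range(N)]
K = 1.0
for i in range(N):
    K /= math.sqrt(1.0 + 4.0 ** -i)       # cumulative CORDIC gain correction

def cordic_sin_cos(theta):
    # Rotation mode: start at (K, 0) and rotate by +/- atan(2^-i) steps
    # until the residual angle z is driven to ~0. Converges for
    # |theta| <= sum(ANGLES) ~ 1.74 rad.
    x, y, z = K, 0.0, theta
    for i in range(N):
        d = 1.0 if z >= 0.0 else -1.0     # rotate toward the remaining angle
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ANGLES[i]
    return y, x                           # (sin(theta), cos(theta))

s, c = cordic_sin_cos(1.0)
```

Each step only needs a shift, an add, and one table entry, which is why it was attractive for hardware with no multiplier.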
That said, last time I had to do that in software, I used Taylor series. Might not have been an optimal solution.
EDIT:
AMD's Zen 4 takes 50-200 cycles (latency) to compute sine. I think that strongly suggests AMD uses CORDIC. https://www.agner.org/optimize/instruction_tables.pdf page 130.
Same for Intel, Tiger Lake (Intel gen 11) has 60-120 cycles of latency. Page 353.
I'd guess usually ~50 cycles on Zen 4 (and ~60 on Intel) for float32, with the float64/float80 datatypes toward the upper end of the range. Denormals might also cost more cycles.
Couldn't wrap my head around the concept until the topic was revisited in college, but the idea was there and helped me understand the tradeoffs brought by the "fast math" libraries I was using during high school.
They use a lookup table for nπ/16, and then a polynomial to approximate the difference to other values.
And that's only float32!
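A hypothetical sketch of that scheme in double precision (the table layout, polynomial degrees, and names here are my own choices, not the library's):

```python
import math

# Table of sin/cos at multiples of pi/16, plus short Taylor polynomials
# for the remainder r. Rounding to the NEAREST multiple keeps |r| <= pi/32.
STEP = math.pi / 16
SIN_TAB = [math.sin(n * STEP) for n in range(32)]   # one full period = 32 steps
COS_TAB = [math.cos(n * STEP) for n in range(32)]

def sin_table(x):
    n = round(x / STEP)                             # nearest multiple
    r = x - n * STEP                                # |r| <= pi/32 ~ 0.098
    sin_r = r - r**3 / 6 + r**5 / 120               # truncation error ~ r^7/5040
    cos_r = 1 - r**2 / 2 + r**4 / 24 - r**6 / 720
    # Angle addition: sin(n*STEP + r) = sin(n*STEP)cos(r) + cos(n*STEP)sin(r)
    return SIN_TAB[n % 32] * cos_r + COS_TAB[n % 32] * sin_r
```

With just 32 table entries this stays within a few 1e-11 of math.sin over several periods.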
I’m brushing up my math basics (I graduated CS while dodging the math requirements) and it frustrates me that in trig I need to remember values at all. Values such as sqrt(2)/2 make sense, but how hard is it to calculate sin(5 degrees) by hand?
0 is a decent approximation of sin near 0.
x is a better one.
x - x^3/6 is an even better one.
x - x^3/6 + x^5/120 ...
Note that x here is in radians rather than degrees, so convert (degrees * pi/180) first. Keep adding terms until you're satisfied with how many stable digits you get.
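Putting those steps into a sketch (Python; the stopping tolerance is an arbitrary choice):

```python
import math

def sin_taylor(degrees, tol=1e-12):
    x = math.radians(degrees)                  # degrees * pi/180
    total, term, n = 0.0, x, 1
    while abs(term) > tol:                     # stop once terms stop mattering
        total += term
        term *= -x * x / ((n + 1) * (n + 2))   # next odd-order Taylor term
        n += 2
    return total

print(sin_taylor(5))   # sin(5 degrees) ~ 0.0871557
```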
I wonder what the first program was that ever used a precomputed trig table.
Any calculating job can be undertaken by a proper Turing machine. You just have to keep in mind the old triangle of usage: Cost, capability and speed.
If a human can calculate a sine, so can any full-blown computer.
For comparison:
- Apple’s M2 Ultra has about 134 billion transistors. That’s about 2^37.
- Avogadro’s number is about 2^79.
Reducing the argument to a small range around zero decreases that a lot, but not nearly enough. There are 2^52 doubles in [0.5, 1.0), 2^52 more in [0.25, 0.5), 2^52 more in [0.125, 0.25), etc., so you’d still easily need 2^52 entries or 2^58 bits (likely way, way more).
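The per-binade count is easy to verify, since consecutive doubles have consecutive bit patterns (Python sketch):

```python
import struct

def bits(x):
    # Raw IEEE-754 bit pattern of a double; adjacent doubles within a
    # binade have adjacent patterns, so subtracting patterns counts them.
    return struct.unpack("<Q", struct.pack("<d", x))[0]

count = bits(1.0) - bits(0.5)
print(count == 2 ** 52)   # True: one binade holds 2^52 doubles
```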
https://en.m.wikipedia.org/wiki/Pentium_FDIV_bug
“It is implemented using a programmable logic array with 2,048 cells, of which 1,066 cells should have been populated with one of five values: −2, −1, 0, +1, +2.”
Not sure what you’re trying to demonstrate, they wouldn’t store every single float!! I hope you don’t program. ;)
Given the ridiculous number of transistors and so on we can use in CPUs and GPUs nowadays, how feasible is a relatively huge trig lookup table burned into ROM?
Such a table for double-precision would be much, much larger.
If you don't believe me, try x=10.
You wouldn't evaluate the terms naively from left-to-right.
x - x^3/3! + x^5/5! - ... = x * (1 - x^2/(2*3) * (1 - x^2/(4*5) * ... ) )
I just checked in python and you get a result that is around 1000*machine epsilon off. Not great, not terrible.
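For anyone who wants to reproduce that check, a sketch (my own term recurrence, not necessarily the exact code used above):

```python
import math
import sys

# Sum the Taylor series for sin(10) naively, left to right. The
# intermediate terms peak around 2.7e3, so several leading digits cancel
# and the last digits of the result are noise.
def sin_naive(x, terms=40):
    total, term = 0.0, x
    for n in range(1, 2 * terms, 2):
        total += term
        term *= -x * x / ((n + 1) * (n + 2))   # next odd-order Taylor term
    return total

err = abs(sin_naive(10.0) - math.sin(10.0))
print(err / sys.float_info.epsilon)   # far more than a few epsilons
```

For small x (say x=1) the same loop is accurate to within a few epsilons; the blow-up at x=10 is purely the cancellation, which is why real implementations reduce the range first.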