
0 points · abecedarius · 7y ago · 0 comments

It’s interesting though that the way calculus is classically taught does not make this obvious.

0 comments

Hmmm, I don’t know. If you’re allowed to skim over edge cases, the statement of the chain rule is pretty obvious: the composition of two linear functions is another linear function, with the coefficients multiplied.

abecedarius (OP) · 7y ago

I mean that knowing the chain rule does not, historically, imply seeing that automatic differentiation is possible and efficient.

kxyvr · 7y ago

It's not, and I also find it a bit frustrating when the default answer is just "the chain rule". For me, the key insights into how AD is derived are the following:

1. There is a fundamental difference between normed spaces and inner product spaces, and this affects what algorithms we can derive. Specifically, forward mode corresponds to a version of the chain rule that only requires a normed space and not an inner product space. If we assume that we have an inner product, then we can apply the Riesz representation theorem. This is important because it means that there's a concrete element in the space that corresponds to the derivative. This is precisely the gradient. Further, we have a concrete way of finding this element. Because we have a (complete) inner product space, we have a Hilbert adjoint. The gradient, and ultimately reverse mode, can be calculated by combining the chain rule with the Hilbert adjoint. Specifically,

  (f o g)'(x) dx = f'(g(x)) g'(x) dx
                 = (f'(g(x)) g'(x) dx) 1
                 = <f'(g(x)) g'(x) dx, 1>
                 = <g'(x) dx, f'(g(x))* 1>
                 = <dx, g'(x)* f'(g(x))* 1>
                 = <dx, g'(x)* grad f(g(x))>

Here, * represents the Hilbert adjoint. Anyway, we get away with this because the derivatives are really linear operators, which we get from the total derivative.
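To make the identity above concrete, here is a minimal sketch in plain Python. Everything in it is an assumption chosen for illustration: the functions `f` and `g`, their hand-written Jacobians, and the helper names. It works in R^n with the standard dot product, where the Hilbert adjoint of a matrix is just its transpose. Forward mode pushes `dx` through `g'(x)` and then `f'(g(x))`; reverse mode pulls `1` back through the adjoints in the opposite order.

```python
import math

def g(x):
    # Illustrative g: R^2 -> R^2
    return [x[0] * x[1], math.sin(x[0])]

def g_jac(x):
    # Jacobian of g at x (rows are outputs, columns are inputs)
    return [[x[1], x[0]],
            [math.cos(x[0]), 0.0]]

def f(y):
    # Illustrative f: R^2 -> R
    return y[0] ** 2 + 3.0 * y[1]

def f_jac(y):
    # 1x2 Jacobian of f at y
    return [[2.0 * y[0], 3.0]]

def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def transpose(A):
    # With the standard dot product, the adjoint is the transpose
    return [list(col) for col in zip(*A)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def forward_mode(x, dx):
    # (f o g)'(x) dx = f'(g(x)) g'(x) dx
    return matvec(f_jac(g(x)), matvec(g_jac(x), dx))[0]

def reverse_mode(x):
    # grad (f o g)(x) = g'(x)* f'(g(x))* 1
    grad_f = matvec(transpose(f_jac(g(x))), [1.0])
    return matvec(transpose(g_jac(x)), grad_f)

x = [0.5, 2.0]
dx = [0.3, -0.7]
directional = forward_mode(x, dx)   # one direction per pass
gradient = reverse_mode(x)          # all directions in one pass
```

The two results should agree in the sense of the last line of the derivation: `dot(gradient, dx)` matches `directional` up to rounding. Note that forward mode never needed `transpose` or `dot`, which is the normed space versus inner product space distinction in miniature.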

2. Eventually, we feed the gradients `grad f(g(x))` into the adjoint of `g'(x)`. However, it turns out that we can delay sending the values of the adjoint of `g'(x)` into its parent until we have accumulated all of the gradient information being fed to it. Essentially, this means that we can follow a computational graph and accumulate all of the intermediate derivative information before moving on. This means that we can traverse the graph a single time in reverse mode, which is efficient. How we traverse this graph corresponds to a topological sort of the graph. This can be generated at compile time using what's called a tape. This may or may not be more efficient modulo a bunch of reasons, which aren't all that important right now.
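A rough sketch of the accumulation idea, again in plain Python. The `Var` class, the `sin` wrapper, and `backward` are hypothetical names, not any library's API, and this version records the tape while the program runs through operator overloading rather than generating it at compile time. Since a node can only be built from nodes that already exist, creation order is a topological order, so one walk over the reversed tape visits each node after everything that consumes it has already added its contribution.

```python
import math

class Var:
    """A scalar node that appends itself to a shared tape on creation."""

    def __init__(self, value, tape, parents=()):
        self.value = value
        self.grad = 0.0          # accumulated derivative of the output w.r.t. this node
        self.parents = parents   # tuple of (parent node, local partial derivative)
        self.tape = tape
        tape.append(self)        # creation order is a topological order

    def __add__(self, other):
        return Var(self.value + other.value, self.tape,
                   ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value, self.tape,
                   ((self, other.value), (other, self.value)))

def sin(v):
    return Var(math.sin(v.value), v.tape, ((v, math.cos(v.value)),))

def backward(output):
    # Single reverse sweep: a node's grad is complete by the time we
    # reach it, because all of its consumers sit later on the tape.
    output.grad = 1.0
    for node in reversed(output.tape):
        for parent, partial in node.parents:
            parent.grad += partial * node.grad

tape = []
x = Var(0.5, tape)
y = Var(2.0, tape)
three = Var(3.0, tape)
u = x * y                     # u feeds into u * u twice
z = u * u + three * sin(x)    # x feeds into both u and sin(x)
backward(z)
```

Here `x.grad` picks up contributions from two paths (through `u` and through `sin`), and `u.grad` picks up two contributions from `u * u`, before either is passed further up. That is the delayed hand-off described above.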

Anyway, this is more to say that automatic differentiation is not an obvious result from basic calculus. At least not to me. For me, it required knowledge of real and functional analysis, graph theory, and programming languages. It's not impossible to learn, but it's not trivial in my opinion.
