[1]: https://dl.acm.org/doi/10.1145/3236765
[2]: https://news.ycombinator.com/item?id=18306860
[3]: Talk at Microsoft Research: https://www.youtube.com/watch?v=ne99laPUxN4 Other presentations listed here: https://github.com/conal/essence-of-ad
Unfortunately it's not. Elliott never actually demonstrates in the paper how to implement such an algorithm, and it's very hard to write compiler transformations in "categorical form".
(Disclosure: I'm the author of another paper on AD.)
Also, just curious, what are your studies in?
Sometimes there are better ways to describe "how does changing x affect y". Derivatives are powerful, but they are not the only possible description of such relationships.
I'm very excited about what future "compilers" will be able to do to programs besides differentiation. That's just the beginning.
I have one remark, though: If your language allows for automatic differentiation already, why do you bother with a neural network in the first place?
I think you should have a good reason why you choose a neural network for your approximation of the inverse function, and why it has exactly that number of layers. For instance, why shouldn't a simple polynomial suffice? Could it be that your neural network ends up as an approximation of the Taylor expansion of your inverse function?
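To make the point concrete, here's a minimal sketch (the target function f is a hypothetical stand-in, not from the thread): approximating an inverse function with a plain polynomial is a one-line least-squares fit, no training loop required.

```python
import numpy as np

# Hypothetical target: f(x) = x + 0.1*sin(x) on [0, 2*pi], which is
# monotone, so its inverse exists. We fit a polynomial g(y) ~ f^{-1}(y).
f = lambda x: x + 0.1 * np.sin(x)

x = np.linspace(0.0, 2 * np.pi, 400)
y = f(x)

# Swap the roles of x and y and fit a degree-7 polynomial by least squares
coeffs = np.polyfit(y, x, deg=7)
g = np.poly1d(coeffs)

# Sanity check: f(g(y)) should reproduce y on the sampled range
err = np.max(np.abs(f(g(y)) - y))
print(err)
```

For a smooth, well-conditioned inverse like this one, a low-degree polynomial already fits to high accuracy, which is exactly the "why not a polynomial?" question above.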
So it's not really about NNs vs polynomials/Fourier/Chebyshev, etc., but rather what is the right thing to use at a given time. In the universal differential equations paper (https://arxiv.org/abs/2001.04385), we demonstrate that some examples of automatic equation discovery are faster using Fourier series than NNs (specifically, the discovery of a semilinear partial differential equation). That doesn't mean NNs are a bad way to do all of this; it just means that they are one trainable object that is good for high-dimensional approximations, and libraries should allow you to easily move between classical basis functions and NNs to achieve the best performance.
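As a sketch of why classical bases can be so cheap as trainable objects (this is not the paper's code, just an illustration with a made-up signal): fitting a truncated Fourier series is linear least squares, so "training" is a single matrix solve.

```python
import numpy as np

t = np.linspace(0.0, 2 * np.pi, 400, endpoint=False)
target = np.sin(t) + 0.5 * np.cos(3 * t)   # hypothetical periodic signal

K = 5  # number of harmonics in the truncated series
# Design matrix with columns [1, cos(k t), sin(k t)] for k = 1..K
cols = [np.ones_like(t)]
for k in range(1, K + 1):
    cols.append(np.cos(k * t))
    cols.append(np.sin(k * t))
A = np.stack(cols, axis=1)

# The whole "training" step: one linear least-squares solve
coeffs, *_ = np.linalg.lstsq(A, target, rcond=None)
residual = np.max(np.abs(A @ coeffs - target))
print(residual)
```

Because the signal lies in the span of the basis, the residual is at machine precision; a neural network would need an iterative optimizer to get anywhere close.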
If you're talking about the specific cannon problem, you don't need to do any learning at all; you can just solve the kinematics. So in some sense you could ask why you're using any approximation function at all.
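"Just solve the kinematics" means the idealized cannon (no drag, flat ground) has a closed-form inverse: from the range formula R = v² sin(2θ)/g you can solve for the launch angle directly. A sketch with hypothetical numbers:

```python
import math

def launch_angle(v, R, g=9.81):
    """Low launch angle (radians) that lands an ideal projectile at range R."""
    return 0.5 * math.asin(g * R / v**2)

v = 30.0                          # muzzle speed in m/s (made-up value)
theta = launch_angle(v, R=60.0)

# Check by plugging back into the forward kinematics
R_check = v**2 * math.sin(2 * theta) / 9.81
print(theta, R_check)
```

No learned approximation is needed; the inverse is exact up to floating-point error.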
Computing the derivatives is not very difficult with e.g. JAX, but ... you run into the memory issue again. The Hessian is a square matrix, so in deep learning, if we have a million parameters, then the Hessian has a trillion entries...
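The usual escape hatch is to never materialize the Hessian: most second-order methods only need Hessian-vector products, which cost O(n) memory. In JAX that's forward-over-reverse (jvp over grad); the sketch below uses a central difference of the gradient as a library-free stand-in, on a made-up objective with a known Hessian so we can check it.

```python
import numpy as np

# Hypothetical objective: f(x) = sum(x^4) / 4, so grad f = x^3 and the
# Hessian is diag(3 x^2). We compute H @ v from two gradient evaluations:
#   H v  ~  (grad_f(x + eps*v) - grad_f(x - eps*v)) / (2*eps)
def grad_f(x):
    return x**3

def hvp(x, v, eps=1e-4):
    # O(n) memory: only gradients, never the n x n Hessian
    return (grad_f(x + eps * v) - grad_f(x - eps * v)) / (2 * eps)

rng = np.random.default_rng(1)
n = 1000
x = rng.standard_normal(n)
v = rng.standard_normal(n)

approx = hvp(x, v)
exact = 3 * x**2 * v              # known diag-Hessian times v
err = np.max(np.abs(approx - exact))
print(err)
```

For a million parameters this stores vectors of length 10^6 instead of a matrix with 10^12 entries, which is why methods like conjugate gradient on HVPs are used instead of forming the Hessian.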
I understand perfectly what it means for a neural network, but what about more abstract things?
I'm not even sure the implementation, as currently presented, actually means anything. What is the derivative of a function like List, Sort, or GroupBy? These articles all assume that it somehow just looks like the derivative from calculus.
Approximating everything as some non-smooth real function doesn't seem entirely morally correct. A program is more discrete, or synthetic. I think it should be a bit more algebraically flavoured, like differentials over a ring.
The Pontryagin maximum principle (or minimum, if you define your objective function with a minus sign) is the essence of that approach to optimal control.