[1]: https://dl.acm.org/doi/10.1145/3236765
[2]: https://news.ycombinator.com/item?id=18306860
[3]: Talk at Microsoft Research: https://www.youtube.com/watch?v=ne99laPUxN4 Other presentations listed here: https://github.com/conal/essence-of-ad
Unfortunately it's not. Elliott never actually demonstrates in the paper how to implement such an algorithm, and it's very hard to write compiler transformations in "categorical form".
(Disclosure: I'm the author of another paper on AD.)
Also, just curious, what are your studies in?
Sometimes there are better ways to describe "how does changing x affect y". Derivatives are powerful, but they are not the only possible description of such relationships.
I'm very excited about what future "compilers" will be able to do to programs besides differentiation. That's just the beginning.
I have one remark, though: If your language allows for automatic differentiation already, why do you bother with a neural network in the first place?
I think you should have a good reason why you choose a neural network for your approximation of the inverse function, and why it has exactly that number of layers. For instance, why shouldn't a simple polynomial suffice? Could it be that your neural network ends up as an approximation of the Taylor expansion of your inverse function?
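To make the point concrete, here's a minimal sketch (the target function f is a hypothetical stand-in, not from the thread): approximating an inverse function with a plain polynomial is a one-line least-squares fit, no training loop required.

```python
import numpy as np

# Hypothetical target: f(x) = x + 0.1*sin(x) on [0, 2*pi], which is
# monotone, so its inverse exists. We fit a polynomial g(y) ~ f^{-1}(y).
f = lambda x: x + 0.1 * np.sin(x)

x = np.linspace(0.0, 2 * np.pi, 400)
y = f(x)

# Swap the roles of x and y and fit a degree-7 polynomial by least squares
coeffs = np.polyfit(y, x, deg=7)
g = np.poly1d(coeffs)

# Sanity check: f(g(y)) should reproduce y on the sampled range
err = np.max(np.abs(f(g(y)) - y))
print(err)
```

For a smooth, well-conditioned inverse like this one, a low-degree polynomial already fits to high accuracy, which is exactly the "why not a polynomial?" question above.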
So it's not really about NNs vs polynomials/Fourier/Chebyshev, etc., but rather what is the right thing to use at a given time. In the universal differential equations paper (https://arxiv.org/abs/2001.04385), we demonstrate that some examples of automatic equation discovery are faster using Fourier series than NNs (specifically, the discovery of a semilinear partial differential equation). That doesn't mean NNs are a bad way to do all of this; it just means that they are one trainable object that is good for high-dimensional approximations, and libraries should allow you to easily move between classical basis functions and NNs to achieve the best performance.
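As a sketch of why classical bases can be so cheap as trainable objects (this is not the paper's code, just an illustration with a made-up signal): fitting a truncated Fourier series is linear least squares, so "training" is a single matrix solve.

```python
import numpy as np

t = np.linspace(0.0, 2 * np.pi, 400, endpoint=False)
target = np.sin(t) + 0.5 * np.cos(3 * t)   # hypothetical periodic signal

K = 5  # number of harmonics in the truncated series
# Design matrix with columns [1, cos(k t), sin(k t)] for k = 1..K
cols = [np.ones_like(t)]
for k in range(1, K + 1):
    cols.append(np.cos(k * t))
    cols.append(np.sin(k * t))
A = np.stack(cols, axis=1)

# The whole "training" step: one linear least-squares solve
coeffs, *_ = np.linalg.lstsq(A, target, rcond=None)
residual = np.max(np.abs(A @ coeffs - target))
print(residual)
```

Because the signal lies in the span of the basis, the residual is at machine precision; a neural network would need an iterative optimizer to get anywhere close.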
If you're talking about the specific cannon problem, you don't need to do any learning at all; you can just solve the kinematics. So in some sense you could ask why you're using any approximation function at all.
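"Just solve the kinematics" means the idealized cannon (no drag, flat ground) has a closed-form inverse: from the range formula R = v² sin(2θ)/g you can solve for the launch angle directly. A sketch with hypothetical numbers:

```python
import math

def launch_angle(v, R, g=9.81):
    """Low launch angle (radians) that lands an ideal projectile at range R."""
    return 0.5 * math.asin(g * R / v**2)

v = 30.0                          # muzzle speed in m/s (made-up value)
theta = launch_angle(v, R=60.0)

# Check by plugging back into the forward kinematics
R_check = v**2 * math.sin(2 * theta) / 9.81
print(theta, R_check)
```

No learned approximation is needed; the inverse is exact up to floating-point error.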
Computing the derivatives is not very difficult with e.g. JAX, but ... you run into the memory issue again. The Hessian is a square matrix, so in deep learning, if we have a million parameters, then the Hessian has a trillion entries...
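The usual escape hatch is to never materialize the Hessian: most second-order methods only need Hessian-vector products, which cost O(n) memory. In JAX that's forward-over-reverse (jvp over grad); the sketch below uses a central difference of the gradient as a library-free stand-in, on a made-up objective with a known Hessian so we can check it.

```python
import numpy as np

# Hypothetical objective: f(x) = sum(x^4) / 4, so grad f = x^3 and the
# Hessian is diag(3 x^2). We compute H @ v from two gradient evaluations:
#   H v  ~  (grad_f(x + eps*v) - grad_f(x - eps*v)) / (2*eps)
def grad_f(x):
    return x**3

def hvp(x, v, eps=1e-4):
    # O(n) memory: only gradients, never the n x n Hessian
    return (grad_f(x + eps * v) - grad_f(x - eps * v)) / (2 * eps)

rng = np.random.default_rng(1)
n = 1000
x = rng.standard_normal(n)
v = rng.standard_normal(n)

approx = hvp(x, v)
exact = 3 * x**2 * v              # known diag-Hessian times v
err = np.max(np.abs(approx - exact))
print(err)
```

For a million parameters this stores vectors of length 10^6 instead of a matrix with 10^12 entries, which is why methods like conjugate gradient on HVPs are used instead of forming the Hessian.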
I understand perfectly what it means for a neural network, but what about more abstract things?
I'm not even sure the implementation, as currently presented, actually means anything. What is the derivative of a function like List, Sort, or GroupBy? These articles all assume that it somehow just looks like the derivative from calculus.
Approximating everything as some non-smooth real function doesn't seem entirely morally correct. A program is more discrete, or synthetic. I think it should be a bit more algebraically flavoured, like differentials over a ring.
The Pontryagin maximum principle (or minimum, if you define your objective function with a minus sign) is the essence of that approach to optimal control.