A lot of what looks new in the current wave of learning is simply that we now have the compute power to do these things in more places. Much of it has long been done in expensive environments that many of us just don't have access to.
I watched the ML/AI bros actively ignore previous research — even when they were requested to properly cite sources they were plagiarizing — in real time. The race to publish (even for big journals) was so important that it was easier to ignore the rank dishonesty than it was to correct their misbehavior. I'm 1000x happier to not have stayed around for all that crap.
That sounds like an interesting read. Do you have the chapter or the reference to the paper that you can share?
Regarding the crop of deep neural network research, the reputation for self-serving and willful blindness is well deserved.
A grad student from Hinton's lab mentioned one researcher who would misspell a citation on purpose so that the citation count of the cited paper does not go up.
And it is terribly unstable numerically. f(x) and f(x+h) are very similar because h is very small, so you have to expect catastrophic cancellation when you subtract them. For black boxes it is the only real alternative though; you can do a bit better by taking the difference in both directions (a central difference).
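To make the cancellation concrete, here is a rough sketch in plain Python. The helper names `forward_diff` and `central_diff` are mine, not from any library, and `sin` at x = 1 is just a convenient test function with a known derivative. It prints the error of both estimates for a range of step sizes, so you can watch the error shrink with h at first and then grow again once rounding dominates.

```python
import math

def forward_diff(f, x, h):
    """One-sided estimate (f(x + h) - f(x)) / h; truncation error is O(h)."""
    return (f(x + h) - f(x)) / h

def central_diff(f, x, h):
    """Two-sided estimate (f(x + h) - f(x - h)) / (2h); truncation error is O(h^2)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

x = 1.0
exact = math.cos(x)  # known derivative of sin at x

# Errors at a moderate step, and at a step so small that cancellation dominates.
fwd_err_moderate = abs(forward_diff(math.sin, x, 1e-5) - exact)
cen_err_moderate = abs(central_diff(math.sin, x, 1e-5) - exact)
fwd_err_tiny = abs(forward_diff(math.sin, x, 1e-15) - exact)

for h in (1e-2, 1e-5, 1e-8, 1e-12, 1e-15):
    fwd = abs(forward_diff(math.sin, x, h) - exact)
    cen = abs(central_diff(math.sin, x, h) - exact)
    print(f"h={h:g}  forward error={fwd:.3e}  central error={cen:.3e}")
```

The central difference costs one extra function evaluation per derivative, which is the price of the better truncation error. Neither variant escapes the rounding problem at very small h.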
I think I've seen this notion that the constraint is purity also in the documentation of autodiff libraries, but that condition cannot be strong enough, right?
It is easy enough to come up with functions that are nowhere differentiable. So my question is, what are the actual requirements a state-of-the-art autodiff library has for the input function, and why do people focus on the purity aspect if that is probably the least of the problems?
For example f(x, y) = xy and then defining a differentiable function g(x) = f(x, a). You can imagine “a” being a state variable.
In terms of actual requirements, something that's sufficient [0] is for every sub-component to be differentiable and for no dynamic control flow to depend on the things being differentiated. In practice, most libraries wind up requiring something like this, mostly because it's very hard to do anything else. As an example, define f(x) := 0 for floats with an even LSB and 1 for floats with an odd LSB. Define g(x) := 1 - f(x). Neither of these is meaningfully differentiable, but g(x) + f(x) is identically equal to one. Autodiff relies crucially on the fact that it can perform local transformations, and that sort of whole-program analysis is (a) impossible in general, and (b) hard even when it's possible.
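If it helps to see the LSB example run, here is a small sketch in plain Python. It assumes IEEE-754 binary64 floats, and `to_bits` and `next_float` are hypothetical helpers written for this illustration only. `f` flips between 0 and 1 on adjacent floats, so there is no useful local derivative to attach to it, while the sum `h` is the constant 1.

```python
import struct

def to_bits(x: float) -> int:
    """Raw 64-bit pattern of x, assuming IEEE-754 binary64."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return bits

def next_float(x: float) -> float:
    """Adjacent larger float for positive finite x (illustrative helper only)."""
    (y,) = struct.unpack("<d", struct.pack("<Q", to_bits(x) + 1))
    return y

def f(x: float) -> float:
    return float(to_bits(x) & 1)  # 0.0 for an even LSB, 1.0 for an odd LSB

def g(x: float) -> float:
    return 1.0 - f(x)

def h(x: float) -> float:
    return f(x) + g(x)  # identically 1.0, so its true derivative is 0

x = 1.0
x_next = next_float(x)
print(f(x), f(x_next))  # f changes between neighbouring floats
print(h(x), h(x_next))  # h does not
```

A local rule would have to produce derivatives for `f` and `g` separately and add them, without knowing that the two jumps cancel. Seeing that requires looking at the whole expression.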
For local-only autodiff (the only thing people ever do), what's necessary is for every sub-component to have a derivative-like operator defined such that, whenever the sub-components are composed into a differentiable function, combining those operators with the chain rule and the other autodiff composition rules yields the derivative of that function. There are also some requirements on dynamic control flow. They don't have to be quite as strict as I described, but they can't be relaxed in general with local-only autodiff, so the control-flow requirement from the above paragraph is also necessary.
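A minimal forward-mode sketch of what "local" means here, in plain Python. The `Dual` class, `lift`, `sin`, and `derivative` are all names made up for this illustration, not any library's API. Each primitive only knows its own derivative rule, and the chain rule stitches them together as the function runs.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Dual:
    val: float        # value of the sub-expression
    dot: float = 0.0  # derivative of the sub-expression w.r.t. the input

    def __add__(self, other):
        other = lift(other)
        return Dual(self.val + other.val, self.dot + other.dot)  # sum rule
    __radd__ = __add__

    def __mul__(self, other):
        other = lift(other)
        # product rule, applied locally to this one multiplication
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def lift(v):
    """Treat plain numbers as constants (derivative 0)."""
    return v if isinstance(v, Dual) else Dual(float(v), 0.0)

def sin(d):
    d = lift(d)
    return Dual(math.sin(d.val), math.cos(d.val) * d.dot)  # chain rule for sin

def derivative(fn, x):
    """Seed the input with derivative 1 and read off the output's derivative."""
    return fn(Dual(x, 1.0)).dot

def composed(x):
    return sin(x * x) + 3.0 * x

x0 = 0.7
auto = derivative(composed, x0)
exact = 2.0 * x0 * math.cos(x0 * x0) + 3.0  # derivative worked out by hand
print(auto, exact)
```

Nothing in `Dual` ever looks at `composed` as a whole. That is why a primitive with no sensible local rule breaks the scheme, even when the overall function is perfectly smooth.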
There are few (zero?) components where that's possible -- an adversary can always come up with a composition for which the resulting derivative is incorrect. However, for some interesting functions (like eigenvalues and eigenvectors), used in the normal way people use them, these sorts of operators can be defined. E.g., the eigenvector derivative is unique only up to a choice of phase, but if your composition also doesn't depend on phase then you're still fine.
[0] Even for things like differentiating through a loop converging to a value, this property holds, with one meaningful exception: The error in the derivative compared with the true function you're approximating will still converge to zero with enough iterations, but the number of iterations required can be much higher than you need to get the function itself to converge. You _will_, however, get the derivative of the approximation exactly right.
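A sketch of the loop case in plain Python, using the Babylonian square-root iteration as a stand-in (my choice of example, not something from the thread). `sqrt_iter` is a hypothetical helper that carries the derivative with respect to `a` through each update by hand, forward-mode style. It prints the value error and the derivative error per iteration so the two can be compared. How large the gap between them gets depends on the iteration you pick.

```python
import math

def sqrt_iter(a, steps):
    """Return (x_n, d x_n / d a) for x_{k+1} = (x_k + a / x_k) / 2, x_0 = 1."""
    x, dx = 1.0, 0.0  # the starting guess does not depend on a
    for _ in range(steps):
        # Differentiate the update rule itself:
        # d/da [(x + a/x) / 2] = (dx + 1/x - a * dx / x^2) / 2
        x, dx = 0.5 * (x + a / x), 0.5 * (dx + 1.0 / x - a * dx / (x * x))
    return x, dx

a = 2.0
true_value = math.sqrt(a)
true_deriv = 0.5 / math.sqrt(a)  # d/da sqrt(a)

for n in range(1, 8):
    value, deriv = sqrt_iter(a, n)
    print(f"n={n}  value error={abs(value - true_value):.3e}  "
          f"derivative error={abs(deriv - true_deriv):.3e}")

# The derivative of the truncated loop is exact for the truncated loop:
# compare it with a central difference of the 3-step approximation.
h = 1e-6
approx_deriv = sqrt_iter(a, 3)[1]
numeric_deriv = (sqrt_iter(a + h, 3)[0] - sqrt_iter(a - h, 3)[0]) / (2.0 * h)
```

The last comparison is the "derivative of the approximation" part: `approx_deriv` matches the slope of the 3-step function, whether or not 3 steps is enough to match the slope of the real square root.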
I know autodiff isn’t lambda calculus, but the expression-based structure and evaluation rules feel similar. Couldn’t this be implemented in something like ML or Clojure? Just wondering what the custom DSL adds that existing functional languages wouldn’t already support.
As to what it adds:
- It's more accessible to a wider audience (and looks like how you'd implement autodiff in most languages)
- It runs in the browser trivially (powering those demos)
- The author (potentially) didn't have to learn a new language just to get started
- Programs are not fully differentiable, or at the very least there are some crazy edge cases and dragons lurking if you attempt to make them so. A dedicated whitelist of supported operations isn't necessarily a bad design, contrasted with an implicit whitelist in Clojure (depending on the implementation of course, but there wasn't a lot of source-to-source boilerplate even in this example, so I assume the benefit of a functional language would be stripping away some of the characteristics I think are important).