"By hard coding the forward and backward methods at comptime we have some more comfort with the general correctness and expected errors we would receive if we passed in incorrectly shaped data at runtime."
This solves the issue of shape checking incredibly cleanly. Python libraries have been struggling with this for a decade. It seems like you could also extend comptime usage to calculate allocations ahead of time.
Honestly, this whole thing makes me want to invest quite a bit of time into using Zig. Great post!
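Roughly what I mean by comptime shape checking, as a sketch (the `Layer` name and layout here are made up, not the post's actual code): encoding the dimensions as comptime parameters makes a shape mismatch a compile error instead of a runtime one:

```zig
// Hypothetical sketch: shapes as comptime parameters.
fn Layer(comptime in_dim: usize, comptime out_dim: usize) type {
    return struct {
        weights: [in_dim * out_dim]f32,

        // Accepts only inputs of exactly `in_dim` elements; passing a
        // differently sized array fails to compile.
        pub fn forward(self: *const @This(), x: [in_dim]f32) [out_dim]f32 {
            var y = [_]f32{0} ** out_dim;
            for (0..out_dim) |o| {
                for (0..in_dim) |i| {
                    y[o] += self.weights[o * in_dim + i] * x[i];
                }
            }
            return y;
        }
    };
}
```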
That said, I'm curious how well the compiler can optimize matrix operations, in Zig or elsewhere, say C or Rust, and when it's worth linking in BLAS, MKL, or some other library. I wonder if there is a sweet spot where it's worth doing.
EDIT: the MNIST link requires authentication: http://yann.lecun.com/exdb/mnist/ How can we get access to the test run at the end?
EDIT: FixedBufferAllocator has a reset method, so maybe you just need to use that to "free" all memory on each iteration? If you know the max memory usage at compile time, using it directly might be enough... I was just thinking it should be possible to use an arena allocator to basically tell the delegate allocator to make all its memory available again without freeing it back to the OS?! If anyone knows more about allocators I'd be happy to hear about it. https://ziglang.org/documentation/master/std/#A;std:heap.Fix...
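For what it's worth, a minimal sketch of the reset idea (buffer size and the per-iteration usage are made up): allocate scratch memory from a fixed buffer inside the training loop, then `reset()` to make the whole buffer available again without touching the OS:

```zig
const std = @import("std");

pub fn main() !void {
    var buf: [1 << 16]u8 = undefined;
    var fba = std.heap.FixedBufferAllocator.init(&buf);

    var iter: usize = 0;
    while (iter < 3) : (iter += 1) {
        const alloc = fba.allocator();
        // ... per-iteration scratch allocations ...
        const scratch = try alloc.alloc(f32, 1024);
        _ = scratch;
        // Everything allocated from `buf` is reusable again; nothing
        // is returned to the OS.
        fba.reset();
    }
}
```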
doesn't ask for auth for me?
while (i < I * O) : (i += 1) {
    self.weights[i] -= 0.01 * grads[i];
}
What are the options for vector / matrix / tensor operations in Zig?
2. Zig has built-in @Vector types that are fixed-size data types designed to compile down to things like SIMD as efficiently as possible, given that you might be asking it to do 16x operations on a CPU only supporting 8x-wide SIMD. You'd often write your high-level code as a runtime-known iteration count over those comptime-known vector widths.
2a. Inline assembly or inspecting the architecture before choosing the @Vector width are both options, so you can write your high-level code with that information in mind if necessary (e.g., to make Bolt vector quantization work well in Zig I'm pretty sure you need to inline-assembly one of the swizzling operations).
3. You can always link to LAPACK and friends. Zig has a great C interface.
4. Matrix/tensor ops aren't built-in. That doesn't matter for a lot of what this demo shows since it'll be RAM/cache bandwidth bound, but you'd definitely need to link in or hand-code inner product routines for better asymptotics and cache friendliness if you were doing many large matrix multiplies.
5. Wrapping any of the above into a library would be pretty easy. That sort of code is easy to write, so I haven't looked to see what other people have made in that space, but I'm sure there's something.
I'm not aware of anything in particular that would make multi-machine computations even slightly less painful than other languages, but maybe someone can chime in here with ideas.
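For 2., here's a sketch of the runtime-loop-over-comptime-width pattern (assumes a recent Zig, roughly 0.11+; `saxpy` is just an illustrative kernel, not from the post):

```zig
// A runtime-length loop over a comptime-known vector width. The
// compiler lowers @Vector(8, f32) to whatever SIMD the target has,
// splitting or widening as needed.
fn saxpy(a: f32, x: []const f32, y: []f32) void {
    const W = 8; // comptime-known width
    var i: usize = 0;
    while (i + W <= x.len) : (i += W) {
        const xv: @Vector(W, f32) = x[i..][0..W].*;
        const yv: @Vector(W, f32) = y[i..][0..W].*;
        const av: @Vector(W, f32) = @splat(a);
        y[i..][0..W].* = av * xv + yv;
    }
    // scalar tail for lengths that aren't a multiple of W
    while (i < x.len) : (i += 1) y[i] = a * x[i] + y[i];
}
```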
Inline assembly is great but support for intrinsics would be really valuable for Zig IMO.
A common example: if there's any accumulation/reduction, compilers will almost always fail to generate SIMD unless you use -funsafe-math-optimizations type flags, because of the non-associativity of floating point. Sum of squares is the classic example (not that that specific operation necessarily comes up in NNs).
Explicit vectorization (e.g., using intrinsics) is almost always a relatively simple way to get orders of magnitude speedup compared to auto-vectorization, because of the above. Also because data layouts usually need to change as well (AoS vs SoA, etc.), though NN people seem to write decent data layouts.
I don't have any experience with `#pragma omp` type approaches which may be a middle ground.
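To illustrate with the sum-of-squares example (a sketch, assuming a recent Zig with `@reduce`): keeping W independent partial sums explicitly changes the summation order, which is exactly the reassociation the compiler isn't allowed to do for you without unsafe-math flags:

```zig
// Explicitly vectorized sum of squares: W partial sums in a vector
// accumulator, combined at the end. Note the result can differ in the
// low bits from a strict left-to-right scalar sum.
fn sumOfSquares(xs: []const f32) f32 {
    const W = 8;
    var acc: @Vector(W, f32) = @splat(0.0);
    var i: usize = 0;
    while (i + W <= xs.len) : (i += W) {
        const v: @Vector(W, f32) = xs[i..][0..W].*;
        acc += v * v;
    }
    var total: f32 = @reduce(.Add, acc);
    // scalar tail
    while (i < xs.len) : (i += 1) total += xs[i] * xs[i];
    return total;
}
```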
Optimization in this case largely concerns memory management on the GPU and keeping data transfer between CPU and GPU to a minimum.
Essentially a massively parallel API combined with a massively parallel processor. I'm thinking of doing an end-to-end tutorial about this.
GGML seems to be just one gigantic 10K LOC C file anyways...
No, the bottleneck would be not utilizing the idling GPU.
> simple, purpose written NNs for many simple applications [... as opposed to] python and cuda libraries