I imagine a carefully crafted set of programming primitives used to build up the abstraction of a CPU…
“Every ALU operation is a trained neural network.”
Oh… oh. Fun. Just not the type of “interesting” I was hoping for.
Hmmm... I need to measure this piece of wood for cutting. Let me take a picture of it and see what the AI says its measurement is instead of using a measuring tape, because it is faster to use the AI.
https://github.com/howerj/muxleq has both muxleq (multiplexed subleq: the same idea, but the mux'ing instructions make it much faster) and subleq. As you can see, the implementation is trivial. Once it's compiled, you can run eForth, although I run a tweaked one with floats and some better commands. To get that, edit muxleq.fth and set the float option to 1, like this:
1 constant opt.float
The same goes for the classic do..loop structure from Forth, which is not
enabled by default (only the weird for..next one from eForth is): 1 constant opt.control
Then recompile: ./muxleq ./muxleq.dec < muxleq.fth > new.dec
And run: ./muxleq new.dec
Once you have a new.dec image, you can just use that from now on.
Funny enough, analog computing had the same inversion: a Gilbert cell does multiplication cheaply, while addition needs more complex summing circuits. Completely different path to the same result.
What I haven't seen discussed: if the whole CPU is neural nets, the execution pipeline is differentiable end-to-end. You could backprop through program execution. Useless for booting Linux, but potentially interesting for program synthesis — learning instruction sequences via gradient descent instead of search. Feels like that's the more promising research direction here than trying to make it fast.
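On the "implementation is trivial" point about subleq earlier in the thread: a one-instruction SUBLEQ machine really does fit in a few lines. What follows is a minimal Python sketch, not howerj's C code. The name `run_subleq`, the `max_steps` guard, and the convention that a destination operand of -1 means "output this cell" are assumptions chosen for illustration.

```python
def run_subleq(program, max_steps=10_000):
    """Run a SUBLEQ program (a list of ints) and return the list of output values.

    Each instruction is three cells: a, b, c.
    Assumed convention: if b == -1, the value at mem[a] is emitted as output.
    Otherwise mem[b] -= mem[a], and execution jumps to c when the result is <= 0.
    A negative or out-of-range program counter halts the machine.
    """
    mem = list(program)  # work on a copy so the caller's list is untouched
    out = []
    pc = 0
    for _ in range(max_steps):
        if pc < 0 or pc + 2 >= len(mem):
            break
        a, b, c = mem[pc], mem[pc + 1], mem[pc + 2]
        pc += 3
        if b == -1:
            out.append(mem[a])
        else:
            mem[b] -= mem[a]
            if mem[b] <= 0:
                pc = c
    return out


# Tiny hand-assembled program: emit 'H', emit 'i', then jump to -1 to halt.
hello = [
    9, -1, 3,    # output mem[9]
    10, -1, 6,   # output mem[10]
    11, 11, -1,  # mem[11] -= mem[11] gives 0, so jump to -1 (halt)
    72, 105, 0,  # data: 'H', 'i', scratch cell
]
message = "".join(chr(v) for v in run_subleq(hello))
```

Everything else (a Forth, floats, do..loop) is built on top of that single subtract-and-branch step.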
The cost of communicating information through space is dealt with in fundamentally different ways here. On the CPU it is addressed directly. The actual latency is minimized as much as possible, usually by predicting the future in various ways and keeping the spatial extent of each device (core complex) as small as possible. The GPU hides latency with massive parallelism. That's why we can put them across relatively slow networks and still see excellent performance.
Latency hiding does not cope well with workloads that are branchy and serialized, because there is only one logical thread of execution to work with. The CPU dominates this area because it doesn't cheat: it directly targets the objective. Making efficient, accurate control flow decisions tends to be more valuable than being able to process data in large volumes. It just happens that there are a few exceptions to this rule that are incredibly popular.
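A back-of-the-envelope way to see this is Little's law: sustained throughput is roughly the number of independent operations in flight divided by the latency of each one. The sketch below is a toy model, not a measurement of any real chip. The function name, the 400 ns and 1 ns latencies, and the in-flight counts are all hypothetical.

```python
def throughput_ops_per_sec(in_flight, latency_ns):
    """Little's law: ops completed per second with `in_flight` independent ops,
    each taking `latency_ns` nanoseconds from issue to completion."""
    return in_flight * 1_000_000_000 / latency_ns


# Hypothetical numbers, chosen only to show the shape of the trade-off.
gpu_parallel = throughput_ops_per_sec(in_flight=10_000, latency_ns=400)  # lots of independent work
gpu_serial = throughput_ops_per_sec(in_flight=1, latency_ns=400)         # one dependent chain
cpu_serial = throughput_ops_per_sec(in_flight=1, latency_ns=1)           # low latency, one chain
```

With plenty of independent work the high-latency device wins easily. With a single dependent chain there is nothing to overlap, so only the raw latency matters.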
This sentiment is not a recent thing. Ever since GPGPU became a thing, there have been people who first hear about it, don't understand processor architectures and get excited about GPUs magically making everything faster.
I vividly recall a discussion with some management type back in 2011, who was gushing about getting PHP to run on the new Nvidia Teslas, how amazingly fast websites will be!
Similar discussions also spring up around FPGAs again and again.
The more recent change in sentiment is a different one: the "graphics" origin of GPUs seems to have been lost to history. I have met people (plural) in recent years who thought (surprisingly long into the conversation) that I meant Stable Diffusion when talking about rendering pictures on a GPU.
Nowadays, the 'G' in GPU probably stands for GPGPU.
Have a CPU, GPU, FPGA, and other specialized chips like neural chips, all sharing unified memory, with specific workloads somehow pipelined to whichever chip handles them best.
I wasn't really aware people thought we would be running websites on GPUs.
I can see the same happening to the CPU. It will just take on the appropriate functionality to keep all the compute in the same chip.
It’s gonna take a while because Nvidia et al. like their moats.
How do you class systems like the PS5 that have an APU plugged into GDDR instead of regular RAM? The primary remaining issue is the limited memory capacity.
I wonder if we might see a system with GPU class HBM on the package in lieu of VRAM coupled with regular RAM on the board for the CPU portion?
How do you win moving your central controller from a 4GHz CPU to a multi-hundred-MHz single GPU core?
If we tried this, all we'd do is isolate a couple of cores in the GPU, let them run at some gigahertz, and then equip them with the additional operations they'd need to be good at coordinating tasks... or, in other words, put a CPU in the GPU.
https://www.fz-juelich.de/en/jsc/downloads/slides/bgas-bof/b...
[1]: https://breandan.net/2020/06/30/graph-computation#roadmap
You’re absolutely right! I made an arithmetic mistake there — 3 * 3 is 9, not 8. Let’s correct that: Before: EAX = 3 After imul eax, eax: EAX = 9 Thanks for catching that — the correct return value is 9.
Also, is it possible to use the GPU's ADD/MUL implementation? It is what a GPU does best.
As to why not use the ADD/MUL capabilities of the GPU itself, I guess it wasn’t in the spirit of the challenge. ;)
Most GPUs, sitting in racks in datacenters, aren't "processing graphics" anyhow.
Gross-Parallelization Units
Generative Procedure Units
Gratuitously Profiteering Unscrupulously
[0]: https://en.wikipedia.org/wiki/General-purpose_computing_on_g...
CPU = Compute
GPU = Impute
It might be worth having a CPU that's 100 times slower (25 MHz) if 1000 of them could be run simultaneously to potentially reach a 10 times speedup for embarrassingly parallel computation. But starting from a hole that's 625000x slower seems unlikely to lead to practical applications. Still a cool project though!
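The arithmetic behind that comparison is easy to check. This is a rough sketch that assumes perfect scaling across copies and treats "N times slower" as a plain ratio of clock rates. The function name `net_speedup` is made up for illustration.

```python
def net_speedup(slowdown, copies):
    """Speedup over one fast core when `copies` slow cores share a perfectly parallel job."""
    return copies / slowdown


# "100 times slower (25 MHz)" implies a baseline of 2.5 GHz.
baseline_hz = 25_000_000 * 100

print(net_speedup(100, 1000))      # 10.0   -> the hoped-for 10x win
print(net_speedup(625_000, 1000))  # 0.0016 -> still about 625x behind with 1000 copies
print(baseline_hz / 625_000)       # 4000.0 -> effective rate of roughly 4 kHz per copy
```

Under the same assumptions, break-even at a 625000x slowdown needs 625,000 copies running at once, and a 10x win needs 6.25 million.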
```lean
inductive HumanNeed where
  | retailArithmetic
  | genericLinkedInPost

inductive IndustrySolution where
  | commodityALU
  | frontierAutocomplete

def optimal : HumanNeed → IndustrySolution
  | .retailArithmetic => .commodityALU
  | .genericLinkedInPost => .frontierAutocomplete

def latency : IndustrySolution → Nat
  | .commodityALU => 1
  | .frontierAutocomplete => 248000

theorem superbowl_ads_have_not_improved_superdope_adds :
    latency (optimal .retailArithmetic) < latency .frontierAutocomplete := by
  decide
```
The creative thinking behind this project is truly mind-boggling.
Now we know future genius models won't even need CPUs, just tensor/rectifier circuits. If they need a CPU, they will just imagine them.
A low-bit model with adaptive sparse execution might even be able to imagine with performance. Effectively, neural PGA capability.
Wow. That's cool but what happens to the regular CPU?
For that, a completely different approach would be needed, e.g. implementing something akin to qemu, where each CPU instruction is translated into a graphics shader program. On many older GPUs it is impossible or difficult to launch a graphics program from inside another graphics program (rather than from the CPU), but where this is possible one could obtain a CPU emulation many orders of magnitude faster than what is demonstrated here.
Instead of going for speed, the project demonstrates a simpler self-contained implementation based on the same kind of neural networks used for ML/AI, which might work even on an NPU, not only on a GPU.
Because it uses inappropriate hardware execution units, the speed is modest and the speed ratios between different kinds of instructions are weird, but simulating the complete AArch64 ISA by such means is nonetheless an impressive achievement.
Can't wait for someone to build a DOOM that runs entirely on GPU!
Doesn't the Raspberry Pi's GPU boot up first, and then the GPU initializes the CPU?
With this technology, we've eliminated the need for that superfluous second step.
Does it need to be?
This is all a computer does :P
We need LLMs to be able to tap into that, not re-add the same functionality a layer above and MUCH less efficiently.
Agents, tool-integrated reasoning, even chain of thought (limited, for some math) can address this.
It's obviously slower than real CPU code, but still crazy fast for 'thinking' about it. They wouldn't need to actually simulate an entire program in a never-ending hot loop like a real computer. Just a few loops would explain a lot about a process and calculate a lot of precise information.
Thank you, Mr. Do-because-I-can!
Yours truly,
- GPU company CEO,
- Electric company CEO.
Built a GPU-Native UNIX OS. A full multi-process operating system running compiled C on Apple Silicon Metal:
> 25-command shell (ls, cd, cat, grep, sort, uniq, tee, cp, wc, pipes, background jobs, chaining, redirect) — ~17.5KB freestanding C compiled with aarch64-elf-gcc -O2, running entirely as ARM64 on the GPU
> Multi-process: fork/wait/pipe/dup2 via memory swapping. 1MB backing stores, up to 15 concurrent processes, round-robin scheduler, pipe blocking/wakeup, fork bomb protection, SIGTERM/SIGKILL, orphan reparenting. 28 syscalls total.
> Freestanding C runtime: malloc/free/printf/fork/wait/pipe/qsort/strtol — all on GPU
Self-hosting C compiler on Metal GPU. cc.c (~2,800 lines) compiles C→ARM64 entirely on the GPU, then executes the output on the same GPU. Three layers: host GCC → GPU compiler → GPU-compiled binary. Debugged 5 codegen bugs to get it working (UBFM encoding, LDURSW sign-extension, caller-save clobbering, array subscript type clobbering, struct lvalue handling). Supports structs, pointers, arrays, recursion, for/while/do-while, ternary, sizeof, compound assignment, bitwise, short-circuit eval. 20/20 test programs pass. Mean compile: ~50K GPU cycles. Ackermann A(3,4) runs 319K cycles of deep recursion correctly.
13+ compiled C applications on Metal:
> Crypto: SHA-256, AES-128 (ECB+CBC, 6/6 FIPS vectors pass), encrypted password vault
> Games: Tetris, Snake, roguelike dungeon crawler, text adventure
> VMs: Brainfuck interpreter, Forth REPL, CHIP-8 emulator
> Networking: HTTP/1.0 server (TCP proxied through Python)
> Neural net: MNIST classifier (784→128→10, Q8.8 fixed-point)
> Tools: ed line editor, self-hosting C compiler, Game of Life
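For anyone unfamiliar with the Q8.8 notation in the MNIST entry: it is a 16-bit fixed-point format with 8 integer bits and 8 fractional bits, so a stored integer q represents q / 256. The Python below is a generic illustration of the format, not the project's C code. The helper names are hypothetical, and it skips the wrapping or saturation to 16 bits that a real implementation would need.

```python
SCALE = 1 << 8  # Q8.8: 8 fractional bits, so 1.0 is stored as 256


def to_q88(x):
    """Encode a float as a Q8.8 integer (no range check in this sketch)."""
    return int(round(x * SCALE))


def from_q88(q):
    """Decode a Q8.8 integer back to a float."""
    return q / SCALE


def q88_mul(a, b):
    """Multiply two Q8.8 values: the raw product has 16 fractional bits,
    so shift right by 8 to get back to Q8.8."""
    return (a * b) >> 8


weight = to_q88(1.5)     # 384
pixel = to_q88(-0.25)    # -64
product = q88_mul(weight, pixel)
print(from_q88(product))  # -0.375
```

Addition needs no correction at all, which is why this format is popular on hardware without floating point: a dot product is integer multiplies, shifts, and adds.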
neurOS — fully neural operating system. 11 trained models running MMU (100%), TLB (99.6%), cache (99.7%), scheduler (99.2%), assembler (100%), compiler (95.2%), watchdog (100%) — zero fallback paths.
Self-compilation verified: source → neural compiler → neural assembler → neural CPU → correct results.
Timing side-channel immunity. Measured sigma=0.0000 GPU cycle variance across 270 runs of AES-128. Same code on native Apple Silicon: 47-73% CoV. No caches, no branch predictor, no speculative execution inside a dispatch. T-table timing attacks are structurally impossible.
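For readers comparing the two figures above: sigma is the standard deviation of the measured cycle counts, and CoV (coefficient of variation) is sigma divided by the mean. A quick Python sketch of the calculation follows. The sample lists are invented for illustration and are not the project's measurements.

```python
import statistics


def coefficient_of_variation(samples):
    """Population standard deviation divided by the mean (0.0 means no spread)."""
    return statistics.pstdev(samples) / statistics.fmean(samples)


# Hypothetical timing samples, in cycles.
constant_time = [1000] * 270                # every run takes exactly the same time
jittery = [900, 1100, 1000, 1300, 700]      # timing varies from run to run

cov_constant = coefficient_of_variation(constant_time)
cov_jittery = coefficient_of_variation(jittery)
print(f"{cov_constant:.1%} vs {cov_jittery:.1%}")
```

A timing attack needs the spread to correlate with secret data, so a spread of exactly zero leaves nothing to measure. A nonzero CoV alone does not prove a leak, since it may just be scheduler or cache noise.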
Just reorganized the whole project — neurOS and GPU OS now live under a clean ncpu/os/ package (neuros/ and gpu/ subpackages). 850 tests passing, all verified after the reorg.
To @andreadev — the MUL>ADD inversion is still my favorite result. To @bob1029 — you're right about branchy workloads being slow (~5K IPS neural, ~4M compute), but the GPU execution model gives security properties CPUs architecturally can't provide.
This is way cooler though! Instead of efficiently running a neural network on a CPU, I can inefficiently run my CPU on a neural network! With the work being done to make more powerful GPUs and ASICs, I bet in a few years I'll be able to run a 486 at 100MHz(!!) with power consumption just under a megawatt! The mind boggles at the sort of computations this will unlock!
Few more years and I'll even be able to realise the dream of self-hosting ChatGPT on my own neural network simulated CPU!