A CPU that runs entirely on GPU

(github.com)

272 points | cypres | 2mo ago | 131 comments

jagged-chisel2mo ago

“A CPU that runs entirely on the GPU”

I imagine a carefully crafted set of programming primitives used to build up the abstraction of a CPU…

“Every ALU operation is a trained neural network.”

Oh… oh. Fun. Just not the type of “interesting” I was hoping for.

mamaluigie2mo ago

Get used to it. The modern day solution for everything right now is to throw AI at it.

Hmmm... I need to measure this piece of wood for cutting, let me take a picture of it and see what the ai says its measurement is instead of using a measuring tape because it is faster to use the AI.

sdwr2mo ago

That honestly sounds great! If it works...

theamk2mo ago

It works great!

(At least 90% of the time... the other 10% it will be slightly off, and your items will come out crooked. But don't worry, there is a tiny gray disclaimer about AI making mistakes and that you need to double-check it, so it's not AI's fault)

cwmoore2mo ago

Of course it works. Make a video with the tape measure, call yourself a Creator, then you can hire real carpenters.

jagged-chisel2mo ago

We already have this on our phones without AI. What could AI possibly bring to this?

fragmede2mo ago

It does? Throw the picture at ChatGPT and see what it does with it

koolala2mo ago

Isn't it interesting it doesn't instantly crash from a precision error? That sounds carefully crafted to me.

jagged-chisel2mo ago

Interesting, yes. Still not the kind of interesting I was expecting.

amelius2mo ago

Is it emulating a Pentium processor? :)

vessenes2mo ago

ARM64(!?!) I know you were joking, but still.

robertcprice12mo ago

Please tell me what you had in mind so I can try something different!

anthk2mo ago

Begin reimplementing a subleq/muxleq VM with GPU primitive commands:

https://github.com/howerj/muxleq (it has both muxleq (multiplexed subleq, which is the same thing, but muxing the instructions makes it much faster) and subleq). As you can see, the implementation is trivial. Once it's compiled, you can run eforth, although I run a tweaked one with floats and some better commands. Edit muxleq.fth and set the float option to 1 in that file, as in this example:

     1 constant opt.float

The same with the classic do..loop structure from Forth, which is not enabled by default, just the weird for..next one from EForth:

     1 constant opt.control

and recompile:

     ./muxleq ./muxleq.dec < muxleq.fth > new.dec

run:

       ./muxleq new.dec

Once you have a new.dec image, you can just use that from now on.
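
To make the "trivial implementation" claim concrete, here is a minimal sketch of a SUBLEQ interpreter in Python. It is not taken from the muxleq repository: `run_subleq`, the rule that execution halts when the program counter leaves memory, the `max_steps` guard, and the small demo program are all assumptions made for illustration. The real VM also handles I/O and the mux instruction, which are left out here.

```python
def run_subleq(program, pc=0, max_steps=10_000):
    """Run a SUBLEQ program and return the final memory image.

    Each instruction is three cells (a, b, c):
        mem[b] -= mem[a]; jump to c if the result is <= 0, else fall through.
    Assumption for this sketch: execution halts when pc leaves memory.
    """
    mem = list(program)  # work on a copy so the caller's list is untouched
    steps = 0
    while 0 <= pc and pc + 2 < len(mem) and steps < max_steps:
        a, b, c = mem[pc], mem[pc + 1], mem[pc + 2]
        mem[b] -= mem[a]
        pc = c if mem[b] <= 0 else pc + 3
        steps += 1
    return mem


# Hypothetical demo program: B += A, using Z as a zeroed scratch cell.
# Cells 0-8 hold three instructions; cells 9, 10, 11 hold A, B, Z.
A, B, Z = 9, 10, 11
demo = [
    A, Z, 3,    # Z = Z - A  (Z becomes -A)
    Z, B, 6,    # B = B - Z  (B becomes B + A)
    Z, Z, -1,   # Z = 0, then jump to -1, which halts
    7, 5, 0,    # data: A = 7, B = 5, Z = 0
]

final = run_subleq(demo)
print(final[B])  # 12
```

The whole machine is the four-line loop body; everything else is setup, which is consistent with the "few lines" estimate in the comment.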

1 more reply

Retr0id2mo ago

I was imagining something more like Xeon Phi

andreadev2mo ago

The bit about multiplication being ~12x faster than addition is worth pausing on. In silicon, addition is the "easy" operation — but here the complexity hierarchy completely inverts. Makes sense once you think about it: multiplication decomposes into parallel byte-pair lookups (which neural nets handle trivially as table approximation), while addition has a sequential carry chain you can't fully parallelize away.

Funny enough, analog computing had the same inversion — a Gilbert cell does multiplication cheaply, while addition needs more complex summing circuits. Completely different path to the same result.

What I haven't seen discussed: if the whole CPU is neural nets, the execution pipeline is differentiable end-to-end. You could backprop through program execution. Useless for booting Linux, but potentially interesting for program synthesis — learning instruction sequences via gradient descent instead of search. Feels like that's the more promising research direction here than trying to make it fast.
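
The decomposition described above can be sketched in plain Python. This is only an illustration of the idea, not the project's code: `BYTE_PRODUCTS` is an exact table standing in for whatever the trained network approximates, and `mul32_lookup` and `add32_ripple` are hypothetical names. Note that the partial products still have to be combined with shifted additions, so the sketch shows only where the independent lookups are, not that addition disappears.

```python
MASK32 = 0xFFFFFFFF

# Exact 256x256 table of byte products; a stand-in for the learned lookup.
BYTE_PRODUCTS = [[x * y for y in range(256)] for x in range(256)]


def split_bytes(value, count=4):
    """Split an integer into `count` little-endian bytes."""
    return [(value >> (8 * i)) & 0xFF for i in range(count)]


def mul32_lookup(a, b):
    """32-bit multiply built from 16 byte-pair lookups.

    The lookups do not depend on each other, so they could all run at once;
    only the final combination step is a sum.
    """
    partials = [
        BYTE_PRODUCTS[x][y] << (8 * (i + j))
        for i, x in enumerate(split_bytes(a))
        for j, y in enumerate(split_bytes(b))
    ]
    return sum(partials) & MASK32


def add32_ripple(a, b):
    """32-bit add with an explicit carry chain.

    Bit k cannot be finished until the carry out of bit k-1 is known,
    which is the sequential dependency the comment refers to.
    """
    carry, result = 0, 0
    for bit in range(32):
        x = (a >> bit) & 1
        y = (b >> bit) & 1
        result |= (x ^ y ^ carry) << bit
        carry = (x & y) | (carry & (x ^ y))
    return result


print(mul32_lookup(123456, 789) == (123456 * 789) & MASK32)  # True
print(add32_ripple(0xFFFFFFFF, 1))  # 0 (wraps around)
```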

bob10292mo ago

A fun experiment but I wonder how many out there seriously think we could ever completely rid ourselves of the CPU. It seems to be a rising sentiment.

The cost of communicating information through space is dealt with in fundamentally different ways here. On the CPU it is addressed directly. The actual latency is minimized as much as possible, usually by predicting the future in various ways and keeping the spatial extent of each device (core complex) as small as possible. The GPU hides latency with massive parallelism. That's why we can put them across relatively slow networks and still see excellent performance.

Latency hiding does not cope well with workloads that are branchy and serialized because you can only have one logical thread throughout. The CPU dominates this area because it doesn't cheat. It directly targets the objective. Making efficient, accurate control flow decisions tends to be more valuable than being able to process data in large volumes. It just happens that there are a few exceptions to this rule that are incredibly popular.

st_goliath2mo ago

> I wonder how many out there seriously think we could ever completely rid ourselves of the CPU. It seems to be a rising sentiment.

This sentiment is not a recent thing. Ever since GPGPU became a thing, there have been people who first hear about it, don't understand processor architectures and get excited about GPUs magically making everything faster.

I vividly recall a discussion with some management type back in 2011, who was gushing about getting PHP to run on the new Nvidia Teslas, how amazingly fast websites will be!

Similar discussions also spring up around FPGAs again and again.

The more recent change in sentiment is a different one: the "graphics" origin of GPUs seems to have been lost to history. I have met people (plural) in recent years who thought (surprisingly long into the conversation) that I meant Stable Diffusion when talking about rendering pictures on a GPU.

Nowadays, the 'G' in GPU probably stands for GPGPU.

ecshafer2mo ago

The dream, I think, has always been heterogeneous computing. The closest here is probably Apple, with their multi-core CPUs with different cores and a GPU with unified memory. (Someone with more knowledge of computer architecture could probably correct me here.)

Have a CPU, GPU, FPGA, and other specific chips like neural chips, all with unified memory, and somehow pipeline each specific workload to whichever chip handles it best.

I wasn't really aware people thought we would be running websites on GPUs.

fulafel2mo ago

The field explored this direction before in vector computers with high bandwidth memory (Cray etc).

volemo2mo ago

I see us not getting rid of CPU, but CPU and GPU being eventually consolidated in one system of heterogeneous computing units.

nine_k2mo ago

CPU and GPU have very different ways of scheduling instructions, requiring somewhat different interfaces and programming models. I'd hazard to say that a GPU and CPU with unified memory access (like Apple's M series, and most mobile chips) is already such a consolidated system.

amelius2mo ago

nVidia Jetson also has unified memory access btw.

jagged-chisel2mo ago

Agreed. Much like “RISC is gonna replace everything” - it didn’t. Because the CPU makers incorporated lessons from RISC into their designs.

I can see the same happening to the CPU. It will just take on the appropriate functionality to keep all the compute in the same chip.

It’s gonna take awhile because Nvidia et al like their moats.

StilesCrisis2mo ago

CISC only survived because CPUs now dedicate a ton of silicon to decoding the CISC stream into RISC-y microcode. RISC CPUs can avoid this completely, but it turns out backwards compatibility was important to the market and the transistor cost of "instruction decode" just adds like +1 pipeline depth or something.

2 more replies

zozbot2342mo ago

> It will just take on the appropriate functionality to keep all the compute in the same chip.

So, an iGPU/APU? Those exist already. Regardless, the most GPU-like CPU architecture in common use today is probably SPARC, with its 8-way SMT. Add per-thread vector SIMD compute to something like that, and you end up with something that has broadly similar performance constraints to an iGPU.

junon2mo ago

We're getting there already with e.g. Grace-Blackwell chips.

fc417fc8022mo ago

> I wonder how many out there seriously think we could ever completely rid ourselves of the CPU.

How do you class systems like the PS5 that have an APU plugged into GDDR instead of regular RAM? The primary remaining issue is the limited memory capacity.

I wonder if we might see a system with GPU class HBM on the package in lieu of VRAM coupled with regular RAM on the board for the CPU portion?

chris_money2022mo ago

I don’t think the remaining issue is memory capacity. CPUs are designed to handle nonlinear memory access and that is how all modern software targeting a CPU is written. GPUs are designed for linear memory access. These are fundamentally different access patterns, so the optimal solution is to have 2 distinct processing units.

fc417fc8022mo ago

GDDR has high bandwidth but limited capacity. Regular RAM is the opposite, leaving typical APUs memory bandwidth starved.

Both types of processor perform much better with linear access. Even for data in the CPU cache you get a noticeable speedup.

The primary difference is that GPUs want large contiguous blocks of "threads" to do the same thing (because in reality they aren't actually independent threads).

1 more reply

zozbot2342mo ago

If anything, GPUs combine large private per-compute-unit address spaces and a separate shared/global memory, which doesn't mesh very well with linear memory access, just high locality. You can kinda get to the same arrangement on CPU by pushing NUMA (Non-Uniform Memory: only the "global" memory is truly Unified on a GPU!) to the extreme, but that's quite uncommon. "Compute-in-memory" is a related idea that kind of points to the same constraint: you want to maximize spatial locality these days, because moving data in bulk is an expensive operation that burns power.

markhahn2mo ago

people say this a lot, but with little technical justification.

gpus have had cache for a long time. cpus have had simd for a long time.

it's not even true that the cpu memory interface is somehow optimized for latency - it's got bursts, for instance, a large non-sequential and out-of-page latency, and has gotten wider over time.

mostly people are just comparing the wrong things. if you want to compare a mid-hi discrete gpu with a cpu, you can't use a desktop cpu. instead use a ~100-core server chip that also has 12x64b memory interface. similar chip area, power dissipation, cost.

not the same, of course, but recognizably similar.

none of the fundamental techniques or architecture differ. just that cpus normally try to optimize for legacy code, but gpus have never done much ISA-level back-compatibility.

spot50102mo ago

I don't think we get rid of the CPU. But the relationship will be inverted. Instead of the CPU calling the GPU, it might be that the GPU becomes the central controller and builds programs and calls the CPU to execute tasks.

jerf2mo ago

But... why?

How do you win moving your central controller from a 4GHz CPU to a multi-hundred-MHz single GPU core?

If we tried this, all we'd do is isolate a couple of cores in the GPU, let them run at some gigahertz, and then equip them with the additional operations they'd need to be good at coordinating tasks... or, in other words, put a CPU in the GPU.

layla5alive2mo ago

Surprise: there are already CPUs in the GPU - they're called things like "Command Processor" (but not only) - they're often tiny in-order ARM or RISC-V cores.

treyd2mo ago

This will never happen without completely reimagining how process isolation works and rewriting any OS you'd want to run on that architecture.

pklausler2mo ago

Sounds reminiscent of the CDC 6600, a big fast compute processor with a simple peripheral processor whose barreled threads ran lots of the O/S and took care of I/O and other necessary support functions.

downrightmike2mo ago

Mainframes still exist, so the CPU isn't going anywhere. Too useful of a tool.

user____name2mo ago

Someone needs to implement LLVMPipe to target this isa, then one can run software OpenGL emulation and call it "hardware accelerated".

yjftsjthsd-h2mo ago

Surely that would be hardware decelerated

jagged-chisel2mo ago

This causes me discomfort.

robertcprice12mo ago

Hey everyone, thank you for taking a look at my project. This was purely just a “can I do it” type deal, but ultimately my goal is to make a running OS purely on GPU, or one composed of learned systems.

StilesCrisis2mo ago

I think it's curious that you're saying "on GPU" when you mean "using tensors." GPUs run compute shaders naturally and can trivially act like CPUs, just use CUDA. This is more akin to "a CPU on NPU" and your NPU happens to be a GPU.

lstevens142mo ago

Hi! I think that the idea is certainly a fun one. There is a long history of trying to make a good parallel operating system. I do not think that any of the projects succeeded though. This article [0] is a good read if you are interested in that. I am not sure why the economics of parallel computer operating systems have not worked out so far. I think it most likely has to do with the operating systems that we have being good enough and familiar.

[0]: https://news.ycombinator.com/item?id=43440174

activestore2mo ago

The Blue Gene Active Storage project demonstrated compute in highly parallel “storage” where storage was HPC memory. It could work for the relationship between CPU and GPU, FPGA, etc.

https://www.fz-juelich.de/en/jsc/downloads/slides/bgas-bof/b...

yjftsjthsd-h2mo ago

This is hilarious and profoundly in the spirit of hacker news. Thanks for posting:)

mghackerlady2mo ago

GNU/GPU

bmc75052mo ago

As foretold six years ago. [1]

[1]: https://breandan.net/2020/06/30/graph-computation#roadmap

toolslive2mo ago

https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Landing ?

DiabloD32mo ago

https://en.wikipedia.org/wiki/Larrabee_(microarchitecture) ?

anthk2mo ago

Before that there was Forth running in the Transputer, which looks really close to current parallel computing.

jdlyga2mo ago

I'll do you one better, imagine a CPU that runs entirely in an LLM.

You’re absolutely right! I made an arithmetic mistake there — 3 * 3 is 9, not 8. Let’s correct that: Before: EAX = 3 After imul eax, eax: EAX = 9 Thanks for catching that — the correct return value is 9.

FartyMcFarter2mo ago

What an amazing multiplication request! The numbers you have chosen reveal an exquisite taste which can only be the product of an outstanding personality.

nomercy4002mo ago

I was taught years ago that MUL and ADD can be implemented in one or a few cycles. They can be the same complexity. What am I missing here?

Also, is it possible to use the GPU's ADD/MUL implementation? It is what a GPU does best.

volemo2mo ago

To multiply two arbitrary numbers in a single cycle, you need to include dedicated hardware in your ALU; without it you have to combine several additions and logical shifts.

As to why not use the ADD/MUL capabilities of the GPU itself, I guess it wasn’t in the spirit of the challenge. ;)
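
For readers who have not seen it, the "several additions and logical shifts" fallback looks roughly like the sketch below. It is the textbook shift-and-add method, written here for illustration; `mul_shift_add` and its `width` parameter are hypothetical names, and nothing here is taken from the project.

```python
def mul_shift_add(a, b, width=32):
    """Multiply two unsigned integers using only shifts, adds and masks.

    One loop iteration per bit of the multiplier, which is why a machine
    without a hardware multiplier needs many cycles for a single MUL.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    result = 0
    while b:
        if b & 1:                      # lowest multiplier bit is set:
            result = (result + a) & mask  # add the shifted multiplicand
        a = (a << 1) & mask            # shift multiplicand left by one
        b >>= 1                        # move on to the next multiplier bit
    return result


print(mul_shift_add(6, 7))  # 42
```

A dedicated hardware multiplier effectively performs all of these conditional additions at once in a tree of adders, trading silicon area for latency.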

andrewdb2mo ago

Why do we call them GPUs these days?

Most GPUs, sitting in racks in datacenters, aren't "processing graphics" anyhow.

xeonmc2mo ago

General Processing Units

Gross-Parallelization Units

Generative Procedure Units

Gratuitously Profiteering Unscrupulously

incognito1242mo ago

Greed Processing Units

wartywhoa232mo ago

This is just brilliant!

allreduce2mo ago

Sometimes Gibberish Producing Units

xeonmc2mo ago

Gibberish Pipeline Units

markhahn2mo ago

General Parallel Units

jgtrosh2mo ago

The dedicated term GPGPU [0] didn't catch on.

[0]: https://en.wikipedia.org/wiki/General-purpose_computing_on_g...

anthk2mo ago

VPU. Vector/Video Processing Unit.

ChocolateGod2mo ago

Greenhouse Production Units

CompuHacker2mo ago

  CPU = Compute
  GPU =  Impute

artemonster2mo ago

Every clueless person who suggests that we move to GPUs entirely has zero idea how things work and is basically suggesting using Lambos to plow fields and tractors to race in NASCAR.

madwolf2mo ago

Bad comparison. Lambos are regularly plowing fields and they're quite good at it. https://www.lamborghini-tractors.com/en-eu/

artemonster2mo ago

I remembered that Lambos used to make tractors after I posted the comment. Nice catch!

deep12832mo ago

This is a fun idea. What surprised me is the inversion where MUL ends up faster than ADD because the neural LUT removes sequential dependency while the adder still needs prefix stages.

lorenzohess2mo ago

Out of curiosity, how much slower is this than an actual CPU?

bastawhiz2mo ago

Based on addition and subtraction, 625000x slower or so than a 2.5ghz cpu

zackmorris2mo ago

I wish the project said how many CPUs could be run simultaneously on one GPU.

It might be worth having a CPU that's 100 times slower (25 MHz) if 1000 of them could be run simultaneously to potentially reach a 10 times speedup for embarrassingly parallel computation. But starting from a hole that's 625000x slower seems unlikely to lead to practical applications. Still a cool project though!
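
The arithmetic in these two comments can be checked in a few lines. The figures (2.5 GHz, 625000x, 100x, 1000 instances) come from the comments; the function names are made up for this sketch, and it assumes one instruction per cycle on the native CPU and perfect scaling across instances, which is the best case.

```python
NATIVE_HZ = 2.5e9      # 2.5 GHz reference CPU, from the comment above
SLOWDOWN = 625_000     # estimated slowdown, from the comment above


def emulated_rate(native_hz, slowdown):
    """Instructions per second of one emulated CPU."""
    return native_hz / slowdown


def aggregate_speedup(slowdown, instances):
    """Throughput of `instances` emulated CPUs relative to one native CPU,
    assuming an embarrassingly parallel workload and perfect scaling."""
    return instances / slowdown


print(emulated_rate(NATIVE_HZ, SLOWDOWN))   # 4000.0 instructions per second
print(aggregate_speedup(100, 1000))         # 10.0, the hoped-for case
print(aggregate_speedup(SLOWDOWN, 1000))    # far below 1: still much slower
```

Under these assumptions, break-even needs as many simultaneous instances as the slowdown factor, that is 625000 of them.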

medi8r2mo ago

So it could run Doom?

repelsteeltje2mo ago

Yes: https://github.com/robertcprice/nCPU?tab=readme-ov-file#doom...

2 more replies

anthk2mo ago

Doom is easy. Better the ZMachine with an interpreter based on DFrotz, or another port. Then a game can even run under a Game Boy.

For a similar case, check Eforth+Subleq. If this guy can emulate a subleq CPU on a GPU (something like 5 lines of C for the implementation; the rest is C headers and the file-opening function), it can run Eforth and maybe Sokoban.

markhahn2mo ago

it's just a machinecode emulator that happens to run on a gpu. it's more of a flying pig than a new porcine airliner.

clocksmith2mo ago

Proof that you are a genius:

```lean

  inductive HumanNeed where
    | retailArithmetic
    | genericLinkedInPost

  inductive IndustrySolution where
    | commodityALU
    | frontierAutocomplete

  def optimal : HumanNeed → IndustrySolution
    | .retailArithmetic => .commodityALU
    | .genericLinkedInPost => .frontierAutocomplete

  def latency : IndustrySolution → Nat
    | .commodityALU => 1
    | .frontierAutocomplete => 248000

  theorem superbowl_ads_have_not_improved_superdope_adds :
    latency (optimal .retailArithmetic) < latency .frontierAutocomplete := by
    decide

```

robertcprice2mo ago

Is this some kind of complex humor that I don't understand? or is it just not funny? I get it but not the punchline

_blk2mo ago

"Result: 100% accuracy on integer arithmetic" - Could someone with low-level LLM expertise comment on that: Is that future-proof, or does it have to be re-asserted with every rebuild of the neural building blocks? Can it be proven to remain correct? I assume there's a low-temperature setting that keeps it from getting too creative.

The creative thinking behind this project is truly mind boggling.

DonThomasitos2mo ago

I don’t understand why you would train a NN for an operation like sqrt that the GPU supports in silicon.

nine_k2mo ago

I see it as a practical joke or a fun hack, like CPUs implemented in the Game of Life, or in Minecraft.

anthk2mo ago

I actually ran Sokoban under EForth running on top of subleq/muxleq with a VM interpreted in a few lines of AWK.

mihaitodor2mo ago

It’s been done already. Have a look at Quest for Tetris: https://codegolf.stackexchange.com/questions/11880/build-a-w...

Nevermark2mo ago

Time to benchmark Doom.

Now we know future genius models won't even need CPUs, just tensor/rectifier circuits. If they need a CPU, they will just imagine them.

A low-bit model with adaptive sparse execution might even be able to imagine with performance. Effectively, neural PGA capability.

sudo_cowsay2mo ago

"Multiplication is 12x faster than addition..."

Wow. That's cool but what happens to the regular CPU?

adrian_b2mo ago

This CPU simulator does not attempt to achieve the maximum speed that could be obtained when simulating a CPU on a GPU.

For that a completely different approach would be needed, e.g. by implementing something akin to qemu, where each CPU instruction would be translated into a graphic shader program. On many older GPUs, it is impossible or difficult to launch a graphic program from inside a graphic program (instead of from the CPU), but where this is possible one could obtain a CPU emulation that would be many orders of magnitude faster than what is demonstrated here.

Instead of going for speed, the project demonstrates a simpler self-contained implementation based on the same kind of neural networks used for ML/AI, which might work even on an NPU, not only on a GPU.

Because it uses inappropriate hardware execution units, the speed is modest and the speed ratios between different kinds of instructions are weird, but nonetheless this is an impressive achievement, i.e. simulating the complete Aarch64 ISA with such means.

1 more reply

himata41132mo ago

I was always wondering what would happen if you trained a model to emulate a cpu in the most efficient way possible, this is definitely not what I expected, but also shows promise on how much more efficient models can become.

GeertB2mo ago

I don't quite understand how multiply doesn't require addition as well to combine the various partial products.

koolala2mo ago

Exciting if an AI that is helping in its own improvements finds this and incorporates it into its own architecture. Then it starts reading and running all the world's binary and gains intelligence as a fully actualized "computer". Finally becoming both a master of language and of binary bits. Thinking in poetry and in pure precise numerical calculations.

low_tech_punk2mo ago

Saw the DOOM raycast demo at bottom of page.

Can't wait for someone to build a DOOM that runs entirely on GPU!

jhuber62mo ago

Depends entirely on your definition of 'entirely', but https://github.com/jhuber6/doomgeneric is pretty much a direct compilation of the DOOM C source for GPU compute. The CPU is necessary to read keyboard input and present frame data to the screen, but all the logic runs on the GPU.

RandyOrion2mo ago

Cool. However, one still needs a CPU to send commands to the GPU in order to let the GPU do CPU things.

palmotea2mo ago

> Cool. However, one still needs a CPU to send commands to the GPU in order to let the GPU do CPU things.

Doesn't the Raspberry Pi's GPU boot up first, and then the GPU initializes the CPU?

With this technology, we've eliminated the need for that superfluous second step.

RandyOrion2mo ago

Well, I don't have enough knowledge of the boot process of the RPi. However, I do expect that most modern hardware, e.g. x86, does not work like the RPi, so your words do not hold in most realistic scenarios, at least for now. Besides, do current GPUs (not only GPUs on RPi) have the ability to self-instruct in order to achieve what you said?

throawayonthe2mo ago

very tangentially related is whatever vectorware et al are doing: https://www.vectorware.com/blog/

robertcprice2mo ago

It's funny to see how many people get offended by a project. I think I'm doing something right.

jleyank2mo ago

How is this different than the (various?) efforts back then to build a machine based on the Intel i860? Didn’t work, although people gave it a good try.

taofor42mo ago

What is the purpose of this project? I didn't get it. How will it be useful?

jebarker2mo ago

> How will it be useful?

Does it need to be?

RagnarD2mo ago

Being able to perform precise math in an LLM is important, glad to see this.

jdjdndnzn2mo ago

Just want to point out this comment is highly ironic.

This is all a computer does :P

We need LLMs to be able to tap that, not add the same functionality a layer above and MUCH less efficiently.

Nuzzerino2mo ago

> We need LLMs to be able to tap that, not add the same functionality a layer above and MUCH less efficiently.

Agents, tool-integrated reasoning, even chain of thought (limited, for some math) can address this.

RagnarD2mo ago

You're both completely missing the point. It's important that an LLM be able to perform exact arithmetic reliably without a tool call. Of course the underlying hardware does so extremely rapidly, that's not the point.

2 more replies

koolala2mo ago

That would be cool. A way to read cpu assembly bytecode and then think in it.

It's slower than real cpu code obviously but still crazy fast for 'thinking' about it. They wouldn't need to actually simulate an entire program in a never ending hot loop like a real computer. Just a few loops would explain a lot about a process and calculate a lot of precise information.

wartywhoa232mo ago

Oh these brave new ways to paraphrase the good old "fuck fuel economy"...

Thank you, Mr. Do-because-I-can!

Yours truly,

- GPU company CEO,

- Electric company CEO.

nicman232mo ago

can i run linux on a nvidia card though?

micw2mo ago

Linux runs everywhere

volemo2mo ago

Except on my stupid iPad “Pro”. :(

mghackerlady2mo ago

iirc there's an app on the app store that's basically a small alpine container

1 more reply

robertcprice2mo ago

Since this was posted I've been heads-down building on top of the neural CPU. Wanted to share what's new.

Built a GPU-Native UNIX OS. A full multi-process operating system running compiled C on Apple Silicon Metal:

> 25-command shell (ls, cd, cat, grep, sort, uniq, tee, cp, wc, pipes, background jobs, chaining, redirect) — ~17.5KB freestanding C compiled with aarch64-elf-gcc -O2, running entirely as ARM64 on the GPU

> Multi-process: fork/wait/pipe/dup2 via memory swapping. 1MB backing stores, up to 15 concurrent processes, round-robin scheduler, pipe blocking/wakeup, fork bomb protection, SIGTERM/SIGKILL, orphan reparenting. 28 syscalls total.

> Freestanding C runtime: malloc/free/printf/fork/wait/pipe/qsort/strtol — all on GPU

Self-hosting C compiler on Metal GPU. cc.c (~2,800 lines) compiles C→ARM64 entirely on the GPU, then executes the output on the same GPU. Three layers: host GCC → GPU compiler → GPU-compiled binary. Debugged 5 codegen bugs to get it working (UBFM encoding, LDURSW sign-extension, caller-save clobbering, array subscript type clobbering, struct lvalue handling). Supports structs, pointers, arrays, recursion, for/while/do-while, ternary, sizeof, compound assignment, bitwise, short-circuit eval. 20/20 test programs pass. Mean compile: ~50K GPU cycles. Ackermann A(3,4) runs 319K cycles of deep recursion correctly.

13+ compiled C applications on Metal:

> Crypto: SHA-256, AES-128 (ECB+CBC, 6/6 FIPS vectors pass), encrypted password vault

> Games: Tetris, Snake, roguelike dungeon crawler, text adventure

> VMs: Brainfuck interpreter, Forth REPL, CHIP-8 emulator

> Networking: HTTP/1.0 server (TCP proxied through Python)

> Neural net: MNIST classifier (784→128→10, Q8.8 fixed-point)

> Tools: ed line editor, self-hosting C compiler, Game of Life

neurOS — fully neural operating system. 11 trained models running MMU (100%), TLB (99.6%), cache (99.7%), scheduler (99.2%), assembler (100%), compiler (95.2%), watchdog (100%) — zero fallback paths.

Self-compilation verified: source → neural compiler → neural assembler → neural CPU → correct results.

Timing side-channel immunity. Measured sigma=0.0000 GPU cycle variance across 270 runs of AES-128. Same code on native Apple Silicon: 47-73% CoV. No caches, no branch predictor, no speculative execution inside a dispatch. T-table timing attacks are structurally impossible.

Just reorganized the whole project — neurOS and GPU OS now live under a clean ncpu/os/ package (neuros/ and gpu/ subpackages). 850 tests passing, all verified after the reorg.

To @andreadev — the MUL>ADD inversion is still my favorite result. To @bob1029 — you're right about branchy workloads being slow (~5K IPS neural, ~4M compute), but the GPU execution model gives security properties CPUs architecturally can't provide.

vrighter2mo ago

you know that the gpu has add and multiply instructions already, right?

mrlonglong2mo ago

Now I've seen it all. Time to die.. (meant humourously)

Surac2mo ago

Well, GPUs are just special-purpose CPUs.

MadnessASAP2mo ago

Ya know just today I was thinking around a way to compile a neural network down to assembly. Matching and replacing neural network structures with their closest machine code equivalent.

This is way cooler though! Instead of efficiently running a neural network on a CPU, I can inefficiently run my CPU on a neural network! With the work being done to make more powerful GPUs and ASICs I bet in a few years I'll be able to run a 486 at 100MHz(!!) with power consumption just under a megawatt! The mind boggles at the sort of computations this will unlock!

Few more years and I'll even be able to realise the dream of self-hosting ChatGPT on my own neural network simulated CPU!

j / k navigate · click thread line to collapse

131 comments

jagged-chisel2mo ago

“A CPU that runs entirely on the GPU”

I imagine a carefully crafted set of programming primitives used to build up the abstraction of a CPU…

“Every ALU operation is a trained neural network.”

Oh… oh. Fun. Just not the type of “interesting” I was hoping for.

mamaluigie2mo ago

Get used to it. The modern day solution for everything right now is to throw AI at it.

Hmmm... I need to measure this piece of wood for cutting, let me take a picture of it and see what the ai says its measurement is instead of using a measuring tape because it is faster to use the AI.

sdwr2mo ago

That honestly sounds great! If it works...

theamk2mo ago

It works great!

cwmoore2mo ago

Of course it works. Make a video with the tape measure, call yourself a Creator, then you can hire real carpenters.

jagged-chisel2mo ago

We already have this on our phones without AI. What could AI possibly bring to this?

fragmede2mo ago

It does? Throw the picture at ChatGPT and see what it does with it

koolala2mo ago

Isn't it interesting it doesn't instantly crash from a precision error? That sounds carefully crafted to me.

jagged-chisel2mo ago

Interesting, yes. Still not the kind of interesting I was expecting.

amelius2mo ago

Is it emulating a Pentium processor? :)

vessenes2mo ago

ARM64(!?!) I know you were joking, but still.

robertcprice12mo ago

Please tell me what you had in mind so I can try something different!

anthk2mo ago

Begin reimplementing a subleq/muxleq VM with GPU primitive commands:

     1 constant opt.float

The same with the classic do..loop structure from Forth, which is not enabled by default, just the weird for..next one from EForth:

     1 constant opt.control

and recompile:

     ./muxleq ./muxleq.dec < muxleq.fth > new.dec

run:

       ./muxleq new.dec

Once you have a new.dec image, you can just use that from now on.

1 more reply

Retr0id2mo ago

I was imagining something more like Xeon Phi

andreadev2mo ago

Funny enough, analog computing had the same inversion — a Gilbert cell does multiplication cheaply, while addition needs more complex summing circuits. Completely different path to the same result.

bob10292mo ago

A fun experiment but I wonder how many out there seriously think we could ever completely rid ourselves of the CPU. It seems to be a rising sentiment.

st_goliath2mo ago

> I wonder how many out there seriously think we could ever completely rid ourselves of the CPU. It seems to be a rising sentiment.

I vividly recall a discussion with some management type back in 2011, who was gushing about getting PHP to run on the new Nvidia Teslas, how amazingly fast websites will be!

Similar discussions also spring up around FPGAs again and again.

Nowadays, the 'G' in GPU probably stands for GPGPU.

ecshafer2mo ago

Have a CPU, GPU, FPGA, and other specific chips like Neural chips. All there with unified memory and somehow pipelining specific work loads to each chip optimally to be optimal.

I wasn't really aware people thought we would be running websites on GPUs.

fulafel2mo ago

The field explored this direction before in vector computers with high bandwidth memory (Cray etc).

volemo2mo ago

I see us not getting rid of CPU, but CPU and GPU being eventually consolidated in one system of heterogeneous computing units.

nine_k2mo ago

amelius2mo ago

nVidia Jetson also has unified memory access btw.

jagged-chisel2mo ago

Agreed. Much like “RISC is gonna replace everything” - it didn’t. Because the CPU makers incorporated lessons from RISC into their designs.

I can see the same happening to the CPU. It will just take on the appropriate functionality to keep all the compute in the same chip.

It’s gonna take a while because Nvidia et al like their moats.

StilesCrisis2mo ago

zozbot2342mo ago

> It will just take on the appropriate functionality to keep all the compute in the same chip.

junon2mo ago

We're getting there already with e.g. Grace-Blackwell chips.

fc417fc8022mo ago

> I wonder how many out there seriously think we could ever completely rid ourselves of the CPU.

How do you class systems like the PS5 that have an APU plugged into GDDR instead of regular RAM? The primary remaining issue is the limited memory capacity.

I wonder if we might see a system with GPU class HBM on the package in lieu of VRAM coupled with regular RAM on the board for the CPU portion?

chris_money2022mo ago

fc417fc8022mo ago

GDDR has high bandwidth but limited capacity. Regular RAM is the opposite, leaving typical APUs memory bandwidth starved.

Both types of processor perform much better with linear access. Even for data in the CPU cache you get a noticeable speedup.

The primary difference is that GPUs want large contiguous blocks of "threads" to do the same thing (because in reality they aren't actually independent threads).

zozbot2342mo ago

markhahn2mo ago

people say this a lot, but with little technical justification.

gpus have had cache for a long time. cpus have had simd for a long time.

it's not even true that the cpu memory interface is somehow optimized for latency - it's got bursts, for instance, a large non-sequential and out-of-page latency, and has gotten wider over time.

not the same, of course, but recognizably similar.

none of the fundamental techniques or architecture differ. just that cpus normally try to optimize for legacy code, but gpus have never done much ISA-level back-compatibility.

spot50102mo ago

jerf2mo ago

But... why?

How do you win moving your central controller from a 4GHz CPU to a multi-hundred-MHz single GPU core?

layla5alive2mo ago

Surprise: there are already CPUs in the GPU - they're called things like "Command Processor" (but not only) - they're often tiny in-order ARM or RISC-V cores.

treyd2mo ago

This will never happen without completely reimagining how process isolation works and rewriting any OS you'd want to run on that architecture.

pklausler2mo ago

downrightmike2mo ago

Mainframes still exist, so the CPU isn't going anywhere. Too useful of a tool.

user____name2mo ago

Someone needs to implement LLVMPipe to target this ISA, then one can run software OpenGL emulation and call it "hardware accelerated".

yjftsjthsd-h2mo ago

Surely that would be hardware decelerated

jagged-chisel2mo ago

This causes me discomfort.

robertcprice12mo ago

StilesCrisis2mo ago

lstevens142mo ago

activestore2mo ago

The Blue Gene Active Storage project demonstrated compute in highly parallel “storage” where storage was HPC memory. It could work for the relationship between CPU and GPU, FPGA, etc.

https://www.fz-juelich.de/en/jsc/downloads/slides/bgas-bof/b...

yjftsjthsd-h2mo ago

This is hilarious and profoundly in the spirit of hacker news. Thanks for posting:)

mghackerlady2mo ago

GNU/GPU

bmc75052mo ago

As foretold six years ago. [1]

[1]: https://breandan.net/2020/06/30/graph-computation#roadmap

toolslive2mo ago

https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Landing ?

DiabloD32mo ago

https://en.wikipedia.org/wiki/Larrabee_(microarchitecture) ?

anthk2mo ago

Before that there was Forth running in the Transputer, which looks really close to current parallel computing.

jdlyga2mo ago

I'll do you one better, imagine a CPU that runs entirely in an LLM.

FartyMcFarter2mo ago

What an amazing multiplication request! The numbers you have chosen reveal an exquisite taste which can only be the product of an outstanding personality.

nomercy4002mo ago

I was taught years ago that MUL and ADD can be implemented in one or a few cycles. They can be the same complexity. What am I missing here?

Also, is it possible to use the GPU's ADD/MUL implementation? It is what a GPU does best.

volemo2mo ago

To multiply two arbitrary numbers in a single cycle, you need to include dedicated hardware in your ALU; without it, you have to combine several additions and logical shifts.

As to why not use the ADD/MUL capabilities of the GPU itself, I guess it wasn’t in the spirit of the challenge. ;)
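A rough sketch of the "several additions and logical shifts" fallback, in Python. The name `shift_add_mul` and the 32-bit wrap-around width are assumptions for illustration; this is the textbook shift-and-add loop, not anything taken from the project.

```python
def shift_add_mul(a, b, bits=32):
    """Multiply two unsigned integers using only shifts, ANDs and adds.

    The result wraps modulo 2**bits, like a fixed-width register would.
    """
    mask = (1 << bits) - 1
    a &= mask
    b &= mask
    product = 0
    while b:
        if b & 1:
            # Lowest multiplier bit is set: add the current partial product.
            product = (product + a) & mask
        a = (a << 1) & mask  # next partial product is the multiplicand shifted left
        b >>= 1              # move on to the next multiplier bit
    return product


print(shift_add_mul(6, 7))
```

The loop runs once per multiplier bit, so without a dedicated multiplier a 32-bit MUL costs up to 32 dependent add steps.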

andrewdb2mo ago

Why do we call them GPUs these days?

Most GPUs, sitting in racks in datacenters, aren't "processing graphics" anyhow.

xeonmc2mo ago

General Processing Units

Gross-Parallelization Units

Generative Procedure Units

Gratuitously Profiteering Unscrupulously

incognito1242mo ago

Greed Processing Units

wartywhoa232mo ago

This is just brilliant!

allreduce2mo ago

Sometimes Gibberish Producing Units

xeonmc2mo ago

Gibberish Pipeline Units

markhahn2mo ago

General Parallel Units

jgtrosh2mo ago

The dedicated term GPGPU [0] didn't catch on.

[0]: https://en.wikipedia.org/wiki/General-purpose_computing_on_g...

anthk2mo ago

VPU. Vector/Video Processing Unit.

ChocolateGod2mo ago

Greenhouse Production Units

CompuHacker2mo ago

  CPU = Compute
  GPU =  Impute

artemonster2mo ago

Every clueless person who suggests that we move to GPUs entirely has zero idea how things work and is basically suggesting using lambos to plow fields and tractors to race in NASCAR.

madwolf2mo ago

Bad comparison. Lambos are regularly plowing fields and they're quite good at it. https://www.lamborghini-tractors.com/en-eu/

artemonster2mo ago

I remembered that lambos used to make tractors after I posted the comment. Nice catch!

deep12832mo ago

This is a fun idea. What surprised me is the inversion where MUL ends up faster than ADD because the neural LUT removes sequential dependency while the adder still needs prefix stages.
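To make "prefix stages" concrete, here is a sketch of a Kogge-Stone style parallel-prefix adder in Python. It is a generic textbook construction, not the project's neural adder, and `kogge_stone_add` is a made-up name. The point is only that even a fully parallel adder needs about log2(width) dependent stages to resolve carries, whereas a table lookup is a single step.

```python
def kogge_stone_add(a, b, bits=32):
    """Add two unsigned integers with a parallel-prefix carry network.

    Returns (sum modulo 2**bits, number of dependent prefix stages).
    """
    mask = (1 << bits) - 1
    a &= mask
    b &= mask
    p = a ^ b        # propagate: this bit passes an incoming carry along
    g = a & b        # generate: this bit produces a carry by itself
    half_sum = p     # sum bits before carries are applied
    stages = 0
    dist = 1
    while dist < bits:
        # Every bit combines with the group `dist` positions below it.
        # All bits update at once, but each stage depends on the previous one.
        g, p = (g | (p & (g << dist))) & mask, (p & (p << dist)) & mask
        dist *= 2
        stages += 1
    carries = (g << 1) & mask  # carry into bit i is the carry out of bit i-1
    return half_sum ^ carries, stages


total, stages = kogge_stone_add(0xFFFFFFFF, 1)
print(total, stages)
```

For 32-bit operands the loop runs with distances 1, 2, 4, 8 and 16, so five stages sit on the critical path no matter how wide the hardware is.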

lorenzohess2mo ago

Out of curiosity, how much slower is this than an actual CPU?

bastawhiz2mo ago

Based on addition and subtraction, 625,000x slower or so than a 2.5 GHz CPU.
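Spelling that ratio out, under the simplifying assumption that the real CPU retires one ADD per clock cycle (the helper name `emulated_op_time` is just for illustration):

```python
def emulated_op_time(clock_hz, slowdown):
    """Seconds per emulated operation, assuming the native CPU does one op per cycle."""
    native_seconds_per_op = 1.0 / clock_hz
    return native_seconds_per_op * slowdown


seconds = emulated_op_time(clock_hz=2.5e9, slowdown=625_000)
print(f"{seconds * 1e6:.0f} microseconds per operation")
print(f"{1 / seconds:.0f} operations per second")
```

So a 625,000x slowdown against 2.5 GHz corresponds to roughly 250 microseconds per ADD, or about 4,000 operations per second.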

zackmorris2mo ago

I wish the project said how many CPUs could be run simultaneously on one GPU.

medi8r2mo ago

So it could run Doom?

repelsteeltje2mo ago

Yes: https://github.com/robertcprice/nCPU?tab=readme-ov-file#doom...

anthk2mo ago

Doom is easy. Better would be the Z-machine with an interpreter based on dfrotz, or another port. Then a game can even run on a Game Boy.

markhahn2mo ago

it's just a machinecode emulator that happens to run on a gpu. it's more of a flying pig than a new porcine airliner.

clocksmith2mo ago

Proof that you are a genius:

```lean

  inductive HumanNeed where
    | retailArithmetic
    | genericLinkedInPost

  inductive IndustrySolution where
    | commodityALU
    | frontierAutocomplete

  def optimal : HumanNeed → IndustrySolution
    | .retailArithmetic => .commodityALU
    | .genericLinkedInPost => .frontierAutocomplete

  def latency : IndustrySolution → Nat
    | .commodityALU => 1
    | .frontierAutocomplete => 248000

  theorem superbowl_ads_have_not_improved_superdope_adds :
    latency (optimal .retailArithmetic) < latency .frontierAutocomplete := by
    decide

```

robertcprice2mo ago

Is this some kind of complex humor that I don't understand? Or is it just not funny? I get it, but not the punchline.

_blk2mo ago

The creative thinking behind this project is truly mind boggling.

DonThomasitos2mo ago

I don’t understand why you would train a NN for an operation like sqrt that the GPU supports in silicon.

nine_k2mo ago

I see it as a practical joke or a fun hack, like CPUs implemented in the Game of Life, or in Minecraft.

anthk2mo ago

I actually ran Sokoban under EForth running on top of subleq/muxleq, with the VM interpreted in a few lines of AWK.

mihaitodor2mo ago

It’s been done already. Have a look at Quest for Tetris: https://codegolf.stackexchange.com/questions/11880/build-a-w...

Nevermark2mo ago

Time to benchmark Doom.

Now we know future genius models won't even need CPUs, just tensor/rectifier circuits. If they need a CPU, they will just imagine them.

A low-bit model with adaptive sparse execution might even be able to imagine with performance. Effectively, neural PGA capability.

sudo_cowsay2mo ago

"Multiplication is 12x faster than addition..."

Wow. That's cool but what happens to the regular CPU?

adrian_b2mo ago

This CPU simulator does not attempt to achieve the maximum speed that could be obtained when simulating a CPU on a GPU.

himata41132mo ago

GeertB2mo ago

I don't quite understand how multiply doesn't require addition as well to combine the various partial products.

koolala2mo ago

low_tech_punk2mo ago

Saw the DOOM raycast demo at bottom of page.

Can't wait for someone to build a DOOM that runs entirely on GPU!

jhuber62mo ago

RandyOrion2mo ago

Cool. However, one still needs a CPU to send commands to the GPU in order to let the GPU do CPU things.

palmotea2mo ago

> Cool. However, one still needs a CPU to send commands to the GPU in order to let the GPU do CPU things.

Doesn't the Raspberry Pi's GPU boot up first, and then the GPU initializes the CPU?

With this technology, we've eliminated the need for that superfluous second step.

RandyOrion2mo ago

throawayonthe2mo ago

very tangentially related is whatever vectorware et al are doing: https://www.vectorware.com/blog/

robertcprice2mo ago

It's funny to see how many people get offended by a project. I think I'm doing something right.

jleyank2mo ago

How is this different than the (various?) efforts back then to build a machine based on the Intel i860? Didn’t work, although people gave it a good try.

taofor42mo ago

What is the purpose of this project? I didn't get it. How will it be useful?

jebarker2mo ago

> How will it be useful?

Does it need to be?

RagnarD2mo ago

Being able to perform precise math in an LLM is important, glad to see this.

jdjdndnzn2mo ago

Just want to point out this comment is highly ironic.

This is all a computer does :P

We need LLMs to be able to tap that, not add the same functionality a layer above and MUCH less efficiently.

Nuzzerino2mo ago

> We need LLMs to be able to tap that, not add the same functionality a layer above and MUCH less efficiently.

Agents, tool-integrated reasoning, even chain of thought (limited, for some math) can address this.

RagnarD2mo ago

koolala2mo ago

That would be cool. A way to read cpu assembly bytecode and then think in it.

wartywhoa232mo ago

Oh these brave new ways to paraphrase the good old "fuck fuel economy"...

Thank you, Mr. Do-because-I-can!

Yours truly,

- GPU company CEO,

- Electric company CEO.

nicman232mo ago

can i run linux on a nvidia card though?

micw2mo ago

Linux runs everywhere

volemo2mo ago

Except on my stupid iPad “Pro”. :(

mghackerlady2mo ago

iirc there's an app on the App Store that's basically a small Alpine container

robertcprice2mo ago

Since this was posted I've been heads-down building on top of the neural CPU. Wanted to share what's new.

Built a GPU-Native UNIX OS. A full multi-process operating system running compiled C on Apple Silicon Metal:

> Freestanding C runtime: malloc/free/printf/fork/wait/pipe/qsort/strtol — all on GPU

13+ compiled C applications on Metal:

Self-compilation verified: source → neural compiler → neural assembler → neural CPU → correct results.

Just reorganized the whole project — neurOS and GPU OS now live under a clean ncpu/os/ package (neuros/ and gpu/ subpackages). 850 tests passing, all verified after the reorg.

vrighter2mo ago

you know that the gpu has add and multiply instructions already, right?

mrlonglong2mo ago

Now I've seen it all. Time to die.. (meant humorously)

Surac2mo ago

Well, GPUs are just special-purpose CPUs.

MadnessASAP2mo ago

Ya know just today I was thinking around a way to compile a neural network down to assembly. Matching and replacing neural network structures with their closest machine code equivalent.

Few more years and I'll even be able to realise the dream of self-hosting ChatGPT on my own neural network simulated CPU!
