But I have to say that the comparison is not really fair. It pits a 2B model against frontier models that are likely hundreds of times larger. Also, Taalas, with their 15000 tok/s inference, is suspiciously missing from the comparison.
We need to see a comparison of this framework running useful models, which at present seems to mean ~30B.
Monokernel deep dive (GPU Engineering): http://blog.kog.ai/building-a-single-kernel-latency-optimize...
Delayed Tensor Parallelism (research): http://blog.kog.ai/delayed-tensor-parallelism-for-faster-tra...
To try the speed on the playground: http://playground.kog.ai
I have been lamenting for a while that the memory-bandwidth <-> tps relationship was pretty much working for small models on consumer cards, but not at all on datacenter hardware.
It's great to see that with proper care on the inference engine implementation the relationship can be restored.
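For anyone who wants the back-of-the-envelope version of that relationship: a rough sketch, assuming batch-size-1 decoding is purely memory-bound, so every generated token has to stream all the weights from memory once. It ignores KV cache and activation traffic, and the function name and the numbers in the example are mine, purely illustrative, not from the article.

```python
def decode_tps_upper_bound(params_billion: float,
                           bytes_per_param: float,
                           mem_bandwidth_gb_s: float) -> float:
    """Rough ceiling on batch-size-1 decode speed in tokens/s.

    Assumption: decoding is memory-bound, so each token requires
    reading every weight once. KV cache and activations are ignored.
    """
    # Total weight footprint in GB (billions of params * bytes per param).
    weight_gb = params_billion * bytes_per_param
    # Tokens per second = how many times per second the weights can be streamed.
    return mem_bandwidth_gb_s / weight_gb


# Hypothetical example: a 2B model in 16-bit weights on a 1000 GB/s device.
print(decode_tps_upper_bound(2, 2, 1000))
```

The point of the comment above is that real engines on datacenter hardware usually land well under this ceiling, so getting close to it is the interesting part.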
Feels like a preview of the future
For new open weights models, will you need to adapt model code and optimization for your inference engine by hand?
It's true that BS=1 is king when it comes to agentic workflows, however these kinds of systems serve multiple requests concurrently with dynamic batching. Do you think it will scale as well?
Any plans to release it open source?
Congrats again on the release
The demo is very impressive!
Disclaimer: I've known the founder for a while. This is as legitimate as it gets in deep tech, with real years of research and engineering behind it, not vaporware.
For instant code generation, 400-500 tok/s should be sufficient, though most frontier models give us closer to 70 tok/s.
But joking aside, I think we don't even know yet what is possible once you hit very high tokens-per-second numbers, provided the whole ecosystem behind it can handle it.
You could literally implement the same solution 100x, benchmark all of them, and keep only the best result.
You could build and architect a whole stack in parallel.
You could do massive amounts of thinking tokens / chain of thought.
You could let the LLM analyse everything around you while you type. Like it could tell you that this might create a bug in a different file and why.
We could start doing some type of Monte Carlo search with this.
I guess with a 1B or 500M model, inference would be even faster?
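The "implement it 100x and keep the best" idea is basically best-of-N sampling. A minimal sketch, assuming you have some way to generate a candidate and some way to score it. `generate` and `score` here are hypothetical placeholders; in practice they would be an LLM call and a benchmark or test run.

```python
import random
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def best_of_n(generate: Callable[[random.Random], T],
              score: Callable[[T], float],
              n: int,
              seed: int = 0) -> Tuple[float, T]:
    """Generate n candidates and return (score, candidate) for the best one."""
    rng = random.Random(seed)  # seeded so the toy run is reproducible
    candidates = [generate(rng) for _ in range(n)]
    scored = [(score(c), c) for c in candidates]
    # Higher score is better; ties resolve to the earliest candidate.
    return max(scored, key=lambda pair: pair[0])


# Toy stand-ins: a "candidate" is just a random quality value in [0, 1).
toy_generate = lambda rng: rng.random()
toy_score = lambda candidate: candidate

best_score, best_candidate = best_of_n(toy_generate, toy_score, n=100)
print(best_score)
```

At 70 tok/s this is too slow to be practical for anything interactive; at thousands of tok/s, running N candidates starts to cost about what one candidate costs today.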
each time getting 3300+ tps.
I am 100% all about using local models instead of sending someone else all my data and paying for the privilege of doing so, but this article is misleading.
I can get a 27b model to kick out 40 tok/s on 16 gb vram. This is the area ripe for development.
If you can’t connect a monitor, it isn’t a standard GPU, at least not in the way people have spoken about GPUs until a few years ago.
That means Jensen can add another 30 times faster when comparing Rubin to Blackwell without having to actually do anything.
Hopefully that means he won't have any problem making another 150 billion in profit in the next year.
Sorry for the sarcasm. Looks like interesting work.