Real-time LLM Inference on Standard GPUs: 3k tokens/s per request

(blog.kog.ai)

186 points · NicoConstant · 10h ago · 84 comments

mungoman2 9h ago

This looks very interesting. Possible to get those rates without exotic hardware.

But I have to say that the comparison is not really fair. It pits a 2B model against frontier models that are likely hundreds of times larger. Also, Taalas, with their 15,000 tok/s inference, is suspiciously missing from the comparison.

We need to see a comparison using this framework with useful models, which at present seems to mean ~30B.

gaeld 9h ago

Follow-up reading for the most technical and research-minded people here:

Monokernel deep dive (GPU Engineering): http://blog.kog.ai/building-a-single-kernel-latency-optimize...

Delayed Tensor Parallelism (research): http://blog.kog.ai/delayed-tensor-parallelism-for-faster-tra...

To try the speed on the playground: http://playground.kog.ai

stymaar 5h ago

This is very cool.

I have been lamenting for a while that the memory-bandwidth <-> tps relationship was pretty much working for small models on consumer cards, but not at all on datacenter hardware.

It's great to see that with proper care on the inference engine implementation the relationship can be restored.
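As a rough illustration of the relationship described above: at batch size 1, each generated token has to read every weight once, so memory bandwidth divided by model size gives a ceiling on tokens per second. The sketch below assumes 16-bit weights (the precision is not stated in this thread) and uses the H200's published HBM bandwidth of about 4.8 TB/s. The function name is hypothetical, and KV-cache reads and inter-GPU communication are ignored.

```python
def max_tokens_per_second(params_billion, bytes_per_param, bandwidth_gb_s, num_gpus=1):
    """Bandwidth-bound ceiling for batch=1 decoding.

    Assumes every token reads all weights exactly once and that the
    weights are split evenly across GPUs (idealized tensor parallelism).
    """
    weight_gb = params_billion * bytes_per_param  # 1e9 params * bytes/param = GB
    return bandwidth_gb_s * num_gpus / weight_gb


# Assumed inputs: 2B parameters, 2 bytes per parameter, ~4800 GB/s per H200.
single = max_tokens_per_second(2, 2, 4800)
eight = max_tokens_per_second(2, 2, 4800, num_gpus=8)
print(f"1 GPU ceiling: {single:.0f} tok/s, 8 GPU ceiling: {eight:.0f} tok/s")
```

Under these assumptions the ceilings come out to 1200 tok/s on one GPU and 9600 tok/s on eight, so a measured figure around 3k tok/s would fall between the two.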

0-bad-sectors 9h ago

When I read "Standard GPUs" in the title I got excited for a second, then I read the article itself...

blindr 2h ago

For me it's 3.4k tok/s of pure nonsense. The model is bad: you tell it it's wrong, it acknowledges it's wrong, and then it repeats the same nonsense. It reminds me of my nephew, though. Ask it something like: "I want to play the guitar on the surface of the Moon. What speakers do you suggest?" and then "But the Moon has no atmosphere, how will the sound travel?".

867-5309 9h ago

> Standard GPUs

> 8× NVIDIA H200

rashkov 6h ago

Don't miss trying their demo: https://playground.kog.ai/

Feels like a preview of the future

ilaksh 9h ago

Could be amazing, but it's hard to judge whether it will really work with, say, a 27B model or larger. We can already get pretty good speed with a 2B model.

bcjdjsndon 6h ago

H200 isn't a standard GPU at all

joefourier 3h ago

Do you think the work will still apply to speculative/alternative decoding methods like MTP and block diffusion, which are making batch=1 decoding less memory bound? Kernel launch overhead and memory transfer become less and less significant as a % of time when computing multiple tokens at once.
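To make the amortization argument concrete, here is a toy cost model. It treats each decoding step as a fixed cost (kernel launches plus one pass over the weights) and a per-token compute cost, then reports what share of the step the fixed part takes as more tokens are computed per step. All timings are made-up placeholders, the function name is hypothetical, and the additive model is a simplification (real GPUs overlap memory traffic and compute).

```python
def fixed_cost_share(launch_us=50.0, weight_read_us=250.0,
                     compute_per_token_us=10.0, tokens_per_step=1):
    """Fraction of one decoding step spent on costs that do not grow
    with the number of tokens computed in that step (toy additive model)."""
    fixed = launch_us + weight_read_us
    total = fixed + compute_per_token_us * tokens_per_step
    return fixed / total


# Hypothetical numbers only: the trend is the point, not the values.
for k in (1, 4, 8, 32):
    print(f"{k:>2} tokens/step: fixed costs are "
          f"{fixed_cost_share(tokens_per_step=k):.0%} of the step")
```

In this model the fixed share shrinks as tokens per step grows, which is the effect the question is pointing at for MTP and block-diffusion style decoding.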

CastFX 8h ago

Looks super promising! A couple of questions:

For new open-weights models, will you need to adapt model code and optimizations for your inference engine by hand?

It's true that BS=1 is king when it comes to agentic workflows; however, these kinds of systems serve multiple requests concurrently with dynamic batching. Do you think it will scale as well?

Any plans to release it open source?

Congrats again on the release

cataflam 6h ago

Congrats gaeld and team

The demo is very impressive!

Disclaimer: I've known the founder for a while. This is as legitimate as it gets in deep tech: real years of research and engineering behind this, not vaporware.

arjie 4h ago

Huh, interesting. Some parts of this do generalize even to an RTX 6000 Pro Blackwell, I imagine, though we're going to be solidly bottlenecked then on inter-card throughput through the PCIe interface.

robmccoll 8h ago

Making these claims on a 2B parameter model seems a bit like seeing linear scalability from 1 to 4 cores and then assuming 256 cores will give you a 256x speedup. Or demonstrating massive improvement on datasets that fit in cache and then assuming the same improvements will be present on problem sizes that span the memory of multiple machines. Something tells me that scaling to larger models will be more difficult than assumed.

dchftcs 4h ago

An article whose title states tokens-per-second throughput without any qualifier (e.g. the model size) should immediately be classified as spam.

freediddy 1h ago

Who cares about token speed? What is the quality of the results like? I don't know why people are so fixated on token speed, since no one cares how quickly it can spew garbage. Most reasonable people are happier waiting a bit more for accurate results.

kirtivr 9h ago

I can think of real-time video, shader generation, and real-time worldbuilding type problems that could require such high token throughput.

For instant code generation, 400-500 tok/s should be sufficient, though most frontier models give us closer to 70 tok/s.

Gomotono 9h ago

That sounds a little bit like "64kb of memory is enough", and then someone invented Electron ;P

But joking aside, I think we don't even know yet what is possible once you hit very high tokens/second numbers, provided the whole ecosystem behind it can handle it.

You could literally implement the same solution 100x, benchmark all of them, and keep only the best result.

You could build and architect a whole stack in parallel.

You could do massive thinking-token / chain-of-thought runs.

You could let the LLM analyse everything around you while you type. Like it could tell you that this might create a bug in a different file and why.

We could start doing some type of Monte Carlo search with this.
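The "implement it 100x and keep the best" idea is essentially best-of-N sampling. A minimal sketch, assuming a generator and a scoring function you would supply yourself: `fake_generate` and `fake_score` below are hypothetical stand-ins for an LLM call and a real benchmark, not part of any actual API.

```python
import random


def best_of_n(generate, score, n):
    """Generate n candidates and return the one with the highest score."""
    candidates = [generate(seed) for seed in range(n)]
    return max(candidates, key=score)


def fake_generate(seed):
    # Stand-in for an LLM call: a candidate with a made-up benchmark time.
    rng = random.Random(seed)
    return {"seed": seed, "runtime_ms": rng.uniform(10.0, 100.0)}


def fake_score(candidate):
    # Faster is better, so negate the simulated runtime.
    return -candidate["runtime_ms"]


winner = best_of_n(fake_generate, fake_score, n=100)
print(winner)
```

With fast enough generation, the benchmark step rather than the model becomes the bottleneck, so the scoring function is where most of the design effort would likely go.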

frankensteins 6h ago

I have a naive question here. The token speed is very impressive, but why is this the highlight? I would prefer to see the actual performance.

DeathArrow 3h ago

> This preview runs a 2B model

I guess with a 1B or 500M model, inference would be even faster?

paul-rohan 6h ago

I had to test it myself to believe this unreal inference speed.

Each time I got 3300+ tps.

bartkappenburg 8h ago

Is this the new gateway to a "Model On a Chip"? Is it possible to etch the weights on silicon and get a very efficient way to use an LLM?

ekianjo 7h ago

Title is pure bait. Where has "Datacenter GPU" gone?

irishcoffee 9h ago

The NVIDIA H200 is not a standard GPU. Eight of them in a box with a CPU and RAM cost close to the same as a house.

I am 100% all about using local models instead of sending someone else all my data and paying for the privilege of doing so, but this article is misleading.

I can get a 27B model to kick out 40 tok/s on 16 GB of VRAM. This is the area ripe for development.

If you can’t connect a monitor, it isn’t a standard GPU, at least not in the way people have spoken about GPUs until a few years ago.

LoganDark 9h ago

I feel the comparison to Groq is unfair. They're running much larger models (orders of magnitude) and still reaching competitive speeds.

nickphx 59m ago

cool.. so now approximations of copies of copies can be approximated and copied faster?

Hfuffzehn 7h ago

That's really nice of them.

That means Jensen can add another "30 times faster" when comparing Rubin to Blackwell without having to actually do anything.

Hopefully that means he won't have any problem making another 150 billion in profit in the next year.

Sorry for the sarcasm. Looks like interesting work.
