Am I missing something that makes the video novel?
I figured it wasn't, given that you were showcasing a GL project. But it's nonetheless disappointing, as someone curious whether the language helped in indirect ways with how you structured your project, and whether you feel you could scale it up to something closer to production-ready. That did seem to be the goal of Jai when I last looked into its development some 4 years ago.
https://old.reddit.com/r/Kha/comments/8hjupc/how_the_heck_is...
Doesn’t even have to do that, this is child’s play for a compute shader. The CPU can go take a 16 millisecond nap and let the GPU do all the work.
You really can achieve amazing stuff with plain OpenGL optimized for your rendering needs. With today's GPU acceleration capabilities we could have town-building games with huge map resolutions and millions of entities. Instead it's mostly only used to make fancy graphics.
Actually I am currently trying to build something like that [1]. A big big world with hundreds of millions of sprites is achievable and runs smoothly; video RAM is the limit. Admittedly it is not optimized to display those hundreds of millions of sprites all at once, maybe just a few million. That would be a bit too chaotic for a game anyway, I guess.
1000% agree.
I recently took it upon myself to see just how far I can push modern hardware with some very tight constraints. I've been playing around with a 100% custom 3D rasterizer which operates purely on the CPU. For reasonable scenes (<10k triangles) and resolutions (720~1080p), I have been able to push over 30fps with a single thread. On a 5950x, I was able to support over 10 clients simultaneously without any issues. The GPU in my workstation is just moving the final content to the display device via whatever means necessary. The machine generating the frames doesn't even need a graphics device installed at all...
To be clear, this is exceptionally primitive graphics capability, but there are many styles of interactive experience that do not demand 4k textures, global illumination, etc. I am also not fully extracting the capabilities of my CPU. There are many optimizations (e.g. SIMD) that could be applied to get even more uplift.
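The commenter hasn't shared their code, but the core of a CPU rasterizer like this is typically an edge-function test per pixel. Here is a minimal single-triangle sketch in Python (entirely my own construction, all names mine, not the commenter's implementation):

```python
# Edge-function rasterization: a pixel center is inside a triangle
# when it lies on the same side of all three (CCW-ordered) edges.

def edge(ax, ay, bx, by, px, py):
    # Signed area of the parallelogram (a->b, a->p):
    # positive when p is to the left of edge a->b.
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def rasterize_triangle(v0, v1, v2, width, height):
    covered = []
    # Only scan the triangle's bounding box, clamped to the framebuffer.
    min_x = max(int(min(v0[0], v1[0], v2[0])), 0)
    max_x = min(int(max(v0[0], v1[0], v2[0])), width - 1)
    min_y = max(int(min(v0[1], v1[1], v2[1])), 0)
    max_y = min(int(max(v0[1], v1[1], v2[1])), height - 1)
    for y in range(min_y, max_y + 1):
        for x in range(min_x, max_x + 1):
            px, py = x + 0.5, y + 0.5   # sample at the pixel center
            inside = (edge(*v0, *v1, px, py) >= 0 and
                      edge(*v1, *v2, px, py) >= 0 and
                      edge(*v2, *v0, px, py) >= 0)
            if inside:
                covered.append((x, y))
    return covered

pixels = rasterize_triangle((0, 0), (4, 0), (0, 4), 8, 8)
print(len(pixels))  # 10 pixel centers fall inside this half-square
```

A real renderer would add depth interpolation, clipping, and SIMD over multiple pixels at once, but the inner test stays this simple, which is why a single thread can go so far.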
One fun thing I discovered is just how low latency a pure CPU rasterizer can be compared to a full CPU-GPU pipeline. I have CPU-only user-interactive experiences that can go from input event to final output frame in under 2 milliseconds. I don't think even games like Overwatch can react to user input that quickly.
What features does your renderer support in terms of shading and texturing? Are you writing this all in a high-level language, e.g. C, or assembler? If assembler, what CPUs and features are you targeting?
And of course, why?
i'm definitely going to have to test that! always trying to minimize input delay
Thus games that ship to a schedule are hugely incentivized to favor making smaller play spaces with more authored detail, since that controls all the outcomes and reduces the technical dependencies of how scenes are authored.
There is a more philosophical reason to go in that direction too: Simulation building is essentially the art of building Plato's cave, and spending all your time on making the cave very large and the puppets extremely elaborate is a rather dubious idea.
Although, there's a few space 4x games that try this "everything is simulated" kind of approach and succeed. Allowing AI control of everything the player doesn't want to manage themselves is one nice way of dealing with it. See: https://store.steampowered.com/app/261470/Distant_Worlds_Uni...
What made it of course was the art. An army of digital illustrators working by hand to create bitmaps that pop.
One pseudo 2.5d game I'm playing now is Iridion 2 GBA (2003). You can see the care taken with the art design team, pure lovers of the genre ;)
200000 * 200 * 2 = 80M tris/sec
200000 * 200 * 32x32px = 40 Gpix/sec (if no occlusion culling)
Neither of those numbers are particularly huge for modern GPUs.
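Writing out that arithmetic (my own restatement of the numbers above, assuming two triangles and a 32x32 texel footprint per sprite):

```python
# Back-of-the-envelope throughput for 200k sprites at 200 fps.
sprites, fps = 200_000, 200

tris_per_sec = sprites * fps * 2            # two triangles per quad
pixels_per_sec = sprites * fps * 32 * 32    # no occlusion culling assumed

print(tris_per_sec)    # 80_000_000   -> 80M triangles/sec
print(pixels_per_sec)  # 40_960_000_000 -> ~41 Gpix/sec of fill rate
```

For comparison, mid-range desktop GPUs advertise fill rates well above 100 Gpix/sec, which is why neither figure is close to a hardware ceiling.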
I'd wager that a compute shader + mesh shader based version of this could hit 2M sprites at 200 fps, though at some point we'd have to argue about what counts as "cheating" - if I do a clustered occlusion query that results in my pipeline discarding an invisible batch of 128 sprites, does that still count as "rendering" them?
but yes, that's cheating, since it's impractical to work with
edit: oh they do rabbits in the video as well what a bunny coincidence
edit2: the goroutines weren't drawcalling btw, they were just moving the rabbits. The drawcalls were still made using a regular for loop, in case you were wondering.
How are you finding working with it? Have you done a similar thing in C++ to compare the results and the process of writing it?
200k at 200fps on an 8700k with a 1070 seems like a lot of rabbits. Are there similar benchmarks to compare against in other languages?
this is just a test of opengl; C++ should have exactly the same performance, considering my cpu usage is only 7% while gpu usage is 80%. but the process of writing it is infinitely better than in C++, since i never got C++ to compile a hardware accelerated bunnymark.
the only bunnymarks i'm aware of are slow https://www.reddit.com/r/Kha/comments/8hjupc/how_the_heck_is...
which is why i wrote this, to see how fast it could go.
My guess is that the rendering is not the hardest part, although it's kinda cool.
Is it faster to render two triangles with slightly less area, or one triangle with slightly more area, to draw the same sprite?
Second, modern GPUs render pixels in groups of 2x2 up to 8x8 "tiles". If even one pixel from such a group is part of a triangle, the entire group gets shaded. When two triangles form a quad, the entire area along the diagonal "seam" is shaded twice. The smaller your quads, the more overhead.
Also see https://www.saschawillems.de/blog/2016/08/13/vulkan-tutorial...
Edit: okay, surely with modern architectures there is no pixel write, thanks to some early alpha cut, but you still have to fetch the texture to decide that, so texture fetch (memory bandwidth) will bottleneck first. I guess.
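To put a rough number on that seam overhead, here's a toy count (entirely my own construction, not from the thread) of how many 2x2 pixel quads straddle the diagonal of an n x n sprite and therefore get shaded by both triangles:

```python
def seam_overdraw(n):
    # Split an n x n sprite into two triangles along the diagonal
    # (crudely: a pixel belongs to triangle A when x + y <= n - 2),
    # then count 2x2 quads containing pixels from BOTH triangles.
    doubled = 0
    for qy in range(0, n, 2):
        for qx in range(0, n, 2):
            classes = {(x + y <= n - 2)
                       for x in (qx, qx + 1) for y in (qy, qy + 1)}
            if len(classes) == 2:   # quad touched by both triangles
                doubled += 1
    total_quads = (n // 2) ** 2
    return doubled / total_quads

print(seam_overdraw(32))  # 0.0625 -> ~6% of quads shaded twice
print(seam_overdraw(8))   # 0.25  -> smaller sprites waste relatively more
```

The fraction of double-shaded quads grows as sprites shrink, which matches the point above: the smaller the quad, the worse the per-quad overhead.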
I don't think it will matter much at the 200k or 400k level. The math is probably easier on humans if you think of the sprites as rectangular (so two triangles), but you could in principle make each sprite a single triangle and texture-map a rectangular sub-area of that triangle in a shader.
The CPU work would be O(n) and the rendering/GPU work O(m*k), where n is the number of bunnies, m is the display resolution and k is the size of our bunny sprite.
The advantage of this (in real applications utterly useless[1]) method is that CPU work only increases linearly with the number of bunnies, you get to discard bunnies you don't care about really early in the process, and GPU work is constant regardless of how many bunnies you add.
It's conceptually similar to rendering voxels, except you're not tracing rays deep, but instead sweeping wide.
As long as your GPU is fine with sampling that many surrounding pixels, you're exploiting the capabilities of both your CPU and GPU quite well. The CPU work can also be parallelized: each thread operates on a subset of the bunnies and on its own texture, and only in the final step are the textures combined into one (which can also be done in parallel!). I wouldn't be surprised if modern CPUs could handle millions of bunnies while modern GPUs would just shrug, as long as the sprite is small.
[1] In reality you don't have sprites at constant sizes and also this method can't properly deal with transparency of any kind. The size of your sprites will be directly limited by how many surrounding pixels your shader looks up during rendering, even if you add support for multiple sprites/sprite sizes using other channels on your textures.
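The two-pass idea above can be sketched end to end. This is my own toy restatement (all names hypothetical), simulating both the O(n) CPU splat and what the per-pixel "shader" pass would do:

```python
# Pass 1 (CPU, O(n)): mark each bunny's top-left corner in a grid.
# Pass 2 ("GPU", O(m * k)): every output pixel scans the K x K
# neighborhood of marks that could cover it.

K = 4  # sprite is K x K pixels; also the shader's lookup radius

def splat(bunnies, width, height):
    marks = [[0] * width for _ in range(height)]
    for (bx, by) in bunnies:                 # O(n) CPU pass
        marks[by][bx] = 1
    frame = [[0] * width for _ in range(height)]
    for y in range(height):                  # O(m * K^2) "shader" pass
        for x in range(width):
            for dy in range(K):
                for dx in range(K):
                    my, mx = y - dy, x - dx
                    if 0 <= mx < width and 0 <= my < height and marks[my][mx]:
                        frame[y][x] = 1      # some sprite covers this pixel
    return frame

frame = splat([(0, 0), (10, 10)], 16, 16)
print(sum(map(sum, frame)))  # 2 sprites * 4x4 pixels = 32 covered pixels
```

Note how the footnote's limitations show up directly: the mark grid can hold only one sprite per cell, there's no alpha blending, and the sprite size is baked into the shader's lookup radius K.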
i got 200k sprites at 200fps on a 1070 (while recording). i'm not sure anyone could survive that many vampires
Do you have the code somewhere? I would like to see how it's made.
Example (not mine): https://www.shadertoy.com/view/tlB3zK
Curious how you are passing the data to the GPU - are you having a single dynamic vertex buffer that is uploaded each frame?
Is the vertex data a single position and the GPU is generating the quad from this?
for this bunnymark i have 1 VBO containing my 200k bunnies array (just positions), and 1 VBO containing just the 6 verts required to render a quad. turns out the VAO can just read from both of them like that. the processing is all on the CPU and just overwrites the bunnies VBO each frame
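That two-VBO setup is classic instanced rendering: the quad VBO advances per vertex, the bunny VBO advances per instance (attribute divisor 1), and the vertex shader adds them. A conceptual Python sketch of what the GPU effectively computes per instanced draw (my own illustration, not the commenter's code):

```python
# Two CCW triangles forming a unit quad -- the shared 6-vert VBO.
QUAD = [(0, 0), (1, 0), (1, 1),
        (0, 0), (1, 1), (0, 1)]

def expand_instances(bunny_positions, size):
    # Emits the 6 final vertices per bunny, mimicking
    # glDrawArraysInstanced(GL_TRIANGLES, 0, 6, len(bunny_positions)):
    # the outer loop is the instance attribute (divisor 1), the inner
    # loop is the shared quad attribute (divisor 0).
    out = []
    for (bx, by) in bunny_positions:
        for (qx, qy) in QUAD:
            out.append((bx + qx * size, by + qy * size))
    return out

verts = expand_instances([(0, 0), (100, 50)], 32)
print(len(verts))   # 2 bunnies * 6 verts = 12
print(verts[6])     # first vert of the second bunny: (100, 50)
```

On the GPU this expansion never materializes in memory; only the small per-instance position buffer needs re-uploading each frame, which is why CPU usage stays so low.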
Using some slight shader/buffer trickery, and depending on what you're trying to do (as is always the case with games & rendering at this scale), you can easily get multiples of that -- and still stay >100FPS.
I agree, more of this approach is great. And I am totally flabbergasted at how abysmally poor the performance is with SpriteRenderer, Unity's built-in sprite rendering component.
That said, it's doable to get relatively high-performance with existing engines -- and the benefits they come with -- even if you can definitely, easily even, do better by "going direct".