> Do you work on these kind of optimisations for a modern OS
I have worked on modern display servers and application interfaces, and as such also dealt with (but not written) a fair amount of client application render code, generally optimizing for power consumption and latency.
> I honestly couldn't imagine this kind of compositing not happening completely on the GPU or requiring any back and forth between the CPU and GPU.
Well, the CPU is always involved, and is even responsible for rendering certain assets that aren't viable to render on the GPU. Fonts in particular are usually rasterized on the CPU, with occasional attempts at GPU rendering such as Servo's Pathfinder. But for simplicity, let's talk only about widgets rendered by the GPU.
In most cases[^1], a window renders to a single "render target" (texture/buffer) and hands a flat, opaque buffer to the display server, which will either try to display it directly (direct scanout) or composite it with other windows. In this context, the display server's main purpose is to have exclusive control over the display hardware; apart from compositing multiple windows together, it is not involved in the rendering process.
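As a toy sketch of that per-frame decision (all names here are hypothetical, not any real compositor's API): the server can only hand a client buffer straight to the display engine when it is the lone visible surface and the hardware can actually read it, otherwise it falls back to its own composite pass.

```python
from dataclasses import dataclass

@dataclass
class WindowBuffer:
    fullscreen: bool       # covers the whole output
    opaque: bool           # no blending with content behind it
    scanout_capable: bool  # format/modifier the display engine accepts

def present(windows):
    # Direct scanout: a single window the hardware can read as-is,
    # skipping the server's own composite render pass entirely.
    if len(windows) == 1:
        w = windows[0]
        if w.fullscreen and w.opaque and w.scanout_capable:
            return "direct scanout"
    # Otherwise the server textures from every window buffer in its
    # own render pass and scans out the result.
    return "composite"
```

Real compositors check far more than this (damage, transforms, color pipelines), but the shape of the decision is the same.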
The application itself, when rendering, will normally walk its widget tree, accumulate damage, and cull widgets to create some form of render list. Depending on the toolkit and graphics API in question, you'll ultimately submit some amount of GPU work (e.g., a render pass) to render your window buffer (an IOSurface or DMA-BUF), and then send that buffer to the display server one way or another. The window buffer only becomes ready later, once the asynchronous render tasks complete; the display server will wait on the relevant fences (on the CPU side, which is responsible for most GPU scheduling) before starting any render task that textures from that buffer, or before attempting scanout from it.
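The damage-and-cull step above can be sketched in a few lines (hypothetical structures, widgets flattened to `(name, rect)` pairs): only widgets whose rects intersect a dirty region make it onto this frame's render list.

```python
def intersects(a, b):
    # Rects are (x0, y0, x1, y1); overlap requires both axes to overlap.
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1

def build_render_list(widgets, damage):
    # widgets: list of (name, rect) in paint order
    # damage: list of dirty rects accumulated since the last frame
    render_list = []
    for name, rect in widgets:
        if any(intersects(rect, d) for d in damage):
            render_list.append(name)
    return render_list
```

A real toolkit walks an actual tree and tracks per-widget state, but the output is the same idea: a minimal ordered list of draw commands to record into a render pass.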
The problem with blur is that you have a render task that depends on the full completion of all prior render tasks: its shader must read the output buffer[^2] as it exists in an intermediate state, after everything behind the blur has rendered. Other render steps in turn depend on the blur task, since content has to be overlaid on top of the blurred widget, and only once that completes is the buffer ready for the display server. That's a pipeline stall, and because it sits on top of the primary content it holds up every frame from that app; worse, because of the blur kernel's sampling radius, an update that previously affected one tile now affects several.
Reading your own output is something you avoid like the plague, and blur is exactly that. If you're used to web/network development, think of it as a blocking network round trip.
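The stall can be made concrete with a toy dependency model (task names are made up for illustration): without blur, the passes are independent and can overlap; with blur, the blur pass waits on everything rendered so far, and the overlay waits on the blur, so the critical path of sequential GPU waits grows.

```python
def critical_path(deps, task):
    # deps maps a task to the tasks it must wait on; the critical path
    # is the longest chain of sequential completions ending at `task`.
    if not deps.get(task):
        return 1
    return 1 + max(critical_path(deps, d) for d in deps[task])

# Without blur: background, content and overlay can render in parallel.
no_blur = {"background": [], "content": [], "overlay": []}

# With blur: the blur pass samples the intermediate buffer, so it waits
# on all prior passes, and the overlay waits on the blur.
with_blur = {
    "background": [],
    "content": [],
    "blur": ["background", "content"],
    "overlay": ["blur"],
}

print(max(critical_path(no_blur, t) for t in no_blur))      # 1: fully parallel
print(max(critical_path(with_blur, t) for t in with_blur))  # 3: serialized
```

The numbers are abstract "waits", not milliseconds, but they show why one read-your-own-output pass lengthens every frame rather than just adding one more task.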
... well this turned out to be a wall of text ...
---
^1: The more advanced case is hardware plane compositing, where you send a small number of buffers to the display server (e.g. a video buffer, an underlay with some background, and an overlay with some controls) and have the server configure hardware planes so the window is stitched together as the signal is about to be sent to the display. This is not the general case: planes are limited in count and capability, they cannot perform effects beyond basic transforms and blending, and scanout hardware is very picky about what it can use as input.
^2: One could instead build a separate render list covering just the area the blur needs to sample, in the hope that it renders much faster and avoids waiting on completion of the primary buffer. But that would be an app-specific optimization with a lot of limitations, and it may end up much slower in many scenarios.