"By executing this command, you are effectively replacing the first byte of the `NtUserSetLayeredWindowAttributes` function with a `ret` instruction. This means that any call to `NtUserSetLayeredWindowAttributes` will immediately return without executing any of its original code. This can be used to bypass or disable the functionality of this function"
(Thanks to GitHub Copilot for that)
Also see https://learn.microsoft.com/en-us/windows-hardware/drivers/d...
- eb[0] "enters bytes" into memory at the specified location;
- The RETN[1] instruction is encoded as C3 in x86 opcodes; and
- Debuggers will typically load debug symbols (PDB on Windows, rather than ELF) so you can refer to memory locations by name, i.e. a function name resolves to its entry point.
Putting those three together, we almost get the author's command. I'm not sure about the "win32u!NtUser" name prefix, though. Is it name-munging performed on the compiler side? Maybe some debugger syntax thrown in to select the dll source of the name?
[0]:https://learn.microsoft.com/en-us/windows-hardware/drivers/d...
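Putting those pieces together, the patch would presumably look like this WinDbg command (a sketch, not necessarily the author's exact invocation; `eb` writes the single byte 0xC3 at the symbol's address):

```
eb win32u!NtUserSetLayeredWindowAttributes c3
```

As for the prefix: `module!symbol` is standard WinDbg syntax for selecting which loaded module a name resolves in, rather than compiler-side name mangling.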
And if you are wondering what the difference between win32u.dll and user32.dll is:
> win32u.dll is a link for System calls between User mode (Ring 3) and Kernel mode (Ring 0) : Ring 3 => Ring 0 https://imgbb.com/L8FTP2C [0]
[0] - https://learn.microsoft.com/en-us/answers/questions/213495/w...
In fact, keeping something preloaded and ready to go is quite common; these two examples are off the top of my head:
- The Emacs server way - https://ungleich.ch/u/blog/emacs-server-the-smart-way/
- SSH connection reuse.
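For the SSH case, connection reuse is typically a few lines of ControlMaster config in `~/.ssh/config` (a sketch; the socket path and the 10-minute persistence window are arbitrary choices):

```
Host *
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h:%p
    ControlPersist 10m
```

With this, the first `ssh` to a host opens a master connection and subsequent ones multiplex over it, skipping the TCP and key-exchange handshakes.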
So if anything lags by up to that amount, our brain will compensate and make it feel instantaneous.
There was an interesting experiment that I reproduced at university: create an app that slowly builds up a delay on clicks, letting the brain adapt, and then remove the delay completely. The result is that the app feels like it reacts just before you actually click, until the brain adapts again to the new timing.
If the delay is long enough, the output does not just feel delayed, but entirely unrelated to the input.
A latency perception test involving a switch can easily be thrown off by a disconnect between the actual point of actuation and the end-user's perceived point of actuation. For example, the user might feel, especially if exposed to high system latency, that the switch actuates only after the button has physically bottomed out and been squeezed with increased force, as if they were trying to mechanically force the action, and later be surprised to realize that the actuation point was at less than half the key travel once the virtual latency is removed.
Without knowing the details of the experiment, I think this is a more likely explanation for a perception of negative latency: Not intuitively understanding the input trigger.
The journey was very useful, even if the destination may be pretty specific to your needs. The process of debugging minor annoyances like this is really hard to learn.
My very unscientific methodology was to run
$ echo hello && foot
in a terminal and measure the time between the "hello" text appearing and the new window appearing. Looking at my video, the time from physical key press to the "hello" text appearing might be 20ish ms, but that is less clear, so about 100 ms total from key press to shell prompt. This is a pretty much completely untuned setup; I haven't done any tweaks to improve the figures. Enabling foot's server mode might shave off some milliseconds, but tbh I don't feel that's necessary.
It'd be fun to do this with a better camera, and also with better monitors. Idk how difficult it would be to mod an LED into the keyboard to capture the exact moment the key is activated; just trying to eyeball the key movement is not very precise.
But you also have to account for the fact that wf-recorder might interfere with the results: capturing the screen is not free, and it might even push some part of the pipeline onto less optimal paths. With a video camera you can be fairly confident that the measurement isn't interfering with anything.
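If you're counting frames in a recording, converting a frame delta to milliseconds is a one-liner (a trivial helper; the 240 fps figure below is just an example frame rate, not the camera actually used):

```shell
# latency in ms = frame_delta / fps * 1000
frames_to_ms() { awk -v f="$1" -v r="$2" 'BEGIN { printf "%.1f", f / r * 1000 }'; }
frames_to_ms 24 240   # -> 100.0 (24 frames at 240 fps is 100 ms)
```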
However I was interested in knowing whether it does for the author.
Assuming the author does suffer this 1300 ms delay "hundreds" of times a day (let's say 200), and for the sake of argument they use their computer 300 days a year and have 20 years of such work ahead of them with this config, then this inefficiency totals 1300 x 200 x 300 x 20 / 1000 / 60 / 60 hours over the author's lifetime: some 430 hours.
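That back-of-the-envelope figure checks out (shell integer arithmetic truncates, hence 433 rather than 433.3):

```shell
# 1300 ms x 200 times/day x 300 days/year x 20 years, converted to hours
total_ms=$((1300 * 200 * 300 * 20))
echo $((total_ms / 1000 / 60 / 60))   # 433
```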
So well worth the effort to fix!
When I used to use Windows 10+ years ago, I had decent luck using xming + cygwin + Cygwin/X + bblean to run xterm in a minimal latency/lag environment.
I also launch Chrome/Spotify/Slack desktop using:
$ open -a Google\ Chrome --args --disable-gpu-vsync --disable-smooth-scrolling
Also if you are using a miniLED M-class MBP, its pixel response is abysmal.
Too bad vscode doesn't support higher refresh rates. It's locked to 60 for some reason I haven't been able to grasp.
Though I've used the Apple Magic Keyboard w/ Touch ID exclusively for a while, I'm also thinking about upgrading to the new Wooting 80HE keyboard this fall, since it has an 8 kHz polling rate, analog Hall effect switches, and is designed to be ultra low latency w/ tachyon mode enabled.
[0]: https://support.apple.com/guide/mac-help/use-adaptive-sync-w...
Anyway, this also made me think about the general bloat we have in new OSes and programs. I'm still on an old OS running spinning rust, and bash here starts instantly when the cache is hot. I think GUI designers lost the engineer's touch...
Great debugging work to come up with a solution!
$ hyperfine 'alacritty -e true'
Benchmark 1: alacritty -e true
Time (mean ± σ): 84.1 ms ± 4.9 ms [User: 40.1 ms, System: 30.8 ms]
Range (min … max): 80.5 ms … 104.4 ms 32 runs
$ hyperfine 'xterm -e true'
Benchmark 1: xterm -e true
Time (mean ± σ): 81.9 ms ± 2.6 ms [User: 21.7 ms, System: 7.9 ms]
Range (min … max): 74.9 ms … 87.1 ms 37 runs
$ hyperfine 'wezterm -e true'
Benchmark 1: wezterm -e true
Time (mean ± σ): 211.7 ms ± 13.4 ms [User: 41.4 ms, System: 60.0 ms]
Range (min … max): 190.5 ms … 240.5 ms 15 runs

$ hyperfine -L arg '1,2,3' 'sleep {arg}'
…
Summary
sleep 1 ran
2.00 ± 0.00 times faster than sleep 2
3.00 ± 0.00 times faster than sleep 3
If your commands don't share enough in common for that approach then you can declare them individually, as in "hyperfine 'blib 1' 'blob x y' 'blub --arg'", and still get the summary.

But I have a very different solution to this problem: have just one terminal window and use and abuse `tmux`. I only use new windows (or tabs, if the terminal app has those) to run `ssh` to targets where I use `tmux`. I even nest `tmux` sessions, so essentially I've two levels of `tmux` sessions, and I title each window in the top-level session to match the name of the session running in that window -- this helps me find things very quickly. I also title windows running `vi` after the `basename` of the file being edited. Add in a simple PID-to-tmux window resolver script, scripts for utilities like `cscope` to open new windows, and this gets very comfortable, and it's fast. I even have a script that launches this whole setup should I need to reboot. Opening a new `tmux` window is very snappy!
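A minimal sketch of the window-titling trick, assuming tmux with its allow-rename option enabled (the file path is just an example, not from the original setup):

```shell
# emit the screen/tmux escape sequence that renames the current window
set_tmux_title() { printf '\033k%s\033\\' "$1"; }

# title the window after the basename of the file being edited
set_tmux_title "$(basename /home/me/src/editor.c)"   # window becomes "editor.c"
```

A shell hook (or a `vi` wrapper) can call this before launching the editor, and again to restore the title afterwards.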
Even 80ms seems unnecessarily slow to me. 300ms would drive me nuts ...
I'm using a tiling window manager (dwm) and interestingly the spawning time varies depending on the position that the terminal window has to be rendered to.
The fastest startup time I get on the fullscreen tiling mode.
hyperfine 'st -e true'
Benchmark 1: st -e true
Time (mean ± σ): 35.7 ms ± 10.0 ms [User: 15.4 ms, System: 4.8 ms]
Range (min … max): 17.2 ms … 78.7 ms 123 runs
The non-fullscreen one ends up at about 60 ms, which still seems reasonable.

To prove it to myself: I'm using river² and I can see a doubling-ish of startup time with foot³, iff I allow windows from heavier apps to handle the resize event immediately. If the time was a little longer (or more common) I'd be tempted to wrap the spawn along the lines of "kill -STOP <other_clients_in_tag>; <spawn & hold for map>; kill -CONT <other_clients_in_tag>" to delay the resize events until my new window was ready. That way the frames still resize, but their content resize is delayed.
¹ https://tools.suckless.org/tabbed/
Benchmark 1: st -e true
Time (mean ± σ): 35.4 ms ± 6.9 ms [User: 15.1 ms, System: 3.8 ms]
Range (min … max): 24.2 ms … 65.2 ms 114 runs
This is on awesome-wm with the window opening as the 3rd tiled window on a monitor, which means it has to redraw at least the other two windows. I'm also running xfs on top of luks/dm-crypt for my filesystem, which shouldn't matter too much on this benchmark thanks to the page cache, but is a relatively common source of performance woes on this particular system. I really ought to migrate back to unencrypted ext4 and use my SSD's encryption but I haven't wanted to muck with it.