io_uring, libaio performance across Linux kernels and an unexpected IOMMU trap

(blog.ydb.tech)

65 points · tanelpoder · 1d ago · 16 comments

tanelpoder OP 1d ago

I understand that it's the interrupt-based I/O completion workloads that suffered from IOMMU overhead in your tests?

IOMMU may induce some interrupt remapping latency, I'd be interested in seeing:

1) interrupt counts (normalized to IOPS) from /proc/interrupts

2) "hardirqs -d" (bcc-tools) output for IRQ handling latency histograms

3) perf record -g output to see if something inside interrupt handling codepath takes longer (on bare metal you can see inside hardirq handler code too)
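For item 1, a minimal Python sketch of the normalization, assuming you snapshot /proc/interrupts before and after a run and take the completed I/O count from the benchmark output. The function names and the `nvme` substring match are illustrative assumptions, not something from the post:

```python
def parse_interrupts(text: str) -> dict[str, int]:
    """Sum the per-CPU counters for every row of /proc/interrupts text."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals: dict[str, int] = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts: list[int] = []
        # Leading numeric fields are per-CPU counts; the rest is the description.
        for field in fields:
            if len(counts) < ncpu and field.isdigit():
                counts.append(int(field))
            else:
                break
        desc = " ".join(fields[len(counts):])
        totals[f"{label.strip()} {desc}".strip()] = sum(counts)
    return totals


def irqs_per_io(before: str, after: str, ios_completed: int,
                match: str = "nvme") -> float:
    """Interrupts raised per completed I/O for rows whose name contains `match`."""
    start, end = parse_interrupts(before), parse_interrupts(after)
    delta = sum(count - start.get(name, 0)
                for name, count in end.items() if match in name)
    return delta / ios_completed
```

Usage would be to read /proc/interrupts once before and once after the fio run and pass both strings in. A ratio well below 1 would hint at coalescing or polling, but that interpretation is a guess until it is measured.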

It would be interesting to see whether, with IOMMU enabled, handling each interrupt takes longer on the CPU, or whether the handling time is roughly the same but interrupt delivery takes longer. There may be some interrupt coalescing going on as well (I don't know exactly what else gets enabled with IOMMU).

Since interrupts are raised "randomly", independently from whatever your app/kernel code is running on CPUs, it's a bit harder to visualize total interrupt overhead in something like flamegraphs, as the interrupt activity is all over the place in the chart. I used flamegraph search/highlight feature to visually identify how much time the interrupt detours took during stress test execution.

Example here (scroll down a little):

https://tanelpoder.com/posts/linux-hiding-interrupt-cpu-usag...

eivanov89 1d ago

BTW, the whole situation with IRQ accounting being disabled reminds me of the -fomit-frame-pointer case. For a long time there was no practical performance reason for it, but the option kept being used, making it slower and harder to build stacks both for perf analysis and for stack unwinding in languages like C++.

After a careful read, I'm surprised that such small IRQ squares add up to 30%. I should search for interrupts the next time I inspect our flamegraphs.

tanelpoder OP 1d ago

I was doing over 11M IOPS during that test ;-)

Edit: I wrote about that setup and other Linux/PCIe root complex topology issues I hit back in 2021:

https://news.ycombinator.com/item?id=25956670


eivanov89 1d ago

Unfortunately, we don't have proper measurements for IOPOLL mode with and without IOMMU, because initially we didn't configure IOPOLL properly. However, I bet that this mode will be affected as well, because the disk's writes still have to go through the IOMMU.

You suggest some very interesting measurements. I will keep them in mind and try them in the next experiments. I wish I had read this earlier so I could have applied it during the past runs :)

tanelpoder OP 1d ago

Yeah you'd still have the IOMMU DMA translation, but would avoid the interrupt overhead...

eivanov89 1d ago

Dear folks, I'm the author of that post.

A short summary below.

We ran fio benchmarks comparing libaio and io_uring across kernels (5.4 -> 7.0-rc3). The most surprising part wasn’t the io_uring gains (~2x), but a ~30% regression caused by IOMMU becoming enabled by default between releases.

Happy to share more details about setup or reproduce results.
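For anyone who wants to try a comparison of this shape, here is a rough Python sketch that builds fio command lines for a 4K random-write job on either engine. It is not the job file from the post: the queue depth, runtime, job name and target path are assumptions picked for illustration. Only point it at a file or device you can afford to overwrite.

```python
def fio_randwrite_args(engine: str, filename: str, iodepth: int = 32,
                       runtime_s: int = 60, polled: bool = False) -> list[str]:
    """Build a fio argv for a 4K random-write job on the given I/O engine."""
    if engine not in ("libaio", "io_uring"):
        raise ValueError(f"unsupported engine: {engine}")
    args = [
        "fio",
        f"--name=randwrite-{engine}",   # hypothetical job name
        f"--filename={filename}",       # WARNING: contents get overwritten
        f"--ioengine={engine}",
        "--rw=randwrite",
        "--bs=4k",
        "--direct=1",                   # bypass the page cache
        f"--iodepth={iodepth}",         # assumed value, tune for your device
        "--time_based",
        f"--runtime={runtime_s}",
        "--group_reporting",
    ]
    if polled:
        if engine != "io_uring":
            raise ValueError("polled completions are only sketched for io_uring")
        args.append("--hipri")          # ask fio for polled I/O completions
    return args
```

Running the same list once with `libaio` and once with `io_uring`, for example via `subprocess.run(fio_randwrite_args("io_uring", path), check=True)`, keeps everything except the engine identical between runs.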

jcalvinowens 1d ago

Thanks for sharing this.

Was the iommu using strict or lazy invalidation? I think lazy is the default but I'm not sure how long that's been true.

eivanov89 1d ago

We compared IOMMU fully disabled vs enabled. When it is enabled, I expect it to be lazy (that should be the IOMMU default). Note that we recommend using passthrough to completely bypass translation for most devices, independent of strict/lazy mode.
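As a quick way to see what a given machine was booted with, here is a small Python sketch that pulls the IOMMU-related parameters out of the kernel command line. It only reports what was passed explicitly; if nothing is set, the kernel's built-in default applies and this code cannot tell which one that is. The helper names are made up for illustration.

```python
IOMMU_KEYS = ("iommu", "iommu.passthrough", "iommu.strict",
              "intel_iommu", "amd_iommu")


def iommu_params(cmdline: str) -> dict[str, str]:
    """Return the IOMMU-related key=value pairs found on a kernel command line."""
    found: dict[str, str] = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        if key in IOMMU_KEYS:
            found[key] = value
    return found


def requests_passthrough(params: dict[str, str]) -> bool:
    """True if passthrough was explicitly requested (iommu=pt or iommu.passthrough=1)."""
    return ("pt" in params.get("iommu", "").split(",")
            or params.get("iommu.passthrough") == "1")
```

On a live system the input would come from `open("/proc/cmdline").read()`.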

hcpp 1d ago

Why was 4K random write chosen as the main workload, and would the conclusion change with sequential I/O?

eivanov89 1d ago

That's a popular DBMS pattern. We chose writes over reads because on many NVMe devices writes are faster, which makes it easier to measure software latency.

I guess that with sequential I/O the result would be similar. However, with larger blocks and fewer IOPS the difference might be smaller.

menaerus 1d ago

So perhaps a mixed read+write workload would be more interesting, no? Write-only is characteristic of ingestion workloads. That said, libaio vs io_uring difference is interesting. Did you perhaps run a perf profile to understand where the differences are coming from? My gut feeling is that it is not necessarily an artifact of less context-switching with io_uring but something else.


skavi 1d ago

what was the security situation of whatever is now being protected by the IOMMU before it was enabled by default?

eivanov89 1d ago

When IOMMU is not enabled, any PCIe device capable of DMA can access arbitrary physical memory. That allows it to read any sensitive data, modify memory, and fully compromise the system without CPU involvement.

There are many DMA-based attacks described in the literature. Even with IOMMU, some attacks are still possible due to misconfiguration or incomplete isolation. For example: https://www.repository.cam.ac.uk/items/13dcaac4-5a3d-4f67-82...

In our case, we didn’t dive deeply into the security aspects. Our typical deployment assumes a trusted environment where YDB runs on dedicated hardware, so performance considerations tend to dominate.
