undefined | Better HN

0 pointsAsm2D21d ago0 comments

Because pg_jitter uses AsmJit's Compiler, which also allocates registers. That's much more work than using hardcoded physical registers in SLJIT case. There is always a cost of such comfort.

I think AsmJit's strength is completeness of its backends as you can emit nice SIMD code with it (like AVX-512). But the performance could be better of course, and that's possible - making it 2x faster would be possible.

0 comments

vladich21d ago

There are other issues with that auto-allocation. I tested all 3 backends on very large queries (hundreds of KBs) per query. Performance of all of them (+LLVM, but -sljit) was abysmal - the compiler overhead was in seconds to tens(!) of seconds. They have some non-linear components in their optimization algorithms. While sljit was scaling linearly and almost as fast as for smaller queries. So yes, it gives higher run-time performance but the cost of that performance grows non-linearly with code size and complexity. While you still can have good performance with manual allocations. I also don't believe you can make AsmJit 2x faster without sacrificing that auto-allocation algorithm.

Asm2DOP21d ago

AsmJit has only one place where a lot of time is spent - bin-packing. It's the least optimized part, which has quadratic complexity (at the moment), which starts to show when you have like hundreds of thousands of virtual registers. There is even a benchmark in AsmJit called `asmjit_bench_regalloc`, which shows that a single function that has 16MB alone, with 65k labels and 200k virtual registers takes 2.2 seconds to generate (and 40ms of that is time to just call `emit()`).

If this function is optimized, or switched to some other implementation when there is tens of thousands of virtual registers, you would get orders of magnitude faster compilation.

But realistically, which query requires tens of megabytes of machine code? These are pathological cases. For example we are talking about 25ms when it comes to a single function having 1MB of machine code, and sub-ms time when you generate tens of KB of machine code.

So from my perspective the ability to generate SIMD code that the CPU would execute fast in inner loops is much more valuable than anything else. Any workload, which is CPU-bound just deserves this. The question is how much the CPU bound the workload is. I would imagine databases like postgres would be more memory-bound if you are processing huge rows and accessing only a very tiny part of each row - that's why columnar databases are so popular, but of course they have different problems.

I worked on one project, which tried to deal with this by using buckets and hashing in a way that there would be 16 buckets, and each column would get into one of these, to make the columns closer to each other, so the query engine needs to load only buckets used in the query. But we are talking about gigabytes of RAW throughput per core in this case.

vladich21d ago

I have a test of 200Kb query that AsmJit takes 7 seconds to compile (that's not too bad both LLVM and MIR take ~20s), while sljit does it in 50ms. 200Kb is a pathological case, but it's not unheard of in the area I'm working on. It's realistic, although a rare case. Last 10-15 years most OLTP workloads became CPU bound, because active datasets of most real databases fully fit in memory. There are exceptions, of course.

1 more reply

vladich21d ago

SLJIT is a bit smarter than just to use hardcoded registers. It's multi-platform anyway, so it uses registers when they are available on the target platform, if not it will use memory, that's why performance can differ between Windows and Linux on x64 for example - different number of available registers.

Asm2DOP21d ago

Indeed, but this also means that you would get drastically different performance on platforms that have more physical registers vs on platforms that have less. For example x86_64 only has 16 GP registers, while AArch64 has 32 - if you use 25 registers without any analysis and just go to stack with 10 of them, the difference could be huge.

But... I consider SLJIT to be for a different use-case than AsmJit. It's more portable, but its scope is much more limited.

vladich21d ago

It's definitely different, and for Postgres specifically, they may complement each other. SLJit can be used for low latency queries where codegen time is more important than optimizations, also for other platforms like s390x / PPC / SPARC, etc. AsmJit can be used for SIMD optimizations for x86_64 and ARM64. MIR is kinda in the middle - it does auto-allocations of registers, doesn't support SIMD, but also it's multiplatform. The only thing that doesn't fit well here is LLVM :). It has some advantages in some edge cases, but... It really needs a separate provider, the current one is bad. I'll probably create another LLVM backend for pg_jitter in the future to utilize it properly...

vladich21d ago

Good point about SIMD opportunities though - it's something other 2 JITs lack.

j / k navigate · click thread line to collapse

0 comments

vladich21d ago

Asm2DOP21d ago

If this function is optimized, or switched to some other implementation when there is tens of thousands of virtual registers, you would get orders of magnitude faster compilation.

vladich21d ago

1 more reply

vladich21d ago

Asm2DOP21d ago

But... I consider SLJIT to be for a different use-case than AsmJit. It's more portable, but its scope is much more limited.

vladich21d ago

Good point about SIMD opportunities though - it's something other 2 JITs lack.

j / k navigate · click thread line to collapse