kstenerud 5h ago

The problem is that this breaks down once you try to use SIMD instructions. I developed a similar approach to encoding integers (and IEEE 754 floats) a couple of years ago (first byte encodes length and first bit of data: https://github.com/kstenerud/bonjson/blob/05b91f6fe7d6b07186... ). It was very clever and used compiler intrinsics to get the length in 1 instruction, so 2 instructions got you the final value, with no branches.

But testing proved that when you move to SIMD instructions, ULEB128 (https://github.com/kstenerud/bonjson/blob/main/bonjson.md#ty...) or sentinel values (https://github.com/kstenerud/bonjson/blob/main/bonjson.md#lo...) win every time because of the parallelization opportunities.
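For readers who haven't met ULEB128: each byte carries 7 payload bits, least-significant group first, and the high bit says whether another byte follows. A minimal scalar sketch of a decoder is below. The function name `uleb128_decode` and its return convention are illustrative assumptions, not taken from the bonjson sources.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one ULEB128 value from buf[0..len).
 * Returns the number of bytes consumed, or 0 if the input is truncated
 * or runs past 10 bytes. Sketch only: it does not reject excess payload
 * bits in a 10th byte. */
size_t uleb128_decode(const uint8_t *buf, size_t len, uint64_t *out) {
    uint64_t value = 0;
    unsigned shift = 0;
    for (size_t i = 0; i < len && shift < 64; i++, shift += 7) {
        value |= (uint64_t)(buf[i] & 0x7F) << shift; /* low 7 bits are payload */
        if ((buf[i] & 0x80) == 0) {                  /* high bit clear: last byte */
            *out = value;
            return i + 1;
        }
    }
    return 0;
}
```

The only cross-byte dependency is that single high bit, which is the property the SIMD argument in this thread relies on.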

The true irony is that even SIMD text parsing would outperform this! SIMD is that powerful.

nine_k 5h ago

I think these are different use cases. If you talk about SIMD, you talk about the CPU and efficient processing of large numbers of integers. I think that when a solution like this crops up, it's about storage or transmission, and dense packing at the cost of non-uniformity. It's more like how time-series databases pack numbers with delta encoding.

kstenerud (OP) 5h ago

The thing is, most real-world numbers will fit within 1-3 bytes (even at 7 bits per byte), so ultradense packing doesn't actually buy much outside of benchmarks.

I spent WAYYYYYYYY too much time exploring this...

hansvm 47m ago

I dunno. Varints in the wild tend to be misused, and there are external proto schemas at work we have to integrate with which would literally be both faster and smaller as gzipped json. They're misused because they have an API encouraging misuse -- compressing scalars rather than sequences. Varints are used because they can have reasonable developer ergonomics while sometimes improving computer metrics a twidge.

On top of that, for the vast majority of performance/cost parameter spaces, you're better off in both developer ergonomics and speed/space slapping zstd across a flatter binary format, supposing no better tool fits your use case. Especially if your messages aren't exceptionally tiny. You're not using them in a raw DB or doing raw bulk analysis on varints (else basically zero choices of parameters make varints win out), so you're transferring them somewhere and decoding them. That decoding step, even for highly optimized solutions like bijou64, is on par with (slightly better than, if you have an older datacenter link) your raw network. If you spend 1s on networking, you spend 1s on parsing. That's almost always a bad tradeoff, and that assumes a good varint solution.

Even when varints make sense for some set of perf/cost parameters, it's still only for developer ergonomics 99.9999% of the time. Even simple changes like operating on a sequence of values rather than a single scalar enable vastly better CPU/space tradeoffs, and being willing to craft a proper data layout usually offers huge gains on top of that.

It's interesting that you pick delta encoding (or its natural extension, double-delta encoding, which is often valuable) for time-series databases as an example. That's an obvious case where you have a solution which is extremely cheap in storage/network/CPU. Varints suck comparatively, almost always.
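To make the delta and double-delta idea concrete, here is a minimal in-place sketch, assuming signed 64-bit samples whose differences don't overflow. The function names are hypothetical, and real time-series stores layer bit-packing on top of this step.

```c
#include <stddef.h>
#include <stdint.h>

/* Replace each element (except the first) with its difference from the
 * previous element. Walks backwards so earlier values are still intact. */
void delta_encode(int64_t *v, size_t n) {
    if (n < 2) return;
    for (size_t i = n - 1; i > 0; i--) v[i] -= v[i - 1];
}

/* Inverse of delta_encode: a running prefix sum. */
void delta_decode(int64_t *v, size_t n) {
    for (size_t i = 1; i < n; i++) v[i] += v[i - 1];
}

/* Double-delta: delta the values, then delta the deltas, keeping the
 * first value and the first delta as-is. */
void double_delta_encode(int64_t *v, size_t n) {
    delta_encode(v, n);
    if (n > 1) delta_encode(v + 1, n - 1);
}

void double_delta_decode(int64_t *v, size_t n) {
    if (n > 1) delta_decode(v + 1, n - 1);
    delta_decode(v, n);
}
```

For regularly sampled timestamps such as 1000, 1010, 1020, 1030, 1041 the double-delta form is 1000, 10, 0, 0, 1: mostly zeros, which is what makes the follow-up packing so cheap.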

Not to rip on them too much, especially since it's nice to have primitives available which let you not have to do hard thinking for literally every problem, but they're not amazing and not a great default.

the-lazy-guy 5h ago

Stored and transmitted data also has to be decoded. With modern datacenter hardware, the bottleneck is often the CPU rather than the network or disk (SSD). It depends on the specific properties of the data. (I used to work on a search index implementation, which is all about decoding and intersecting large numbers of hit-lists, and the right SIMD-friendly varint encoding is obviously crucial there.)

jaen 4h ago

This doesn't seem particularly hard to SIMD, especially when the CPU architecture has "compress/expand" horizontal instructions. The first byte fully encodes the length, which is not harder than the continuation bits of (U)LEB128. It's basically a common length-prefixed encoding with an extra subtract added in, so someone has probably figured out an efficient algorithm.
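A rough scalar sketch of "the first byte fully encodes the length" follows. It assumes a tag layout in which bytes 0-247 are complete values and tags 248-255 announce 1 to 8 following bytes. That mapping is an assumption made for illustration, and both function names are hypothetical, so check the actual format before relying on it.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of payload bytes that follow a tag byte.
 * ASSUMED layout: 0..247 = value stored in the tag itself,
 * 248..255 = 1..8 payload bytes follow. */
static unsigned tag_extra_bytes(uint8_t tag) {
    return tag < 248 ? 0u : (unsigned)tag - 247u;
}

/* Count encoded values by hopping from tag to tag. Each hop needs the
 * previous tag to be decoded first, so the walk is inherently serial.
 * A truncated final value is still counted. */
size_t count_values(const uint8_t *buf, size_t len) {
    size_t count = 0;
    size_t pos = 0;
    while (pos < len) {
        pos += 1 + tag_extra_bytes(buf[pos]);
        count++;
    }
    return count;
}
```

The length lookup itself is a compare and a subtract; the cost under discussion is that the position of the next tag is not known until the current one has been read.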

It might be slightly more instructions than some other serial VL (variable-length) integer codec choices, but overall I don't think it's more difficult.

The very efficient SIMD VL codecs tend to stripe (separate) the control and data bits, so they're in a different design space anyway.

kstenerud (OP) 4h ago

It can't be done, because the next bytes are dependent upon the first byte (which only works in limited circumstances, and where you have constant spacing between the values).

ULEB128 works in SIMD because there's only one dependent bit per byte, so you can speculatively decode and then correct later cheaply. Bijou requires you to check the first byte and then branch based on the value using all 8 bits in the decision matrix (to handle branches 0-247, 248, 249, 250, 251, 252, 253, 254, 255). This absolutely DESTROYS any parallelization opportunities.
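As an illustration of the "one dependent bit per byte" point, the SSE2 sketch below pulls the continuation bits of 16 ULEB128 bytes out in a single `_mm_movemask_epi8`, with no per-byte branching. The function name `uleb128_terminators` is made up for this example, and it assumes 16 readable bytes at `p`.

```c
#include <emmintrin.h> /* SSE2 */
#include <stdint.h>

/* Returns a 16-bit mask in which bit i is set when byte i has its
 * continuation (high) bit clear, i.e. byte i ends a ULEB128 value.
 * Assumes at least 16 readable bytes at p. */
static unsigned uleb128_terminators(const uint8_t *p) {
    __m128i chunk = _mm_loadu_si128((const __m128i *)p);
    /* movemask gathers the high bit of every byte: bit i = byte i's MSB */
    unsigned continuation = (unsigned)_mm_movemask_epi8(chunk);
    return ~continuation & 0xFFFFu;
}
```

Every byte is classified independently of its neighbours, so the value boundaries for the whole register are known at once. A tag-byte format can't be classified this way, because a byte's role depends on the tags before it.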

Not to mention that non-canonical sized ints (3, 5, 6, 7) have abysmal performance compared to unaligned 2, 4, and 8 byte reads on modern processors.
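One common workaround for odd widths (not something proposed in this thread) is to always do a fixed 8-byte unaligned load and mask off the bytes you don't want. The sketch below assumes a little-endian host and at least 8 readable bytes at `p`, so the buffer needs padding at the end. `load_le_n` is a hypothetical helper name.

```c
#include <stdint.h>
#include <string.h>

/* Load the low n bytes (0 <= n <= 8) at p as a little-endian integer.
 * ASSUMES: little-endian host, and 8 readable bytes at p even when n < 8. */
static uint64_t load_le_n(const uint8_t *p, unsigned n) {
    uint64_t v;
    memcpy(&v, p, sizeof v); /* compilers turn this into one unaligned load */
    uint64_t mask = n >= 8 ? UINT64_MAX : ((UINT64_C(1) << (8 * n)) - 1);
    return v & mask;
}
```

This trades a variable-length copy for one canonical-width read plus a mask, at the price of requiring readable padding after the last value.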

jaen 4h ago

Right, I think we have a slightly different definition of SIMD: You mean byte-parallel, I mean "doable with SIMD instructions". I also didn't imply the performance would be better than other methods...

Even though decoding the lengths must be serial (since there's no unambiguous way to differentiate a tag byte from a data byte), it's still doable within the wider SIMD registers, so there's some theoretical efficiency gain to be had (depending on the shape of the data).

On a general note, the continuation-bit and prefix-byte forms are equivalent: you just broadcast the prefix byte and compare it against an increasing vector to convert it to a mask. Yeah, there's probably more fiddly SIMD if there are multiple prefixes in the register, but it's doable (it's just not byte-parallel; e.g., you unroll the serial decode loop 8 times, or whatever your maximum output byte width is, and mask out).

Simplified:

  // Needs <emmintrin.h> (SSE2)
  // Just maps a byte to its position in the register
  __m128i idx = _mm_setr_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
  // Broadcast the prefix
  __m128i nn = _mm_set1_epi8((char)prefix_byte);
  // Get applicable locations: prefix_byte contains the length; every byte with byte_pos < len is set to 0xFF
  // (the compare is signed, so this assumes the length is at most 127)
  __m128i m = _mm_cmpgt_epi8(nn, idx);
  // If you *really* want a high-bit mask:
  m = _mm_and_si128(m, _mm_set1_epi8((char)0x80));
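For anyone who wants to compile the idea, here is the same broadcast-and-compare trick wrapped in a self-contained helper that zeroes every byte at or past a given length. The name `keep_first_bytes` is hypothetical, and the helper assumes `len` is at most 16 (it must stay below 128 because the compare is signed).

```c
#include <emmintrin.h> /* SSE2 */
#include <stdint.h>

/* Keep bytes [0, len) of data and zero the rest.
 * Assumes len <= 16; _mm_cmpgt_epi8 is a signed compare. */
static __m128i keep_first_bytes(__m128i data, uint8_t len) {
    const __m128i idx = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                      8, 9, 10, 11, 12, 13, 14, 15);
    const __m128i nn = _mm_set1_epi8((char)len); /* broadcast the length */
    const __m128i m = _mm_cmpgt_epi8(nn, idx);   /* 0xFF where idx < len */
    return _mm_and_si128(data, m);
}
```

With `len = 3` only the first three bytes of the register survive; the rest come back as zero.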

itishappy 2h ago

> The true irony is that even SIMD text parsing would outperform this! SIMD is that powerful.

Can you explain this part a bit? I feel like intuitively (and therefore probably incorrectly) these should have the same difficulties.
