https://github.com/Amanieu/hashbrown
https://blog.waffles.space/2018/12/07/deep-dive-into-hashbro...
I've hand-rolled and/or reused hash tables that use various probing methods (removing both allocations and extra indirections) and the performance improvement is non-trivial (for my workloads anyhow).
I've never gone for performance; at most I use `std::map<int, pointer>` (which is an RB-tree).
You can implement RQ without a hashmap.
It is only used to track the symbol number (uint32), so introducing hashing sounds a bit wasteful. You could also use a plain vector, but since the symbol numbers come from the network, that might result in a lot of reordering or pointless allocation if you are not careful.
--edit: typos
RaptorQ is essentially a Gaussian elimination of a matrix, so it all depends on the block size (=> matrix size). The algorithm has basically cubic complexity in the number of symbols in a block. RFC 6330 is designed to work on files, which are divided into blocks with a certain number of symbols, and the bytes are interleaved.
This implementation does not do the (complex and almost pointless) interleaving, which is fine, even OpenRQ does not.
The bench seems to be done on a... 10 kB file? It all fits in the L2 cache. We are not told the symbol size (which determines the block size!), and I assume all of this fits in a 10x10 matrix.
You are benchmarking operations on a matrix that is (more or less) a 10x10 byte matrix.
The biggest part of this benchmark might well be the generation of repair symbols (was it even done?), since that requires multiple XORs of the above-mentioned symbols.
This is much closer to micro-benchmarking than an actual benchmark, imho. It would have been more interesting to see what happens with files at least larger than the L3 cache.
You can also cache intermediate results (which he does not do), which is especially useful for encoding, but only when working on matrices >= 100x100; otherwise, searching the cache, fetching from memory (my implementation optionally did LZ4 compression/decompression), and doing a matrix multiplication is slower than just recomputing the matrix.
Still, it's nice to see implementations of the RFC, which is a real pain to read...
If anyone is qualified enough to take a crack at it, I can try to arrange for sponsorship as well.
O(n) sounds impressive, although I've heard such claims about RQ too...
RQ complexity is O(n^3), but in the size of the internal matrix, not the input. It all depends on the block size you choose.
The numbers are impressive, too. Encoding 40 Mbytes in 0 usec? I'll have to check the theory behind that.
The concept of a fountain seems interesting but what is a good use case? Variable strength error correction?
Receivers would listen for as many of these packets as they can, as and when they can (the transmission can be "lossy"), and if a receiver picks up 100 unique packets (any 100 unique packets, in any order), it'll have a 99.9% (or something along those lines) chance of decoding the message successfully. Each extra unique packet adds a 9 to the odds.
That's why the "fountain". It's a data fountain, and you can grab any quantity of data from it at any point and still have a good chance of reconstructing the message.
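A toy model of the numbers above (the 99.9% figure and the "each extra packet adds a 9" rule are the commenter's ballpark for k = 100 source symbols, not exact bounds from RFC 6330) can be sketched as:

```rust
// Toy model: with k source symbols, receiving k + extra unique packets
// decodes with probability roughly 1 - 10^-(3 + extra), per the comment's
// ballpark numbers (NOT the RFC's exact failure-probability bounds).
fn decode_prob(extra: u32) -> f64 {
    1.0 - 1e-3 * 0.1_f64.powi(extra as i32)
}

fn main() {
    assert!((decode_prob(0) - 0.999).abs() < 1e-12); // exactly k packets
    assert!((decode_prob(1) - 0.9999).abs() < 1e-12); // one extra packet
    println!("k+2 packets: {}", decode_prob(2));
}
```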
In case you don't find what you're looking for, here are all lectures on "Information Theory, Pattern Recognition, and Neural Networks": http://videolectures.net/course_information_theory_pattern_r...
I'd be remiss not to mention that the lectures are by the late David MacKay: https://news.ycombinator.com/item?id=11500221
ISBN 9780521782807
Most of the optimizations listed have nothing to do with unsafe; they are algorithmic changes that you could apply in any language, and to safe code just as well as to unsafe code. Even if we pretend all the unsafe used there is necessary, and interpret the source in the worst possible light for Rust (that the optimization steps would be impossible without unsafe), I think we'd only be talking a 4x to 5x slowdown from this article, not 25x.
But - I strongly suspect not all that unsafe is necessary, and the post certainly doesn't claim it is.
The first code snippet uses unsafe to skip bounds checks. There are plenty of other ways to get the optimizer to skip them for you in safe code, or at the very least hoist them out of the loop. This seems like a good read with some concrete examples:
https://coaxion.net/blog/2018/01/speeding-up-rgb-to-grayscal...
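For instance, one of the simpler safe patterns (a sketch, not code from the linked post or the article) is to iterate both slices together, so there is nothing for the optimizer to bounds-check in the first place:

```rust
// Sketch of a bounds-check-free safe version: zip() stops at the shorter
// slice, so no indexing (and no bounds checking) is ever needed.
pub fn add_assign(octets: &mut [u8], other: &[u8]) {
    for (a, b) in octets.iter_mut().zip(other.iter()) {
        *a ^= *b; // addition in GF(2^8) is XOR
    }
}

fn main() {
    let mut a = vec![1u8, 2, 3];
    let b = vec![1u8, 1, 1];
    add_assign(&mut a, &b);
    assert_eq!(a, vec![0, 3, 2]);
}
```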
Explicitly using SIMD code like the second code snippet does might require unsafe for the transmuted slicing under the hood (implemented as raw pointer dereferences in this case), but it's possible to build safe abstractions on top of that. That blog post links this, for example, although it didn't end up using it (EDIT: Actually, it initially didn't, but now does): https://github.com/AdamNiederer/faster . It does use unsafe internally: https://github.com/AdamNiederer/faster/search?q=unsafe&unsco...
The fragment of code in the article with `get_unchecked_mut()` shouldn't be necessary. It's a simple case that LLVM should be able to optimize. And if it didn't, it could be helped either by iterating both slices with `zip()` or by the trick `let slice = &slice[0..len];`, which proves to LLVM that you have the required length so it doesn't need to be checked again.
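The re-slicing trick mentioned above can be sketched like this (function name and data are illustrative, not from the article):

```rust
// Sketch of the `&slice[..len]` trick: re-slicing once up front proves the
// length to LLVM, so the indexed loop below compiles without a bounds check
// on every iteration.
pub fn xor_prefix(dst: &mut [u8], src: &[u8], len: usize) {
    let dst = &mut dst[..len]; // panics here if too short; checked once
    let src = &src[..len];
    for i in 0..len {
        dst[i] ^= src[i];
    }
}

fn main() {
    let mut d = [0xffu8, 0x0f, 0xf0, 0xaa];
    let s = [0x0fu8, 0x0f, 0x0f, 0x0f];
    xor_prefix(&mut d, &s, 3);
    assert_eq!(d, [0xf0, 0x00, 0xff, 0xaa]); // last byte untouched
}
```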
But overall the article seems in line what you'd expect from Rust: you have low-level control over memory layout and safety checks, and you can make trade-offs to squeeze maximum performance out of an algorithm if you need to.
I don't see why they used unsafe in `add_assign`; an `assert!(octets.len() == other.len())` would likely have elided the bounds checks.
Without these SIMD optimizations, I would imagine you'd see at least a 2x slowdown.
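The assert-based approach suggested above might look like this (a sketch, not the article's actual `add_assign`):

```rust
// Sketch: the up-front length assertion lets LLVM prove both indexing
// operations in the loop are in bounds, so no unsafe is needed and the
// per-iteration bounds checks can be elided.
pub fn add_assign(octets: &mut [u8], other: &[u8]) {
    assert_eq!(octets.len(), other.len());
    for i in 0..octets.len() {
        octets[i] ^= other[i]; // GF(2^8) addition is XOR
    }
}

fn main() {
    let mut a = [0b1010u8, 0b0101];
    add_assign(&mut a, &[0b1111u8, 0b1111]);
    assert_eq!(a, [0b0101, 0b1010]);
}
```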
Qualcomm has asserted that these Raptor code-related patents (an early one of which was filed in 2004) are standards-essential and must be licensed from Qualcomm.[1][2]
The patent linked below would expire in 2024. However, there is a slew of continuation applications that expire much later than 2024.
Update: additionally, there is at least one earlier-dated patent filed in 1999, which expired in February. [3]
[1] https://datatracker.ietf.org/ipr/2554/