Fc, a lossless compressor for floating-point streams

(github.com)

82 points | enduku | 15d ago | 27 comments


rincebrain12d ago

I must say, for a library that advertises handling streams of data, I was surprised by the absence of a stream utility that would let you run [input] | fc | fc -d.

I understand this is more of a primitive that you would build such a tool on top of; it's just that the first question I always have for a novel compressor is "how does it do on these example streams of data?"

radford-neal12d ago

Those interested in this might want to look at my paper, "Representing numeric data in 32 bits while preserving 64-bit precision", at https://arxiv.org/abs/1504.02914 (note the code available as auxiliary files). In the context of this compressor, it could be one of the codecs competing to compress a block. It works well for data converted from a decimal representation with a small number of digits.

userbinator12d ago

It splits the input into adaptively-sized blocks (quanta), runs a competition between many specialized codecs on each block, and emits the smallest result.

This is, for lack of a better term, a "metacompressor", but it will be interesting to see which of the choices end up dominating; in my past experiences with metacompression, one algorithm is usually consistently ahead.
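To make the "competition per block" idea concrete, here is a minimal sketch, not fc's actual implementation. The fixed block size, the three toy codecs (raw, zlib, XOR-with-previous plus zlib), the one-byte tag, and the `wins` counter are all assumptions chosen for illustration; fc reportedly adapts block sizes and uses far more specialized codecs.

```python
import struct
import zlib

BLOCK = 1024  # values per block (assumed fixed here; fc adapts block sizes)

def _xor_prev(raw):
    """XOR each 64-bit word with its predecessor (the first word is kept as is)."""
    words = struct.unpack(f"<{len(raw) // 8}Q", raw)
    out, prev = [], 0
    for w in words:
        out.append(w ^ prev)
        prev = w
    return struct.pack(f"<{len(out)}Q", *out)

def _unxor_prev(raw):
    """Inverse of _xor_prev: a running XOR restores the original words."""
    words = struct.unpack(f"<{len(raw) // 8}Q", raw)
    out, prev = [], 0
    for w in words:
        prev ^= w
        out.append(prev)
    return struct.pack(f"<{len(out)}Q", *out)

# Each codec is an (encode, decode) pair over the raw bytes of one block.
# The list index doubles as the tag byte written in front of each block.
CODECS = [
    (lambda b: b, lambda b: b),                      # 0: raw fallback
    (zlib.compress, zlib.decompress),                # 1: plain zlib
    (lambda b: zlib.compress(_xor_prev(b)),
     lambda b: _unxor_prev(zlib.decompress(b))),     # 2: XOR-with-previous + zlib
]

def encode(values):
    """Return (blob, wins), where wins[t] counts the blocks won by codec t."""
    out = bytearray()
    wins = [0] * len(CODECS)
    for i in range(0, len(values), BLOCK):
        block = values[i:i + BLOCK]
        raw = struct.pack(f"<{len(block)}d", *block)
        candidates = [enc(raw) for enc, _ in CODECS]
        tag = min(range(len(CODECS)), key=lambda t: len(candidates[t]))
        payload = candidates[tag]
        # Block header: 1-byte codec tag + 4-byte payload length.
        out += struct.pack("<BI", tag, len(payload)) + payload
        wins[tag] += 1
    return bytes(out), wins

def decode(blob):
    values, pos = [], 0
    while pos < len(blob):
        tag, size = struct.unpack_from("<BI", blob, pos)
        pos += 5
        raw = CODECS[tag][1](blob[pos:pos + size])
        values += struct.unpack(f"<{len(raw) // 8}d", raw)
        pos += size
    return values
```

Printing `wins` after encoding a data set is a cheap way to check whether one codec is consistently ahead, and the raw fallback bounds the worst case at five bytes of overhead per block.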

apodik12d ago

I’ve never heard of a metacompressor before. What others exist?

pella12d ago

> "fc is a lossless compressor for streams of IEEE-754 64-bit doubles."

The new OpenZL SDDL2 (Simple Data Description Language) supports several different floating-point types. It would be worthwhile to contribute some of the FC project's experience to OpenZL. The types OpenZL currently supports:

  | Type            | Size    | Endian |
  |-----------------|---------|--------|
  | `Int8`          | 1 byte  | N/A    |
  | `UInt8`         | 1 byte  | N/A    |
  | `Int16LE/BE`    | 2 bytes | Yes    |
  | `UInt16LE/BE`   | 2 bytes | Yes    |
  | `Int32LE/BE`    | 4 bytes | Yes    |
  | `UInt32LE/BE`   | 4 bytes | Yes    |
  | `Int64LE/BE`    | 8 bytes | Yes    |
  | `UInt64LE/BE`   | 8 bytes | Yes    |
  | `Float16LE/BE`  | 2 bytes | Yes    |
  | `Float32LE/BE`  | 4 bytes | Yes    |
  | `Float64LE/BE`  | 8 bytes | Yes    |
  | `BFloat16LE/BE` | 2 bytes | Yes    |
  | `Bytes(n)`      | n bytes | N/A    |

Some links:

- https://github.com/facebook/openzl/releases/tag/v0.2.0

- https://openzl.org/getting-started/introduction/

- https://openzl.org/sddl/sddl2-announcement/

- https://openzl.org/sddl/core-concepts/

endukuOP12d ago

Thanks, this looks super relevant. I think the transferable part is the per-block selector over predictors, strides, deltas, exponent/mantissa-ish structure, byte transpose, fallback raw/LZ, etc. SDDL2 looks like a natural place to try some of that.

childintime12d ago

A lossy compressor might also be useful for common floating point apps. The simplest compressor ever would just chop off a number of bits from the mantissa.
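As a rough illustration of the mantissa-chopping idea, the sketch below zeroes the low bits of a double's 52-bit mantissa. The function name and the `drop_bits` parameter are made up for this example. Truncation alone does not shrink anything; it only creates runs of zero bits that a lossless stage afterwards can exploit.

```python
import struct

def truncate_mantissa(x, drop_bits):
    """Zero the lowest `drop_bits` bits of the 52-bit mantissa of a double."""
    if not 0 <= drop_bits <= 52:
        raise ValueError("drop_bits must be in 0..52")
    # Reinterpret the double as a 64-bit unsigned integer.
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    # Keep the sign, the exponent, and the high mantissa bits.
    mask = ~((1 << drop_bits) - 1) & 0xFFFFFFFFFFFFFFFF
    (y,) = struct.unpack("<d", struct.pack("<Q", bits & mask))
    return y
```

For normal (non-subnormal) values the relative error is below 2^(drop_bits - 52), so dropping 32 bits keeps roughly six significant decimal digits.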

endukuOP12d ago

Yeah, and also approximating a double (within range) to int32 :)

https://x.com/Densebit/status/1839705674378613043?s=20

dahart12d ago

That code is absolutely terrible! Never do that. The range is awful, and the relative error is awful.

If you want a double in 32 bits, convert to single precision float. This will beat the relative error of the code you linked to by orders of magnitude, and allow the range of float (~1e38) rather than be limited to +- 1e9.
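To put a number on the single-precision suggestion, here is a small sketch that measures the round-trip error of storing a double as a 32-bit float. The helper names are invented for this example, and it says nothing about what the linked code does. A float32 has a 24-bit significand, so with round-to-nearest the relative error is at most 2^-24 (about 6e-8) for any value in float32's normal range.

```python
import struct

def to_float32(x):
    """Round a double to the nearest single-precision value, then widen it back.

    struct.pack raises OverflowError if |x| exceeds the float32 range.
    """
    (y,) = struct.unpack("<f", struct.pack("<f", x))
    return y

def relative_error(x):
    """Relative error of the float32 round trip; x must be nonzero."""
    return abs(to_float32(x) - x) / abs(x)
```

Because the bound is relative, it holds equally for values near 1e-30 and near 1e30, which is exactly what a fixed-scale integer encoding cannot offer.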

loeg12d ago

The question is, how close can OpenZL come? (This is from the same people who develop zstd, but suitable for structured data in a generic way.)

endukuOP12d ago

I need to add it to the benchmark. My expectation is that OpenZL should be strong when the enclosing format is known and SDDL can separate typed fields cleanly. Running both on the same f64 arrays will show how they compare on plain double streams.

Scaevolus12d ago

I see you have ALP, but have you tried Chimp128 or Arrow's byte stream split?

endukuOP12d ago

I have an XOR128-style mode and a byte-transpose/byte-split-like mode, but I should not claim that as a proper Chimp128 or Arrow/Parquet byte-stream-split comparison yet. I will add direct baselines for Chimp128 and Arrow/Parquet BSS+zstd to the harness.
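For anyone unfamiliar with byte stream split, the transform itself is just a byte transpose: all the first bytes of every value, then all the second bytes, and so on. The sketch below is a generic illustration for little-endian doubles, not fc's mode and not the Arrow/Parquet code; the function names are made up here.

```python
import struct

def byte_stream_split(values):
    """Transpose n doubles into 8 streams: every byte 0, then every byte 1, ..."""
    raw = struct.pack(f"<{len(values)}d", *values)
    return b"".join(raw[k::8] for k in range(8))

def byte_stream_merge(split):
    """Inverse transpose: interleave the 8 streams back into doubles."""
    n = len(split) // 8
    raw = bytearray(len(split))
    for k in range(8):
        # Stream k holds byte k of every value.
        raw[k::8] = split[k * n:(k + 1) * n]
    return list(struct.unpack(f"<{n}d", bytes(raw)))
```

The transform does not compress by itself. It groups the sign and exponent bytes, which tend to repeat in real data, so that a general-purpose compressor such as zstd or zlib run afterwards sees long similar runs.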

abcd_f12d ago

The most interesting section - How It Works - could really elaborate on details a bit more.

endukuOP12d ago

Agreed. Will work on that :)

KerrickStaley12d ago

Another library in this space is pcodec; I'd appreciate a comparison of the two.

endukuOP12d ago

Agreed; pcodec is probably one of the most relevant comparisons. I will add pcodec to the benchmark.

endukuOP15d ago

I built "fc", a C library for compressing streams of 64-bit floating-point values without quantization.

It is not trying to replace zstd or lz4. The idea is narrower: take blocks of doubles, try a set of float-specific predictors/transforms/coders, and emit whichever representation is smallest for that block.

It is aimed at time-series, scientific, simulation, and analytics data where the numbers often have structure: smooth curves, repeated values, fixed increments, periodic signals, predictable deltas, or low-entropy mantissas.
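A minimal sketch of why that kind of structure helps, assuming a simple linear-extrapolation predictor with XOR residuals (one plausible example of a float-specific predictor, not necessarily one fc uses; all names here are hypothetical). The decoder repeats the same prediction from already-decoded values, so the scheme stays lossless.

```python
import struct

def _bits(x):
    """Reinterpret a double as a 64-bit unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def _float(b):
    """Reinterpret a 64-bit unsigned integer as a double."""
    return struct.unpack("<d", struct.pack("<Q", b))[0]

def _predict(history):
    """Linear extrapolation from the last two values; simpler guesses before that."""
    if len(history) < 2:
        return history[-1] if history else 0.0
    return 2.0 * history[-1] - history[-2]

def residuals(values):
    """XOR the bits of each value with the bits of its prediction."""
    out, history = [], []
    for x in values:
        out.append(_bits(x) ^ _bits(_predict(history)))
        history.append(x)
    return out

def reconstruct(res):
    """Invert residuals() by replaying the same predictions."""
    history = []
    for r in res:
        history.append(_float(r ^ _bits(_predict(history))))
    return history
```

For a series with an exactly representable fixed increment, such as 0.0, 0.5, 1.0, ..., every residual after the second is zero, which an entropy coder can store in almost no space. Noisier data produces residuals with fewer zero bits, which is where trying several predictors per block pays off.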

The API is intentionally small: "fc_enc", "fc_dec", a config struct, and a few counters to inspect which modes won. Decode is parallel and meant to be fast; encode spends more CPU searching for a better representation.

Current caveats: x86-64 only for now, tuned for IEEE-754 doubles, research-grade rather than production-hardened.

Repo: https://github.com/xtellect/fc

gus_massa14d ago

Does it assume the floats come from photos or sound or something?
