The Nimble File Format by Meta (opens in new tab)

(github.com)

52 pointszzulus2y ago21 comments

21 comments

Where is the file format specification?

I would prefer to write a parser with zero dependencies.

They strongly discourage you.

Unified: More than a specification, Nimble is a product. We strongly discourage developers to (re-)implement Nimble’s spec to prevent environmental fragmentation issues observed with similar projects in the past. We encourage developers to leverage the single unified Nimble library, and create high-quality bindings to other languages as needed.

jgarzik2y ago

Sigh. I get more than enough of that in the blockchain world.

Call me a greybeard; I want multiple implementations and a spec.

SonOfLilit2y ago

To be clear, I agree with you (and my beard is kinda graying). Just pointing out that they don't.

hi-v-rocknroll2y ago

A spec. There is no substitute for it. Code isn't a spec. A spec is documentation that describes a format and/or operation. The absence of it is laziness and/or arrogance.

1 more reply

karmakaze2y ago

Why not instead make a test suite of inputs and expected pass/fail for validation of implementations.

It's not more than a specification if a single implementation is to be used--then it's a 'spec' defined by the implementation because any idiosyncrasies of that implementation become defacto specification.

yencabulator2y ago

Untrusted file parsing in C++ is not how many people wish to roll in 2024.

CharlesW2y ago

Video link and more comments from two weeks ago: https://news.ycombinator.com/item?id=39995112

winwang2y ago

Although "wide" data is touted as a optimization guideline for Nimble, how well does it fare against "normal"(?) data, i.e. with just a few to tens of columns?

Also, are there any preliminary benchmarks?

quadrature2y ago

There are some numbers in the youtube presentation https://www.youtube.com/watch?v=bISBNVtXZ6M

It seems to be optimized towards ML where sequential scan is the access pattern. so it wouldn't be suitable for analytical workloads yet, though they are planning on working on that.

1-62y ago

I wonder how well it functions as a blobstore.

levzettelin2y ago

What are the differences versus Arrow/Parquet?

snthpy2y ago

Yes, and also compared to Lance [1]. They have a spec [2] and I got a lot more out of Chang's talk [3] than the Nimble one.

1: https://lancedb.github.io/lance/

2: https://lancedb.github.io/lance/format.html

3: https://youtu.be/ixpbVyrsuL8?si=9QhF0wyxYtl2L01_

westonpace2y ago

Lance dev here. We are working on a new version of our format[1] as well. We are watching Nimble too. If they are interested in solving our use cases then that is less work for us.

At the moment it is not clear that is the case. However, it is too early to tell. Our biggest concerns are:

- Good integration with object storage

- Ability to write multi-modal data without exhausting memory

- Support for fast point-lookups (with the option of cranking up the amount of metadata for richer lookup structures that will be cached in RAM)

Both Nimble and Lance are not intended to replace Parquet/Arrow. Parquet and Arrow are designed to be spread throughout a solution as a universal interchange format. E.g. you will often see them all throughout ETL pipelines so that different components can transfer data (even if it isn't a ton of data). With Arrow and Parquet interoperability is a higher priority than performance (though these formats are fast as well). They are developed slowly, via consensus, as they should be.

Nimble and Lance are designed for "search nodes" / "scan nodes" which are meant to sit in front of a large stockpile of data and access it efficiently. There are typically only a few such components (usually just a single one) in a solution (e.g. the database). Performance is the primary goal (though we do attempt to document things clearly should others wish to learn / build upon). I'd advise anyone building a search node or scan node to make the file format a configurable choice hidden behind some kind of interface.

[1] https://blog.lancedb.com/lance-v2/

nmstoker2y ago

Yes, I'd be curious to know how much better it is than them - from my limited understanding, they also share many of the advantages that Nimble boasts of, thus I can appreciate they'd both be better than legacy formats but it's not clear how close these two are.

zX41ZdbW2y ago

Interesting if there are benefits over MergeTree (ClickHouse's data format).

1-62y ago

I thought the future would have been unstructured.

albertzeyer2y ago

https://xkcd.com/927/ ?

We still use HDF (https://en.wikipedia.org/wiki/Hierarchical_Data_Format).

But I wonder, if I would choose a new file format today, what to choose? Nimble is maybe too new and there is too less experience with it (outside Meta).

Is there anywhere a good overview of all available options, and some fair comparison? Some that I found, but older:

https://www.hopsworks.ai/post/guide-to-file-formats-for-mach...

https://iopscience.iop.org/article/10.1088/1742-6596/1085/3/...

https://github.com/pangeo-data/pangeo/issues/285

yencabulator2y ago

Well, Parquet seems to be so widely supported, it's my default pick, unless you can explain why it's not the right fit.

Though I'll say if your primary use case is "higher-dimensional arrays", none of Parquet etc are likely to be a good fit -- these things are columnar formats where each column has a separate name, datatype etc, not formats for multi-dimensional arrays of numbers. That's a different problem. A Parquet column can be a list of arrays, but there's no special handling of matrices.

j / k navigate · click thread line to collapse

21 comments

jgarzik2y ago

Where is the file format specification?

I would prefer to write a parser with zero dependencies.

SonOfLilit2y ago

They strongly discourage you.

jgarzik2y ago

Sigh. I get more than enough of that in the blockchain world.

Call me a greybeard; I want multiple implementations and a spec.

SonOfLilit2y ago

To be clear, I agree with you (and my beard is kinda graying). Just pointing out that they don't.

hi-v-rocknroll2y ago

A spec. There is no substitute for it. Code isn't a spec. A spec is documentation that describes a format and/or operation. The absence of it is laziness and/or arrogance.

1 more reply

karmakaze2y ago

Why not instead make a test suite of inputs and expected pass/fail for validation of implementations.

yencabulator2y ago

Untrusted file parsing in C++ is not how many people wish to roll in 2024.

CharlesW2y ago

Video link and more comments from two weeks ago: https://news.ycombinator.com/item?id=39995112

winwang2y ago

Although "wide" data is touted as a optimization guideline for Nimble, how well does it fare against "normal"(?) data, i.e. with just a few to tens of columns?

Also, are there any preliminary benchmarks?

quadrature2y ago

There are some numbers in the youtube presentation https://www.youtube.com/watch?v=bISBNVtXZ6M

It seems to be optimized towards ML where sequential scan is the access pattern. so it wouldn't be suitable for analytical workloads yet, though they are planning on working on that.

1-62y ago

I wonder how well it functions as a blobstore.

levzettelin2y ago

What are the differences versus Arrow/Parquet?

snthpy2y ago

Yes, and also compared to Lance [1]. They have a spec [2] and I got a lot more out of Chang's talk [3] than the Nimble one.

1: https://lancedb.github.io/lance/

2: https://lancedb.github.io/lance/format.html

3: https://youtu.be/ixpbVyrsuL8?si=9QhF0wyxYtl2L01_

westonpace2y ago

Lance dev here. We are working on a new version of our format[1] as well. We are watching Nimble too. If they are interested in solving our use cases then that is less work for us.

At the moment it is not clear that is the case. However, it is too early to tell. Our biggest concerns are:

- Good integration with object storage

- Ability to write multi-modal data without exhausting memory

- Support for fast point-lookups (with the option of cranking up the amount of metadata for richer lookup structures that will be cached in RAM)

[1] https://blog.lancedb.com/lance-v2/

nmstoker2y ago

zX41ZdbW2y ago

Interesting if there are benefits over MergeTree (ClickHouse's data format).

1-62y ago

I thought the future would have been unstructured.

albertzeyer2y ago

https://xkcd.com/927/ ?

We still use HDF (https://en.wikipedia.org/wiki/Hierarchical_Data_Format).

But I wonder, if I would choose a new file format today, what to choose? Nimble is maybe too new and there is too less experience with it (outside Meta).

Is there anywhere a good overview of all available options, and some fair comparison? Some that I found, but older:

https://www.hopsworks.ai/post/guide-to-file-formats-for-mach...

https://iopscience.iop.org/article/10.1088/1742-6596/1085/3/...

https://github.com/pangeo-data/pangeo/issues/285

yencabulator2y ago

Well, Parquet seems to be so widely supported, it's my default pick, unless you can explain why it's not the right fit.

j / k navigate · click thread line to collapse