Show HN: B-field, a novel probabilistic key-value data structure (`rust-bfield`) (opens in new tab)

(github.com)

153 pointsboyd2y ago36 comments

`rust-bfield` is a Rust implementation of our novel "B-field" data structure, which functions like a Bloom filter for key-value lookups instead of set membership queries.

The B-field allows you to compactly store data using only a few bytes per key-value pair. We've successfully utilized it in genomics to associate billions of "k-mers" with taxonomic identifiers while maintaining an efficient memory footprint. But the data structure is also useful beyond computational biology, particularly where you have large unique key domains and constrained value ranges.

Available under an Apache 2 license. We hope it proves useful, and we're happy to answer any questions!

36 comments

pkhuong2y ago

I'd expect a comparison with compact (or even succinct) constructions for arbitrary functions, like MWHC. Section 3.2 of https://vigna.di.unimi.it/ftp/papers/TheoryPracticeMonotone.... has a good overview.

Given a set S of arbitrary hashable values, it's possible to represent a function from S to r bits in |S|r + o(|S|) bits (keys outside S are mapped to random r-bit values). More practical construction hit ~1.23 |S|r, or even |S|(r + 1.23) bits. It should also be faster to evaluate than `r` bloom filter lookups for large datasets.

I think the main advantage of the bloom filter (or compressed bitmap) approach is it can be updated incrementally. MWHC-style representations are better suited to build once / read many workloads.

boydOP2y ago

My understanding is that a perfect hash function maps elements elements to a unique integer (i.e., it's a one-to-one mapping). I think PHF data structures will also always return a value. So if you look up an element not in the constructed PHF, you'll always get a "false positive" value.

In contrast, a B-field lets you map a key to an arbitrary number of (typically non-unique) values. So I could map a million elements to "1", another million to "2", etc.

I'm not especially current (or fluent!) in that literature though, so would love pointers to anything that doesn't have the above constraints.

pkhuong2y ago

The MWHC construction represents minimal (monotone!) perfect hash functions as arbitrary functions to the ceil(log(n)) bits needed to store the rank... where the value happens to be the rank, but could be anything.

1 more reply

danking002y ago

I think it might help readers to include a narrative about an example application. Perhaps I’m in the minority but I tend to think of Bloom filters as a way to reliably know something isn’t in a set (e.g. so as to not run an expensive disk read). This data structure seems to view them the dual way: “this is maybe the right value for this key”.

I’ve seen that view work for visualizations like approximate CDFs and medians where I have some statement like “with probability p, the value differs from truth by less than e”. Is this data structure used in a similar way? My instinct is that visualizations having a low rate of being wrong is OK because the human will follow up that visualization with more tests. In the end you have lots of evidence supporting the conclusion.

boydOP2y ago

Ah, we need to clarify the language! The B-field will always return the correct value for an inserted key.

False positives are only returned for keys that have not been inserted. This is akin to a Bloom filter falsely returning that a key is in the set).

dpc_012342y ago

I second that "The B-field will always return the correct or Indeterminate value for an inserted key." before listing classes of errors would clarify it by a lot.

ComputerGuru2y ago

Curious idea. So it’s for cases where you have any key but associated with one of only (preferably few) discrete values. I.E. your url example is great with url as a key but subpar if url were to be the value (padded to length n with trailing nulls encoded as a fixed width int array)?

With its interesting set of guarantees, I can’t see a case where you could use this unless you are 100% positive all keys have previously been inserted into the set, otherwise you risk getting a wrong value in return (instead of no value). A traditional bloom filter is similar but in the worst case you throw away work because you look up the determinative data/value but here it’s a bit trickier.

Lots of applications tolerate missing results but significantly fewer can tolerate “unknowingly incorrect” results.

Question about the implementation: I would have expected the primary interface to be in-memory with some api for disk spillover for large datasets but while all the docs say “designed for in-memory lookups” the rust api shows that you need to provide it with a temp directory to create the structure? (Also, fyi, you use temp::temp_file() but never actually use the result, instead using the hard-coded /tmp path.)

foota2y ago

It seems like this would be most suitable for a system aggregating data. As long as you aggregate enough data points that the error averages out, it wouldn't be an issue.

I guess another use case could be as any kind of "hint" where you need to do an authoritative lookup regardless of the filter lookup.

E.g., the file might be on this host, but you'll need to reach the host and check for the file either way, so if you go to the wrong host sometimes, it's not the end of the world.

That's something that's not possible with a bloom filter.

Seems like you could combine a shared static file and a host local cache to work around the errors as well (e.g., each host can cache whatever keys they've looked up that were wrong, but they can do LRU to get the best of both worlds (frequently accessed data is correct, while you can look up infrequent data with some chance of a miss).

boydOP2y ago

I think those are both good examples of where you can manage the cost of a false positive.

In genomics, we're using this to map a DNA substring (or "k-mer") to a value. We can tolerate a very low error rate for those individual substrings, especially since any erroneous values will be random (vs. having the same or correlated values). So, with some simple threshold-based filtering, our false positive problem goes away.

Again, you'll never get the incorrect value for a key in the B-field, only for a key not in the B-field (which can return a false positive with a low, tunable error rate).

1 more reply

boydOP2y ago

> So it’s for cases where you have any key but associated with one of only (preferably few) discrete values

We use it for a case with ~million unique values, but it's certainly more space efficient for cases where you have tens, hundreds, or thousands of values. The "Space Requirements" section has a few examples: https://github.com/onecodex/rust-bfield?tab=readme-ov-file#s... (e.g., you can store a key-value pair with 32 distinct values in ~27 bits of space at a 0.1% false positive rate).

> all the docs say “designed for in-memory lookups”

We use mmap for persistence as our use case is largely a build-once, read many times one. As a practical matter, the data structure involves lots of random access, so is better suited to in-memory use from a speed POV.

> fyi, you use temp::temp_file() but never actually use the result, instead using the hard-coded /tmp path

Thank you, have opened an issue and we'll fix it!

ComputerGuru2y ago

Sure, but I wouldn’t expect the api to force you to use an mmap when a slice of bytes would accomplish the same when unpersisted (and the user could choose to persist via a different mechanism if you have a .into() method that decays self into a Vec<u8>/Box<[u8]>/etc)

If I were to design this library, I would internally use an enum { Mapped(mmap), Direct(Box<[u8]>) } or better yet, delegate access and serialization/persistence to a trait so the type becomes BField<Impl> where the impl trait provides as_slice() and load()/save().

This way you abstract over the OS internals, provide a pure implementation for testing or no_std, and probably improve your codegen a bit.

Buttons8402y ago

Wonder if the error rate can be controlled?

boydOP2y ago

Yes you can manage the error rate by controlling the overall size of the allocated bit array and several other parameters. There's a (slightly obtuse) section on parameter selection here: https://github.com/onecodex/rust-bfield?tab=readme-ov-file#p...

And a Jupyter notebook example here: https://github.com/onecodex/rust-bfield/blob/main/docs/noteb...

We do need a better "smart parameter selection" method for instantiating a B-field on-the-fly.

judofyr2y ago

Very interesting and I'll have to read more to understand how it fully works, but _initially_ the space requirements doesn't seem too impressive? Am I missing something here? Is my calculation/assumption completely off? Maybe the solution here is more flexible?

One alternative approach for many of these problems is to start with a perfect minimal hash function which hashes your key into a unique number [0, N) and then have a packed array of size N where each element is of a fixed byte size. To look up the value you first use the hash function to get an index, and then you look up in the packed array. There's also no error rate here: This is exact.

PTHash (https://arxiv.org/abs/2104.10402) needs roughly ~4 bits per element to create a perfect minimal hash function.

> Store 1 billion web URLs and assign each of them one of a small number of categories values (n=8) in 2.22Gb (params include ν=8, κ=1, =0.1%; 19 bits per element)

Assuming that "n=8" here means "8 bits" we need 1GB (8 bits * billion) to represent all of the values, and then 500 MB for the hash function (4 bits * billion).

I also don't quite understand what "2.22Gb" here refers to. 19 bits * billion = 2.357 SI-giga bytes = 19 SI-giga bits = 2.212 gibi bytes.

> Store 1 billion DNA or RNA k-mers ("ACGTA...") and associate them with any of the ~500k bacterial IDs current described by NCBI in 6.93Gb (ν=62, κ=4, =0.1%; 59 bits per element)

"~500k bacterial ID" can be represented with 19 bits. 1 billion of these take ~2.3GB, and then we have the additional 500MB for the perfect hash function.

Another data structure which is even more fine-tuned for this problem space is Bumped Ribbon Retrieval (https://arxiv.org/abs/2109.01892) where they have <1% overhead over just storing the plain bit values.

EDIT: Aha! One thing I forgot about: The alternatives I mentioned above all have a construction cost. I've been playing with them in the 100k-1M range and they've all been pretty instant (<1s), but I don't have any experience in the billion range. Maybe it's too slow there?

c0deb0t2y ago

PTHash and other minimum perfect hash functions return an arbitrary value if the query key did not exist when building the MPHF, so they can be a lot smaller. B-field can identify query keys that don't exist in the set (with high probability?).

What I'm wondering is why the Kraken2 probabilistic hash table doesn't work. It uses 32 bits per element in an open addressing hash table. For 1 billion k-mers and 19 bits for the value, 32 - 19 = 13 bits of the key hash can be stored alongside the value, helping disambiguate hash collisions. If the load factor is 1.25x, then that's 4 * 10^9 * 1.25 = 5GB total, better than ~7GB. Also, this is only one cache miss (+ linear probing that can be SIMD accelerated) per lookup.

boydOP2y ago

> PTHash and other minimum perfect hash functions return an arbitrary value if the query key did not exist when building the MPHF, so they can be a lot smaller. B-field can identify query keys that don't exist in the set (with high probability?).

Yes, exactly.

> What I'm wondering is why the Kraken2 probabilistic hash table doesn't work.

I just skimmed the paper again (has been a while since a close reading), but my refreshed understanding is:

* Like the B-field, there are also false positives.

* When multiple hashed keys (k-mers) collide in the Kraken2 hash table, it has to store a "reduced" value for those key-value pairs. While there's an elegant solution for this issue for the problem of taxonomic classification (lowest common ancestor), it still results in a loss of specificity. There's a similar issue with "indeterminate" results in the B-field, but this rate can be reduced to ~0 with secondary arrays.

* The original Kraken2 paper describes using 17 bits for taxonomic IDs (~131K unique values). I don't know how many tax IDs current Kraken2 DB builds use offhand, but the error rate climbs significantly as you use additional bits for the value vs. key (e.g., to represent >=2^20 values, see Fig S4). I don't have a good sense for the performance and other engineering tradeoffs of just extending the hash code >32 bits. I also don't know what the data structure overhead is beyond those >32 bits/pair.

So, for a metagenomics classifier specifically, some subtle tradeoffs but honestly database quality and the classification algorithm likely matters a lot more than the marginal FP rates with either data structure -- we just happen to have come to this solution.

For other applications, my sense is a B-field is generally going to be much more flexible (e.g., supporting arbitrary keys vs. a specific fixed-length encoding) but of course it depends on the specifics.

ricardobeat2y ago

One huge downside of your suggested approach is that adding a single entry requires rebuilding the entire hash.

alexbowe2y ago

Interesting read, thanks for sharing!

If you have some benchmark results, it'd be great to see how it compares to traditional data structures in practice, for different datasets and varying k-mer lengths

boydOP2y ago

Thank you! The "Space Requirements" section in the README has a few examples, and your comment has made me realize our (micro-)benchmark link in the README is broken.

We'll get that fixed and maybe find the time to do a larger post with some benchmarks on both space/time tradeoffs and overall performance vs. other data structures.

xvilka2y ago

There's another library unrelated to the data structure but is from the same field - an interval tree structure - Lapper[1][2]

[1] https://github.com/sstadick/rust-lapper

[2] https://docs.rs/rust-lapper

neutrinobro2y ago

Ughh...the term B-field already has a very strong association with magnetic fields. I'm sure it sounded like a good name given the context, but these types of name-collisions generally makes searching for a specific topic more and more painful each year.

esafak2y ago

Old terms age out too, so it's not that bad. In the context of ML, for example, "generative models" meant something else twenty years ago. Nobody who's got into ML recently would even know what the old meaning is.

neutrinobro2y ago

I don't think Maxwell's equations are going anywhere anytime soon.

vslira2y ago

Great work, thanks for sharing!

In a somewhat tangent note, does anyone have a good resource for designing probabilistic data structures? At a high level, I'm looking for something that helps me understand what is and isn't feasible and, given a problem and constraints, how would I go on to design a specific DS for a problem. Doesn't need to be all that general, but something that is more than an analysis of existing structures

foota2y ago

I wonder... The comparison here is against a bloom filter, but is this actually more similar to a sketch?

Or... Actually this is sort of like a posting list (e.g., a list of places that a given document appears: https://en.m.wikipedia.org/wiki/Inverted_index)

esafak2y ago

It's a probabilistic associative array. A better benchmark is a Bloomier filter: https://en.wikipedia.org/wiki/Bloom_filter#Bloomier_filters

seffect2y ago

IIRC with a bloom filter if returns false you can be sure it is not in the set but if it returns true it probably is in the set but might be a clash giving a false positive?

Is the same true with this data structure.

I guess you could mitigate this by storing an additionally hash or the original key in it’s entirety as the value?

hexo2y ago

Is somewhere out there a non-rust version, please?

dmit2y ago

As in, any other language at all..?

j / k navigate · click thread line to collapse

36 comments

pkhuong2y ago

I think the main advantage of the bloom filter (or compressed bitmap) approach is it can be updated incrementally. MWHC-style representations are better suited to build once / read many workloads.

boydOP2y ago

In contrast, a B-field lets you map a key to an arbitrary number of (typically non-unique) values. So I could map a million elements to "1", another million to "2", etc.

I'm not especially current (or fluent!) in that literature though, so would love pointers to anything that doesn't have the above constraints.

pkhuong2y ago

1 more reply

danking002y ago

boydOP2y ago

Ah, we need to clarify the language! The B-field will always return the correct value for an inserted key.

False positives are only returned for keys that have not been inserted. This is akin to a Bloom filter falsely returning that a key is in the set).

dpc_012342y ago

I second that "The B-field will always return the correct or Indeterminate value for an inserted key." before listing classes of errors would clarify it by a lot.

ComputerGuru2y ago

Lots of applications tolerate missing results but significantly fewer can tolerate “unknowingly incorrect” results.

foota2y ago

It seems like this would be most suitable for a system aggregating data. As long as you aggregate enough data points that the error averages out, it wouldn't be an issue.

I guess another use case could be as any kind of "hint" where you need to do an authoritative lookup regardless of the filter lookup.

E.g., the file might be on this host, but you'll need to reach the host and check for the file either way, so if you go to the wrong host sometimes, it's not the end of the world.

That's something that's not possible with a bloom filter.

boydOP2y ago

I think those are both good examples of where you can manage the cost of a false positive.

Again, you'll never get the incorrect value for a key in the B-field, only for a key not in the B-field (which can return a false positive with a low, tunable error rate).

1 more reply

boydOP2y ago

> So it’s for cases where you have any key but associated with one of only (preferably few) discrete values

> all the docs say “designed for in-memory lookups”

> fyi, you use temp::temp_file() but never actually use the result, instead using the hard-coded /tmp path

Thank you, have opened an issue and we'll fix it!

ComputerGuru2y ago

This way you abstract over the OS internals, provide a pure implementation for testing or no_std, and probably improve your codegen a bit.

Buttons8402y ago

Wonder if the error rate can be controlled?

boydOP2y ago

And a Jupyter notebook example here: https://github.com/onecodex/rust-bfield/blob/main/docs/noteb...

We do need a better "smart parameter selection" method for instantiating a B-field on-the-fly.

judofyr2y ago

PTHash (https://arxiv.org/abs/2104.10402) needs roughly ~4 bits per element to create a perfect minimal hash function.

> Store 1 billion web URLs and assign each of them one of a small number of categories values (n=8) in 2.22Gb (params include ν=8, κ=1, =0.1%; 19 bits per element)

Assuming that "n=8" here means "8 bits" we need 1GB (8 bits * billion) to represent all of the values, and then 500 MB for the hash function (4 bits * billion).

I also don't quite understand what "2.22Gb" here refers to. 19 bits * billion = 2.357 SI-giga bytes = 19 SI-giga bits = 2.212 gibi bytes.

> Store 1 billion DNA or RNA k-mers ("ACGTA...") and associate them with any of the ~500k bacterial IDs current described by NCBI in 6.93Gb (ν=62, κ=4, =0.1%; 59 bits per element)

"~500k bacterial ID" can be represented with 19 bits. 1 billion of these take ~2.3GB, and then we have the additional 500MB for the perfect hash function.

c0deb0t2y ago

boydOP2y ago

Yes, exactly.

> What I'm wondering is why the Kraken2 probabilistic hash table doesn't work.

I just skimmed the paper again (has been a while since a close reading), but my refreshed understanding is:

* Like the B-field, there are also false positives.

ricardobeat2y ago

One huge downside of your suggested approach is that adding a single entry requires rebuilding the entire hash.

alexbowe2y ago

Interesting read, thanks for sharing!

If you have some benchmark results, it'd be great to see how it compares to traditional data structures in practice, for different datasets and varying k-mer lengths

boydOP2y ago

Thank you! The "Space Requirements" section in the README has a few examples, and your comment has made me realize our (micro-)benchmark link in the README is broken.

We'll get that fixed and maybe find the time to do a larger post with some benchmarks on both space/time tradeoffs and overall performance vs. other data structures.

xvilka2y ago

There's another library unrelated to the data structure but is from the same field - an interval tree structure - Lapper[1][2]

[1] https://github.com/sstadick/rust-lapper

[2] https://docs.rs/rust-lapper

neutrinobro2y ago

esafak2y ago

neutrinobro2y ago

I don't think Maxwell's equations are going anywhere anytime soon.

vslira2y ago

Great work, thanks for sharing!

foota2y ago

I wonder... The comparison here is against a bloom filter, but is this actually more similar to a sketch?

Or... Actually this is sort of like a posting list (e.g., a list of places that a given document appears: https://en.m.wikipedia.org/wiki/Inverted_index)

esafak2y ago

It's a probabilistic associative array. A better benchmark is a Bloomier filter: https://en.wikipedia.org/wiki/Bloom_filter#Bloomier_filters

seffect2y ago

IIRC with a bloom filter if returns false you can be sure it is not in the set but if it returns true it probably is in the set but might be a clash giving a false positive?

Is the same true with this data structure.

I guess you could mitigate this by storing an additionally hash or the original key in it’s entirety as the value?

hexo2y ago

Is somewhere out there a non-rust version, please?

dmit2y ago

As in, any other language at all..?

j / k navigate · click thread line to collapse