I haven't wrapped my head around how this helps speed up queries while data is being ingested.
Adtech is an example of a sector that benefits from this: they slice and dice datasets a lot to target ad campaigns and such, and being able to do that quickly is useful.
I guess it's easy for me to visualize both row-based and column-based storage. I'm struggling with the bitmaps concept.
000 - the record has no animal associations yet
001 - the record is associated with having a "mouse" included
111 - the record is associated with having a "dog", a "cat", and a "mouse" included
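A minimal sketch of that idea (names and data are made up for illustration; real engines use compressed bitmap formats, but plain Python ints work as bitmaps here):

```python
# One bitmap per animal; bit i is set if record i has that animal.
# Python ints serve as arbitrary-length bitmaps.
bitmaps = {
    "dog":   0b0110,  # records 1 and 2 have a dog
    "cat":   0b0010,  # record 1 has a cat
    "mouse": 0b1011,  # records 0, 1, and 3 have a mouse
}

# "Which records have both a dog AND a mouse?" is a single bitwise AND.
both = bitmaps["dog"] & bitmaps["mouse"]

# Decode the set bits back into record ids.
matches = [i for i in range(both.bit_length()) if (both >> i) & 1]
print(matches)  # [1] -- record 1 is the only one with both
```

The point is that the query never touches the records themselves, only the per-value bitmaps.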
In the past, high-cardinality data sets weren't a good fit for bitmap indexes, but nowadays there are ways around this (compressed bitmap formats, for example). So that list of animals could be quite large.
The primary reason it's so much faster is that a single bitwise instruction operates on a whole machine word at once: one 64-bit AND effectively checks 64 records, and SIMD widens that further. That makes these queries extremely fast.
Hopefully this makes it easy to find, for example, all the bills that your specific congressperson was involved in.
FeatureBase could be the "feature store" in the middle of the batch prediction section's diagram, or simply be a drop-in replacement for the model registry.
But really it's useful anytime you need low latency analytics on fresh data.