Déjà vu: Fast efficient probabilistic deduplication

(github.com)

48 points | f483 | 8y ago | 12 comments

12 comments

blattimwind 8y ago

Bloom filters don't seem useful as the primary means of deduplication for an actual data storage system to me.

- False positives mean marking data as duplicate when it's not.

- Bloom filters are not associative. So unless the application is very special and can find/identify/retrieve the data later in some other (likely inefficient?) way, a separate index is required anyway. But if you have a separate index of the data you already have, you can just use that to deduplicate. This index is the main memory cost of deduplicating storage.

Bloom filters are of course interesting in certain scenarios e.g. to statistically reduce network traffic.
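
A minimal sketch of the two objections above, assuming a toy filter built on SHA-256 with double hashing. The `BloomFilter` class and `seen_before` helper are hypothetical names for illustration, not the linked project's API. The structure keeps only bits, so it can answer "probably seen" but can never hand the data back, and an overfull filter will claim to have seen items it never stored.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a bit array plus k hash positions per item."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        # True means "probably present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def seen_before(bf: BloomFilter, item: bytes) -> bool:
    """Report a probable duplicate, recording the item if it is new."""
    if item in bf:
        return True  # could be a false positive
    bf.add(item)
    return False
```

Nothing here maps a key to its stored block, which is why a separate index is still needed whenever the data has to be found again.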

f483 OP 8y ago

> Bloom filters are of course interesting in certain scenarios e.g. to statistically reduce network traffic.

Yes, I originally wrote it for exactly such a scenario.

This is not for you if you cannot live with any chance of a false positive, even an extremely small one. Note that in most cases a false positive is acceptable if its chance is less than the chance of a hardware error.

Note that just because it is probabilistic does not mean it's not useful in many cases. The default setup with a 1 million entry memory and a 1-in-1-million false positive chance will run at ~8M of memory. This is quite an acceptable trade-off in many cases.
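
For a rough sense of scale, here is a sketch of the textbook sizing formulas for an idealized Bloom filter, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The `bloom_parameters` helper is a hypothetical name, and the result covers only the bare bit array. It makes no attempt to reproduce the memory figures quoted in this thread, which are the author's own numbers for his implementation.

```python
import math


def bloom_parameters(n: int, p: float) -> tuple[int, int]:
    """Return (bits, hash_count) for n items at target false positive rate p."""
    bits = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    hashes = max(1, round(bits / n * math.log(2)))
    return bits, hashes


# The two setups mentioned in the thread: 1 million entries at 1e-6 and 1e-9.
for p in (1e-6, 1e-9):
    bits, hashes = bloom_parameters(1_000_000, p)
    print(f"p={p:g}: {bits / 8 / 1e6:.1f} MB, {hashes} hash functions")
```

Under these assumptions the bit array alone comes to roughly 3.6 MB for a 1-in-1-million rate and 5.4 MB for 1-in-1-billion. Memory grows linearly with the entry count but only logarithmically as the false positive target tightens.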

maxdemarzi 8y ago

May want to take a look at using a Cuckoo Filter instead of a Bloom Filter. See https://maxdemarzi.com/2017/07/13/using-a-cuckoo-filter-for-...

f483 OP 8y ago

Thanks for the feedback, will look into it.

adrianratnapala 8y ago

Perhaps like the other ZFS fanboys on this thread, I was hoping that "efficient" would by some magic mean "sublinear in memory", but it doesn't look like that. :(

f483 OP 8y ago

Sorry, I can't do magic just yet :(

jwilk 8y ago

Poor name choice.

https://en.wikipedia.org/wiki/D%C3%A9j%C3%A0_Vu_(software)

https://github.com/worldveil/dejavu

https://github.com/appbaseio/dejavu

https://github.com/IndigoUnited/js-dejavu

f483 OP 8y ago

Yeah, I should have put more thought into it than I did. But I guess it's too late now.

chongli 8y ago

You missed one! One of my favourite adventure games!

https://en.wikipedia.org/wiki/D%C3%A9j%C3%A0_Vu_(video_game)

ComputerGuru 8y ago

I must say his is the most-fitting use of the name, though.

bradknowles 8y ago

Could you use this to deduplicate files in a filesystem?

Could you use this as the deduplication method in ZFS?

f483 OP 8y ago

> Could you use this to deduplicate files in a filesystem?

Depends on what you mean and your constraints. Deduplicating file entries, if you can live with rare false positives, sure. A setup with a 1 million entry limit and a 1-in-1-billion false positive chance will use ~80M of memory.

If you want to check for duplicate files, probably not.

> Could you use this as the deduplication method in ZFS?

I would say no.
