- A false positive means marking data as a duplicate when it is not.
- Bloom filters are not associative: they can tell you that an item was probably seen before, but they cannot return it. So unless the application is very special and can find/identify/retrieve the data later in some other (likely inefficient?) way, a separate index is required anyway. But if you have a separate index of the data you already have, you can just use that to deduplicate. This index is the main memory cost of deduplicating storage.
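To make the two points above concrete, here is a minimal sketch of a textbook Bloom filter used for deduplication. It is not the library under discussion; the class `BloomFilter`, the helper `deduplicate`, and the sizes used are all made up for illustration. Note that the filter only answers "probably seen" or "definitely not seen" and has no way to hand the original data back.

```python
import hashlib


class BloomFilter:
    """Textbook Bloom filter: a bit array plus k hash positions per item."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k positions from two halves of one SHA-256 digest.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the stride is never zero
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "probably seen"; False means "definitely not seen".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def deduplicate(items, bloom):
    """Keep items the filter has not seen; drop anything it reports as seen."""
    kept = []
    for item in items:
        if item in bloom:
            continue  # dropped: either a real duplicate or a false positive
        bloom.add(item)
        kept.append(item)
    return kept


if __name__ == "__main__":
    distinct = [f"block-{i}".encode() for i in range(2000)]
    # Deliberately undersized filter, so false positives are easy to observe.
    kept = deduplicate(distinct, BloomFilter(num_bits=4096, num_hashes=3))
    print(f"{len(distinct) - len(kept)} distinct blocks were wrongly dropped as duplicates")
```

With a properly sized filter the number of wrongly dropped items becomes very small, but it is never guaranteed to be zero, and the filter alone cannot tell you which stored item a "duplicate" matched.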
Bloom filters are, of course, interesting in certain scenarios, e.g. to statistically reduce network traffic.
Yes, I originally wrote it for exactly such a scenario.
This is not for you if you cannot live with any chance of a false positive, even an extremely small one. Note that in most cases a false positive is acceptable if its probability is lower than the probability of a hardware error.
Note that just because it is probabilistic does not mean it's not useful in many cases. The default setup, with a memory of 1 million entries and a 1-in-1-million false positive chance, will run at ~8M of memory usage. This is quite an acceptable trade-off in many cases.
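For a rough sense of scale, the standard sizing formulas for a textbook Bloom filter are m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions, for n entries and false positive probability p. The sketch below simply evaluates them; the function name `optimal_bloom_parameters` is made up for illustration, and it is an assumption that these formulas apply to this library at all. For 1 million entries at 1 in 1 million they give about 3.6 MB, so the figure quoted above is in the same ballpark but larger, which would be consistent with a different structure or extra bookkeeping.

```python
import math


def optimal_bloom_parameters(num_entries, false_positive_rate):
    """Return (bits, hash_count) for a textbook Bloom filter.

    Uses m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2.
    """
    bits = math.ceil(-num_entries * math.log(false_positive_rate) / (math.log(2) ** 2))
    hash_count = max(1, round(bits / num_entries * math.log(2)))
    return bits, hash_count


if __name__ == "__main__":
    for rate in (1e-6, 1e-9):
        bits, hash_count = optimal_bloom_parameters(1_000_000, rate)
        print(f"p={rate:g}: {bits / 8 / 1e6:.2f} MB, {hash_count} hash functions")
```

These numbers are a lower bound for the classic design only; a real implementation may reasonably use more memory in exchange for speed or extra features.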
https://en.wikipedia.org/wiki/D%C3%A9j%C3%A0_Vu_(video_game)
Could you use this as the deduplication method in ZFS?
Depends on what you mean and on your constraints. Deduplicating file entries, if you can live with rare false positives? Sure. A setup with a 1 million entry limit and a 1-in-1-billion false positive chance will use ~80M of memory.
If you want to check for duplicate files, probably not.
> Could you use this as the deduplication method in ZFS?
I would say no.