This information ought to be front and center, and it isn't.
No. It gives a very rough guide to "how much trouble is this spam filter going to be?". If you can say that X000 users have found only Y% of their email was misclassified, and I can compare against other spam filters, that's really useful.
So yes, there are too many variables for any one number to be accurate, but it's good enough to gauge average performance across a tribe of users.
I agree that spam is a moving target, and that is why anti-spam systems need constant updating. Over the last 30 days my current system rejected 87% of incoming mail (around 45k emails) and accepted 13%. Of that 13% (about 6600 messages), around 300 were classified as spam by the Bayesian classifier in Thunderbird. Around 80 were manually classified as spam and added to Thunderbird's rules. The Thunderbird classifier probably classified 2 ham messages as spam. I don't know of any ham->spam errors in the initial filtering phase.
Should rspamd be expected to do better, about the same, or worse?
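To make a comparison like this easier, the figures quoted above can be turned into rough error rates. This is only a back-of-the-envelope sketch in Python. It assumes that the ~80 manually marked messages are separate from the ~300 the classifier caught, that the 2 suspected false positives are among those 300, and that no ham was rejected in the initial phase. The variable names are made up for illustration.

```python
# Approximate figures from the comment above (last 30 days).
rejected = 45_000      # rejected by the initial filter (~87%)
accepted = 6_600       # accepted and passed on to Thunderbird (~13%)
tb_flagged = 300       # flagged as spam by Thunderbird's Bayesian classifier
tb_false_pos = 2       # of those flagged, probably ham (assumed subset of the 300)
manual_spam = 80       # spam marked by hand (assumed separate from the 300)

total = rejected + accepted

# Assumption: no ham was rejected in the initial filtering phase.
spam_total = rejected + (tb_flagged - tb_false_pos) + manual_spam
ham_total = total - spam_total

rejected_share = rejected / total
false_positive_rate = tb_false_pos / ham_total    # ham wrongly flagged as spam
false_negative_rate = manual_spam / spam_total    # spam that reached the inbox

if __name__ == "__main__":
    print(f"rejected share:      {rejected_share:.1%}")
    print(f"false positive rate: {false_positive_rate:.3%}")
    print(f"false negative rate: {false_negative_rate:.3%}")
```

Under those assumptions this comes to roughly 0.03% of ham flagged as spam and roughly 0.18% of spam slipping through, which is the kind of pair of numbers another filter would have to be compared against.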
It most definitely is not. It's the most important factor when choosing a spam filter.
False positives are extremely harmful (they can result in loss of communication, which is what you want to avoid the most). A significant number of false positives is what would make the difference between useful and useless.
Nobody wants to tell their users "check your spam mailbox (the one with dozens of spam messages) for ham every once in a while".
I see some interesting things like the surbl module but other than that this seems to be more like mimedefang (or that's at least what I've picked up from the landing page).
Also, are you considering supporting multiple database drivers, or will you stick with sqlite3?
Multiple database drivers are planned for rspamd 1.0 (along with personal statistics and an advanced rules planner). The tricky part is that rspamd currently uses a non-blocking model, which few database drivers support (redis and some others excepted). However, rspamd has a concept of asynchronous threads executed in a thread pool. So something like a MySQL query could be executed within this thread pool without delaying the processing of other filters.
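The general idea of pushing a blocking database call into a thread pool so that a non-blocking loop keeps running can be sketched as follows. This is not rspamd's code (rspamd is written in C). It is a Python illustration using asyncio, with sqlite3 standing in for a blocking MySQL driver. The table, the function names and the "filters" are all hypothetical.

```python
import asyncio
import sqlite3
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_lookup(sender: str) -> int:
    """Hypothetical blocking query; sqlite3 stands in for a MySQL driver."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE reputation (sender TEXT, score INTEGER)")
        conn.execute(
            "INSERT INTO reputation VALUES (?, ?)", ("spammer@example.com", 10)
        )
        time.sleep(0.2)  # simulate the latency of a real database round trip
        row = conn.execute(
            "SELECT score FROM reputation WHERE sender = ?", (sender,)
        ).fetchone()
        return row[0] if row else 0
    finally:
        conn.close()


async def cheap_filter(name: str) -> str:
    """Hypothetical non-blocking filter that finishes quickly."""
    await asyncio.sleep(0.01)
    return name


async def scan(sender: str):
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=4) as pool:
        # The blocking query runs in a worker thread...
        db_future = loop.run_in_executor(pool, blocking_lookup, sender)
        # ...while the event loop keeps processing the other filters.
        finished = await asyncio.gather(
            cheap_filter("headers"), cheap_filter("urls")
        )
        score = await db_future
    return score, finished


if __name__ == "__main__":
    print(asyncio.run(scan("spammer@example.com")))
```

The other filters complete while the query is still sleeping in its worker thread, which is the "no delay for other filters" property described above.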
Reasons for me not to give it a try:
- Mostly rule-based (which I think of as 'SA')
- No db support, as far as I could tell. My dspam keeps everything in a postgresql db and I can easily back up and restore that with all my other stuff (dovecot/postfix virtual users, for example)
- dspam is ~easy~ to integrate into anything. Look for 'how can I make dovecot-antispam integrate with dspam': that's been done a thousand times (and works nicely). I haven't found a decent number of rspamd resources
That said: my whole post basically says that I didn't try it (for reasons that were important to me). Their site looks interesting and in the end I guess I'd love to hear about successful dspam->rspamd migrations as well.
Then, dspam started segfaulting, and none of my e-mail was delivered. I looked into what was going on, and it appeared that the dspam hash database had somehow become corrupted; and since dspam is completely unmaintained these days, it was unlikely that whatever bug I had tripped over would ever be fixed.
Sigh. I also would like to hear user reports about rspamd! I am getting sick of the false negative rate that I'm getting from SpamAssassin.
Rspamd appears to use sqlite3: https://rspamd.com/doc/workers/fuzzy_storage.html