This information ought to be front and center, and it isn't.
No. It gives a very rough guide to "how much trouble is this spam filter going to be?". If you can say that X000 users have found only Y% of their email was misclassified, and I can compare against other spam filters, that's really useful.
So yes, too many variables for any one number to be accurate, but good enough to gauge average performance across a tribe of users.
Regarding statistics, rspamd uses an OSBF-Bayes classifier with 5-gram input (so it is not naive Bayes). I used the academic paper at http://osbf-lua.luaforge.net/papers/osbf-eddc.pdf as a reference. This algorithm is also used in the crm114 spam classifier. However, the Bayes classifier is a very small part of rspamd (unlike dspam, for example), and it could be almost useless if you have, let's say, 50 million user accounts. Rspamd is targeted at systems of that scale.
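To illustrate what windowed input means here, below is a minimal Python sketch of orthogonal sparse bigram (OSB) feature extraction over a 5-token window, in the style described in the OSBF/crm114 literature. It is not rspamd's code: the function name `osb_features`, the tuple layout, and the whitespace tokenizer are assumptions made for illustration, and the weighting and classification steps are left out entirely.

```python
from typing import Iterator, List, Tuple

WINDOW = 5  # tokens per sliding window, matching the 5-token input mentioned above


def osb_features(tokens: List[str], window: int = WINDOW) -> Iterator[Tuple[str, int, str]]:
    """Yield (earlier_token, gap, current_token) features.

    Each token is paired with each of the up to window-1 tokens before it.
    `gap` is the number of tokens skipped between the two, so the same word
    pair seen at different distances counts as a different feature.
    """
    for i, current in enumerate(tokens):
        for distance in range(1, window):
            j = i - distance
            if j < 0:
                break  # ran off the start of the message
            yield (tokens[j], distance - 1, current)


# Hypothetical input; a real tokenizer would be much more careful than split().
features = list(osb_features("buy cheap pills online now".split()))
```

Unlike a single-word model, each feature records a pair of words plus their distance, so "buy ... now" is learned separately from "buy" and "now" on their own.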
I agree that spam is a moving target, and that is why anti-spam systems need constant updating. Over the last 30 days, my current system rejected 87% of incoming mail (around 45k emails) and accepted 13%. Of that 13% (6600), around 300 were classified as spam by the Bayesian classifier in Thunderbird. Around 80 were manually classified as spam and added to Thunderbird's rules. The Thunderbird classifier probably misclassified 2 ham messages as spam. I don't know of any ham->spam errors in the initial filtering phase.
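For anyone who wants to turn those figures into comparable rates, here is a rough Python sketch of the arithmetic. It rests on assumptions the post does not state: that the 80 manually flagged messages are in addition to the 300 flagged by Thunderbird, that the 2 suspected false positives are among those 300, and that everything rejected in the first stage really was spam.

```python
rejected = 45_000              # "around 45k", stated as 87% of all mail
accepted = 6_600               # stated as 13% of all mail
flagged_by_thunderbird = 300
flagged_manually = 80          # assumed to be in addition to the 300
suspected_false_positives = 2  # ham that Thunderbird probably marked as spam

total = rejected + accepted
reject_share = rejected / total

# Spam that got past the first stage: everything flagged later,
# minus the flags that were probably wrong.
spam_past_first_stage = (
    flagged_by_thunderbird + flagged_manually - suspected_false_positives
)

# Assumption: every first-stage rejection was genuinely spam.
total_spam = rejected + spam_past_first_stage
first_stage_miss_rate = spam_past_first_stage / total_spam

# Spam that no automatic filter caught and that had to be flagged by hand.
end_to_end_miss_rate = flagged_manually / total_spam

ham = accepted - spam_past_first_stage
thunderbird_false_positive_rate = suspected_false_positives / ham

print(f"first-stage miss rate:  {first_stage_miss_rate:.2%}")
print(f"end-to-end miss rate:   {end_to_end_miss_rate:.2%}")
print(f"Thunderbird FP rate:    {thunderbird_false_positive_rate:.3%}")
```

These are one mailbox's numbers over 30 days, so they are a baseline to compare against rather than a benchmark.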
Should rspamd be expected to do better, about the same, or worse?
It most definitely is not. It's the most important factor when choosing a spam filter.
False positives are extremely harmful (they can result in loss of communication, which is what you want to avoid the most). A significant number of false positives is what would make the difference between useful and useless.
Nobody wants to tell their users "check your spam mailbox (the one with dozens of spam messages) for ham every once in a while".
Also, I suppose that the false positive/negative rate can only be given on a well-defined corpus. I'm not sure there is one that is a good representation of current and future spam trends, so in the end giving those numbers could be very misleading.
I see some interesting things like the surbl module but other than that this seems to be more like mimedefang (or that's at least what I've picked up from the landing page).
Also, are you considering support for multiple database drivers, or will you stick with sqlite3?
Multiple database drivers are planned for rspamd 1.0 (along with personal statistics and an advanced rules planner). The tricky part is that rspamd currently uses a non-blocking model, which few database drivers support (redis and some others excepted). However, rspamd has a concept of asynchronous threads executed in a thread pool, so something like a MySQL query could be executed within this thread pool without delaying the processing of other filters.
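The general idea of pushing a blocking query onto a thread pool while a non-blocking loop keeps running can be sketched in Python with asyncio. This is only an analogy, not rspamd's implementation: the `reputation` table, the function names, and the simulated latency are all made up for the example, and sqlite3 stands in for any blocking driver.

```python
import asyncio
import sqlite3
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_lookup(sender: str) -> int:
    """Stand-in for a query made through a blocking database driver."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE reputation (sender TEXT, score INTEGER)")
        conn.execute("INSERT INTO reputation VALUES (?, ?)", (sender, 5))
        time.sleep(0.2)  # simulate network/database latency
        row = conn.execute(
            "SELECT score FROM reputation WHERE sender = ?", (sender,)
        ).fetchone()
        return row[0]
    finally:
        conn.close()


async def cheap_filter(name: str, log: list) -> None:
    """Stand-in for a non-blocking filter that should not be delayed."""
    await asyncio.sleep(0.01)
    log.append(name)


async def scan(sender: str) -> tuple:
    loop = asyncio.get_running_loop()
    log: list = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        # The blocking call runs on a pool thread; the event loop stays free.
        db_future = loop.run_in_executor(pool, blocking_lookup, sender)
        await asyncio.gather(
            cheap_filter("headers", log), cheap_filter("urls", log)
        )
        # The other filters are done while the query is still in flight.
        db_still_running = not db_future.done()
        score = await db_future
    return score, log, db_still_running


score, log, db_still_running = asyncio.run(scan("someone@example.com"))
```

The connection is opened and used inside the worker thread, which sidesteps sqlite3's default restriction on sharing a connection across threads.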
Reasons for me not to give it a try:
- Mostly rule-based (which I think of as 'SA')
- No db support, as far as I could tell. My dspam keeps everything in a postgresql db, and I can easily back up and restore that along with all my other stuff (dovecot/postfix virtual users, for example)
- ~Easy~ to integrate into anything. Search for 'how can I make dovecot-antispam integrate with dspam' and you'll see that's been done a thousand times (and works nicely). I haven't found a decent number of rspamd resources
That said: my whole post basically says that I didn't try it (for reasons that were important to me). Their site looks interesting, and in the end I guess I'd love to hear about successful dspam->rspamd migrations as well.
Then, dspam started segfaulting, and none of my e-mail was delivered. I looked into what was going on, and it appeared that the dspam hash database had somehow become corrupted; and since dspam is completely unmaintained these days, it was unlikely that whatever bug I tripped upon would ever be fixed.
Sigh. I also would like to hear user reports about rspamd! I am getting sick of the false negative rate that I'm getting from SpamAssassin.
Rspamd appears to use sqlite3: https://rspamd.com/doc/workers/fuzzy_storage.html
Thanks, but... that's not quite what I had in mind. For one, support for a single database (sqlite or anything else) is usually not enough. I would hesitate to introduce a system that only supports mysql when everything else I run uses postgresql, for example. And on top of that, this schema is... limited. My dspam setup learns, and can do that for each and every user (though system-wide training seems to be the norm, as far as I can tell). This is really just a storage engine, as far as I can tell, and not really comparable.
That said: I guess I would give rspamd a try if I saw a lot of positive reviews/reports. It's just that it certainly doesn't do the same thing as dspam. It's quite a different animal.