Forwarding Address: OS X

Sunday, December 08, 2002

(This is rather long-winded. Short version: check out SpamSieve.)

It's been a while since I posted about spam filtering. For many months I've been happily using mailfilter by Andreas Bauer, and I even put up some instructions to help other people use it. It has killed over 20,000 pieces of spam for me. However, with 300+ rules I have hit the point of diminishing returns -- it's hard to add individual rules that make much of an improvement in the overall accuracy, and the whole shebang is higher-maintenance than I want it to be. And, while in general I love mailfilter's kill-it-on-the-server method, it's a real bummer when you get that rare false positive, because you have to 1) notice that it happened, 2) dig the sender's e-mail address out of the log, and 3) beg for a re-send.

Lots of people use and love SpamAssassin. It's great. However, it can be complex to set up and maintain, and you've got to keep it up to date as spammers learn to game its rules.

What if you had software that, instead of using tons of rules by clever humans, simply learned by example what spam is, and kept learning?

That's the hot thing in spam fighting now -- Bayesian filtering. I'll leave the details to smarter people, but it is essentially a statistical method in which individual tokens (words) are mapped to probabilities. For example, a quick look at my spam log of 700+ recent spams shows that my last name shows up in 4 spams and 254 "good" messages, making it a strong (but not absolute) indicator of non-spam. Conversely, the term "hcode" shows up in 304 spam messages and no legitimate messages, making it a very good indicator of spam. What's "hcode"? I have no idea -- something that shows up in spammers' HTML a lot, I'd guess. It's obviously incredibly predictive, yet I never would have created a rule to look for it.
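To make the token-to-probability idea concrete, here is a rough sketch in Python. It is not how SpamSieve or any particular filter actually computes things; it is just one simple way to turn counts like the ones above into a per-token spam probability. The corpus sizes (700 spams, 700 good messages) are made-up assumptions for illustration, since I only quoted the hit counts, and the function name and the 0.01/0.99 clamping are my own choices.

```python
def token_spam_probability(spam_hits, ham_hits, n_spam, n_ham,
                           floor=0.01, ceiling=0.99):
    """Estimate how strongly one token points toward spam.

    spam_hits / ham_hits: number of spam / good messages containing the token.
    n_spam / n_ham: total messages in each pile (assumed values in the demo).
    The result is clamped so that no single token is ever treated as absolute proof.
    """
    spam_rate = spam_hits / n_spam
    ham_rate = ham_hits / n_ham
    if spam_rate + ham_rate == 0:
        return 0.5  # never-seen token: no evidence either way
    p = spam_rate / (spam_rate + ham_rate)
    return min(ceiling, max(floor, p))


# Hit counts from the post; the pile sizes of 700 are hypothetical.
last_name = token_spam_probability(4, 254, n_spam=700, n_ham=700)
hcode = token_spam_probability(304, 0, n_spam=700, n_ham=700)
print(f"last name: {last_name:.3f}  hcode: {hcode:.3f}")
```

Under those assumed pile sizes, my last name comes out close to 0 (strongly non-spam) and "hcode" hits the ceiling (strongly spam), which matches the intuition in the paragraph above.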

That's the beauty of this approach. Instead of trying to cleverly create individual rules that identify spam, you simply feed your Bayesian engine a pile of spam, and a pile of good mail, and it learns the difference. (It does weighting like SpamAssassin, but instead of weighting rules, it individually weights every unique word.) Read Paul Graham's highly influential "A Plan for Spam" essay for more on this. Really, read it. It's excellent.
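For the curious, the "feed it two piles and let it learn" idea can be sketched in a few lines of Python. This is a bare-bones naive Bayes classifier of my own, not Paul Graham's exact formula and not the code inside any real product; the class name, the tokenizer, and the add-one smoothing are all assumptions made for the sake of a small example.

```python
import math
import re
from collections import Counter


class TinyBayesFilter:
    """Toy word-level Bayesian filter (illustrative sketch only)."""

    def __init__(self):
        self.spam_counts = Counter()  # token -> number of spams containing it
        self.ham_counts = Counter()   # token -> number of good messages containing it
        self.n_spam = 0
        self.n_ham = 0

    @staticmethod
    def tokenize(text):
        # Each distinct lowercase word counts once per message.
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def train(self, text, is_spam):
        tokens = self.tokenize(text)
        if is_spam:
            self.spam_counts.update(tokens)
            self.n_spam += 1
        else:
            self.ham_counts.update(tokens)
            self.n_ham += 1

    def spam_probability(self, text):
        # Sum the log-odds contributed by every word, with add-one smoothing
        # so that unseen words never produce a division by zero.
        log_odds = math.log((self.n_spam + 1) / (self.n_ham + 1))
        for tok in self.tokenize(text):
            p_spam = (self.spam_counts[tok] + 1) / (self.n_spam + 2)
            p_ham = (self.ham_counts[tok] + 1) / (self.n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


f = TinyBayesFilter()
f.train("cheap pills hcode buy now", is_spam=True)
f.train("hcode win money now", is_spam=True)
f.train("lunch on sunday with the family", is_spam=False)
f.train("notes from the sunday meeting", is_spam=False)
print(f.spam_probability("hcode buy pills"))     # should land well above 0.5
print(f.spam_probability("sunday lunch notes"))  # should land well below 0.5
```

Notice there is not a single hand-written rule in there. Correcting a mistake is just another call to train() with the right label, and every word in the message gets re-weighted.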

Actual usable software using these techniques includes Eric S. Raymond's (yes, that Eric S. Raymond) bogofilter, various applications of the ifile utility, and (finally, here we are) the product I'm trying out right now: SpamSieve.

SpamSieve works with Eudora (which is what I use), Entourage, Mailsmith, Powermail, and Emailer. (What about Mail.app, you ask? Well, reportedly, Mail.app's built-in spam killer is actually a Bayesian filter, and works great. So if you are a Mail.app user, I guess you're all set...)

You train SpamSieve with a batch of good messages and a batch of bad, then correct it whenever it screws up. It learns fast.

People who are wrapped up in making clever rules for complex filtering systems can't believe how effective Bayesian filtering is. After just a few days of training, SpamSieve has exceeded the accuracy of my mailfilter setup (about 95%) and is still climbing -- even messages that it correctly identifies as spam are used to improve its accuracy. People have reported accuracy of up to 99.5% with Bayesian filtering.

I am fully convinced that this is the future of spam-killing. (Well, actually, I think there's also a place for Brightmail-type honeypots, but that's another story.) Check it out.