The Autosort Bayesian Filter



Overview

FogBugz contains a sophisticated spam-blocking algorithm that learns how to recognize spam automatically as you train it. This algorithm is called Autosort. This article provides detailed information about how the Autosort filter works in FogBugz.

 

Information

 

How Does it Work?

Autosort does not rely on a fixed set of spam clues, such as assuming that “mortgage” must mean spam. Instead, it learns from your incoming email. If you work for a bank, “mortgage” probably does not mean spam.

Rather than using only positive clues (for example, “V1agra” probably means spam), FogBugz also learns from negative clues (for example, if the email contains the name of one of your products, it is much less likely to be spam). FogBugz examines many aspects of the incoming email for clues, each of which may be a positive sign of spam, a negative sign of spam, or neutral. And since you train it, it adapts to the particular stream of email that you receive.
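As a rough illustration of how a trained filter can turn message counts into positive, negative, and neutral clues, the sketch below follows the per-token formula published in Paul Graham's A Plan for Spam. It is not FogBugz's actual code: the function name, its parameters, and the sample counts are assumptions made for this example.

```python
def token_spam_probability(spam_count, ham_count, total_spam, total_ham):
    """Estimate how strongly one token suggests spam, from 0.01 to 0.99.

    Returns None when there is too little evidence, meaning the token
    is treated as neutral. All names here are hypothetical.
    """
    # Graham's essay doubles the non-spam count to bias against false positives.
    weighted_ham = 2 * ham_count
    if weighted_ham + spam_count < 5 or total_spam == 0 or total_ham == 0:
        return None
    spam_rate = min(1.0, spam_count / total_spam)
    ham_rate = min(1.0, weighted_ham / total_ham)
    # Clamp so that no single token can be treated as absolute proof.
    return max(0.01, min(0.99, spam_rate / (spam_rate + ham_rate)))
```

With 100 spam and 100 non-spam messages trained, a token seen in 40 spam messages and no legitimate ones scores 0.99 (a positive clue), a product name seen in 30 legitimate messages and no spam scores 0.01 (a negative clue), and a token seen only twice returns None (neutral).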

When you first install FogBugz and turn on FogBugz Autosort, FogBugz sets up a project named Inbox with three areas: Spam, Not Spam, and Undecided. At first, FogBugz Autosort has no clues at all about what messages are spam and what messages are not spam. All incoming messages are put straight into the Undecided area.

To train FogBugz Autosort, you need to teach it about every message in the Undecided area, either by clicking the Spam button to flag the message as spam or by moving it to the Not Spam area. Any time you see a message in the wrong area, take the time to move it to the right area. This will help train FogBugz Autosort.
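Conceptually, each training action updates a set of running counts. The following is a minimal sketch of that bookkeeping, assuming a hypothetical TrainingCounts class and a message that has already been split into tokens. FogBugz's real data structures are not documented here, and counting each token once per message is a simplification chosen for the example.

```python
from collections import Counter


class TrainingCounts:
    """Per-token message counts for spam and non-spam (hypothetical structure)."""

    def __init__(self):
        self.spam_tokens = Counter()
        self.ham_tokens = Counter()
        self.spam_messages = 0
        self.ham_messages = 0

    def train(self, tokens, is_spam, sign=1):
        """Add (sign=1) or remove (sign=-1) one message's tokens."""
        counts = self.spam_tokens if is_spam else self.ham_tokens
        for token in set(tokens):  # count each token once per message
            counts[token] += sign
        if is_spam:
            self.spam_messages += sign
        else:
            self.ham_messages += sign

    def correct(self, tokens, now_spam):
        """Re-file a misfiled message: undo the old label, then apply the new one."""
        self.train(tokens, not now_spam, sign=-1)
        self.train(tokens, now_spam, sign=1)
```

In this sketch, clicking the Spam button would correspond to train(tokens, is_spam=True), and moving a message out of the wrong area would correspond to correct(...), which is why fixing misfiled messages improves later sorting.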

After a few days, you should notice that FogBugz Autosort is correctly sorting most messages. In the first few days, there is a small chance that a few messages will be mistakenly flagged as spam. Do not worry about this, but do move them into the Not Spam area to help train FogBugz Autosort.

After you have received a good sample of both spam and non-spam, typically after a couple of days or about 100 to 200 messages, you will find that FogBugz Autosort does a very good job of sorting messages automatically. But no matter how good it gets, it will always be undecided about some messages, and you will have to decide those cases yourself.

FogBugz Autosort tries to be conservative to avoid accidentally flagging a message as spam when it is not spam. In practice, we have found that even with an email address that receives hundreds of spam messages a day, it is extremely rare for FogBugz Autosort to accidentally mark a legitimate email as spam. Our experience is that it is more common for humans to mistake a real email for spam than for FogBugz Autosort to make this mistake!
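One common way to get this kind of conservative behavior is to require much stronger evidence before filing a message as spam than before filing it as legitimate, and to leave everything in between for a human. The sketch below illustrates the idea only: the threshold values and the choose_area function are assumptions for this example, not FogBugz settings.

```python
SPAM_THRESHOLD = 0.95      # hypothetical: at or above this, file as Spam
NOT_SPAM_THRESHOLD = 0.20  # hypothetical: at or below this, file as Not Spam


def choose_area(spam_probability):
    """Map a message's overall spam probability to one of the three areas."""
    if spam_probability >= SPAM_THRESHOLD:
        return "Spam"
    if spam_probability <= NOT_SPAM_THRESHOLD:
        return "Not Spam"
    # Anything in between is left for a person to decide.
    return "Undecided"
```

Raising the hypothetical SPAM_THRESHOLD makes false positives rarer at the cost of more messages landing in Undecided, which is the trade-off described above.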

Unfortunately, there is always the possibility that a legitimate email from a customer will look so spammy that it gets deleted accidentally. If you are concerned about this, set aside some time to review the spam messages every few days just to be certain nothing legitimate is getting lost. Overall, though, you will find that FogBugz Autosort does a great job with very few “false positives.”

To save you time, FogBugz treats emails sorted as spam slightly differently. You will not receive notifications, auto-replies, or escalation reports regarding spam emails. Spam emails are also conveniently hidden from most views of your cases and summary reports, although they are still accessible with the click of a link.

 

Implementation Details

FogBugz implements a version of the Bayesian filtering algorithm proposed by Paul Graham in the articles A Plan for Spam and Better Bayesian Filtering, with modifications and improvements designed by Fog Creek technical staff.
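For readers unfamiliar with those articles, the sketch below shows the combining step as described in A Plan for Spam: the per-token probabilities furthest from neutral are merged into one overall spam probability. This reflects Graham's published approach only. The Fog Creek modifications are not described in this article, and the function name and parameters are assumptions for this example.

```python
from math import prod


def combined_spam_probability(token_probabilities, max_tokens=15):
    """Combine per-token probabilities (each strictly between 0 and 1)."""
    # Keep the most "interesting" tokens: those furthest from the neutral 0.5.
    interesting = sorted(
        token_probabilities, key=lambda p: abs(p - 0.5), reverse=True
    )[:max_tokens]
    if not interesting:
        return 0.5  # no evidence either way
    spam_evidence = prod(interesting)
    ham_evidence = prod(1.0 - p for p in interesting)
    return spam_evidence / (spam_evidence + ham_evidence)
```

A message whose tokens score 0.99, 0.99, and 0.2 combines to a probability above 0.99, while a message with no scored tokens stays at the neutral 0.5.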

 
