How antispam filters work

The previous post on spam covered the checkpoints where antispam filters can be applied. However, the nature of the filters matters a lot when setting up an effective antispam solution.

Manual moderation

Sadly, manual moderation is the most effective method. No filter is sophisticated enough to deal with every possible spam attack, and a human decision is the best authority here. Dealing with everything manually is certainly unproductive, but it’s not right to avoid manual work altogether. The more human input and tuning filters get, the more effective they can become.

Custom filters

The most primitive form of filtering is custom rules. Something as simple as “if the address is not mine then it’s spam” can sometimes do wonders. Creating filters by hand may not be productive, but it’s often pretty effective.
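
As a rough sketch, the "address is not mine" rule could look like this in Python. The address, the message layout (a dict with a "to" list) and the function names are all made up for illustration.

```python
# Hypothetical addresses that really belong to me.
MY_ADDRESSES = {"me@example.com"}


def not_addressed_to_me(message):
    """Rule: none of the recipients is one of my own addresses."""
    recipients = {address.lower() for address in message["to"]}
    return not (recipients & MY_ADDRESSES)


# Custom rules are plain functions; add more as spam patterns show up.
RULES = [not_addressed_to_me]


def is_spam(message):
    # A message is treated as spam if any hand-written rule fires.
    return any(rule(message) for rule in RULES)
```

Keeping rules as a list of small functions makes it cheap to add or drop one after each round of manual moderation.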

White list

A white list assumes that every message that doesn’t come from a previously approved sender is spam. It’s obviously good at keeping spam out, together with every message from a new sender. While it has an extreme rate of false positives on its own, white listing is often applied first to get important messages through without the risk of losing them to other filters. Some systems can automatically add sources that have passed moderation once to the white list.
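
Here is a minimal Python sketch of the "white list first" idea, including automatic approval after moderation. The names (approved_senders, classify, content_filter) are my own assumptions, not taken from any particular product.

```python
# Hypothetical set of senders that were approved earlier.
approved_senders = {"alice@example.com"}


def classify(sender, body, content_filter):
    """Return 'accept', 'spam' or 'moderate' for one message."""
    # Approved senders skip every other filter, so they are never lost.
    if sender.lower() in approved_senders:
        return "accept"
    # Everyone else goes through the remaining filters (passed in as a function).
    return "spam" if content_filter(body) else "moderate"


def approve_after_moderation(sender):
    # Called when a human moderator lets a message from this sender through.
    approved_senders.add(sender.lower())
```

Unknown senders end up in moderation here rather than being rejected outright, which softens the false positive problem.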

Black list

A black list checks a message against a list of text strings and/or senders and assumes it is spam if a match is found. A black list is extremely effective against huge volumes of spam with roughly the same content. Overall, a black list is as good as you can make (or get) it. There is also a danger of false positives if the list reacts to words and phrases that can appear in legitimate messages.
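
A small Python sketch of a black list with both phrases and senders follows. The entries are invented examples. Matching on word boundaries is one possible way to cut down false positives, not the only one.

```python
import re

# Hypothetical black list entries.
BLOCKED_PHRASES = ["cheap pills", "casino bonus"]
BLOCKED_SENDERS = {"bulk@spam.example"}

# \b keeps a phrase from matching inside a longer, innocent word.
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in BLOCKED_PHRASES) + r")\b",
    re.IGNORECASE,
)


def is_blacklisted(sender, body):
    """True if the sender or any phrase in the body is on the black list."""
    return sender.lower() in BLOCKED_SENDERS or PATTERN.search(body) is not None
```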

Karma

Karma is like a light mix of white and black lists. It takes past moderation events and calculates modifiers for specific senders. It’s considered effective in the long term, but it doesn’t help with messages from new senders or with human spammers who may start with a few valid messages first.
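
To make the idea concrete, here is a toy karma tracker in Python. The weights (+1 for an approved message, -2 for spam) and the way karma shifts a score from other filters are arbitrary assumptions for illustration.

```python
# Sender -> accumulated karma from past moderation decisions.
karma = {}


def record_moderation(sender, was_spam):
    # Spam costs more than a good message earns (assumed weights).
    karma[sender] = karma.get(sender, 0) + (-2 if was_spam else 1)


def adjusted_score(sender, base_score):
    """Shift a spam score from other filters (higher = more suspicious).

    Good karma lowers the score, bad karma raises it, and an unknown
    sender gets no modifier at all.
    """
    return base_score - karma.get(sender, 0)
```

The last case is exactly the weak spot: a sender with no history gets no modifier.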

Bayesian filtering

This one is based on pure math and is extremely effective. Bayesian filters keep a database of all the words they have ever encountered and how often each occurred in spam and non-spam messages. Upon receiving a new message, the filter looks up all the words in it and calculates the probability of it being spam. The downside is that it relies on manual correction and is slightly susceptible to poisoning, when a big chunk of valid text is used to carry a small chunk of spam along. It doesn’t handle randomly generated text well either.
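
The word counting and probability lookup can be sketched in Python as a simplified naive Bayes classifier. This is my own minimal version, assuming add-one smoothing and counting each word once per message. Real filters use more refined tokenizing and scoring.

```python
import math
import re
from collections import Counter


class BayesFilter:
    def __init__(self):
        # Per-label count of messages containing each word.
        self.words = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    @staticmethod
    def tokens(text):
        # Unique lowercase words; each counts once per message.
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def train(self, text, label):
        # This is the manual correction step: label is "spam" or "ham".
        self.messages[label] += 1
        self.words[label].update(self.tokens(text))

    def spam_probability(self, text):
        spam_n, ham_n = self.messages["spam"], self.messages["ham"]
        # Start from the smoothed prior odds, then add evidence per word.
        log_odds = math.log((spam_n + 1) / (ham_n + 1))
        for word in self.tokens(text):
            p_spam = (self.words["spam"][word] + 1) / (spam_n + 2)
            p_ham = (self.words["ham"][word] + 1) / (ham_n + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp so math.exp cannot overflow on very long messages.
        log_odds = max(-700.0, min(700.0, log_odds))
        return 1 / (1 + math.exp(-log_odds))
```

A word the filter has never seen leaves the odds unchanged, which hints at why randomly generated text is hard for it.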

Behavior analysis

Instead of analyzing the message, this method tries to analyze the sender. It looks for signs that are common for human-operated software but uncommon for bots. On the web, the ability to process JavaScript is very often used as an indicator, but it gives false positives on old browsers (or those that have JS disabled).

Proof of work

This method makes the sender perform additional tasks. They may be tasks that only humans can perform (captchas, questions) or computational ones. The latter is kind of an upgraded behavior analysis with the additional effect of slowing the spam bot down.
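
For the computational kind, here is a hashcash-style sketch in Python. The sender must find a number that makes a hash start with enough zero bits, and the receiver verifies it with a single hash. The challenge format and the difficulty value are assumptions for illustration.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest):
    """Count zero bits at the start of a bytes digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge, nonce, difficulty):
    # Cheap for the receiver: one hash per submitted message.
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge, difficulty):
    # Expensive for the sender: about 2**difficulty hashes on average.
    for nonce in count():
        if verify(challenge, nonce, difficulty):
            return nonce
```

A single legitimate message barely notices the cost, while a bot sending millions of them has to pay it every time.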

Removing value

Google popularized the nofollow attribute for links, claiming it would reduce online spam by removing the value of spammed links for search engine optimization. Well, it had no effect on spam at all, and nofollow was turned into a weapon against paid advertising links. Removing value is extremely ineffective because carpet bombing is the main concept behind spam. Spammers don’t really care to check whether messages bring value on a case-by-case basis.

Poisoning

Poisoning tries to render a spam bot ineffective, usually by feeding it a huge amount of falsified data. It’s not widely used and its effectiveness is questionable.

Honey pot

This method tries to detect spam by inviting an extra action that a human wouldn’t perform in the same conditions. An extra field in a form that says “don’t fill me” is the usual example.
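
On the server side the check can be tiny. A Python sketch, assuming the submitted form arrives as a dict and the trap field is called "website_url" (a made-up name):

```python
def looks_like_bot(form_data, trap_field="website_url"):
    """True if the "don't fill me" field came back with any content."""
    # Humans leave the trap empty; bots tend to fill every field they find.
    return bool(form_data.get(trap_field, "").strip())
```

A missing or whitespace-only trap field counts as empty here, so browsers that omit empty fields are not punished.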

Dynamic method of sending

Periodically changing the method used to send a message prevents bots from remembering it. Disposable email addresses and changing contact forms fall under this. It can be effective (if automated) at reducing the amount of spam but can’t eliminate it completely. It can also lead to lost messages if an expired method is used to send a valid one.
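
One way to automate this is to derive the contact address from the current time period, as in the Python sketch below. The secret, the weekly period, the address format and the grace window are all assumptions of mine. The grace window is an attempt to soften the lost message problem.

```python
import hashlib
import hmac

SECRET = b"change-me"        # hypothetical server-side secret
PERIOD = 7 * 24 * 3600       # rotate weekly (assumed), in seconds


def address_for(now, domain="example.com"):
    """Contact address valid for the period containing timestamp `now`."""
    period = int(now // PERIOD)
    tag = hmac.new(SECRET, str(period).encode(), hashlib.sha256).hexdigest()[:8]
    return f"contact-{tag}@{domain}"


def is_current(address, now, grace_periods=1):
    # Accept the current address plus a few recently expired ones.
    valid = {address_for(now - i * PERIOD) for i in range(grace_periods + 1)}
    return address in valid
```

Call it with something like `address_for(time.time())` when rendering the contact page, and with `is_current(...)` when a message arrives.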

Spam collection

This method usually relies on collaboration between multiple participants to create a huge spam database. Messages are simply checked against it by hash or otherwise. The effect depends on database quality, and the downside is that such a database can be poisoned into treating valid messages as spam.
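
A toy version of such a database in Python, checking messages by hash. Requiring reports from several distinct participants before trusting an entry is my own assumption about how poisoning might be made harder. Real services have their own schemes.

```python
import hashlib
import re


def fingerprint(body):
    """Hash of the message with whitespace and letter case normalized."""
    normalized = re.sub(r"\s+", " ", body).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SharedSpamDB:
    def __init__(self, min_reports=2):
        self.reports = {}              # fingerprint -> set of reporter ids
        self.min_reports = min_reports

    def report(self, reporter, body):
        # Repeated reports from the same participant count only once.
        self.reports.setdefault(fingerprint(body), set()).add(reporter)

    def is_spam(self, body):
        return len(self.reports.get(fingerprint(body), ())) >= self.min_reports
```

An exact hash only catches near-identical copies, which is why the post says "by hash or otherwise".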

The next post in the series is going to cover some factors to consider when choosing a filter and examples from my personal experience.

7 Comments

Lyndi 2008-10-31 #

Very interesting. I am not going to pretend that I understand all of this but at least I now do know a bit more about the anti-spam methods available out there.
Rarst 2008-10-31 #

@Lyndi Most antispam is marketed as install and get rid of spam. It's not that easy... and techies really like to pick their tools. And in case of spam it's really lot of tools to choose from.
How anti-spam filters work 2008-11-21 #

There are a few anti-spam tools that combine a lot of these filters and actually make it easier to get good spam capture rates out of the box. There's a balance that vendors have to strike between offering a multitude of spam filters to improve capture rates and providing a relatively easy and effective set up. The latter is extremely difficult to achieve and not for the faint hearted, as you rightfully pointed out Rarst. Great post!
Rarst 2008-11-21 #

@How anti-spam filters work Thanks for your visit and comment. I am planning third post about spam to wrap this series so be sure to drop by or subscribe to my feed. :)
emailspam 2009-01-09 #

This is a very interesting post - good summary of a lot of different features (whether they work or not). We work hard on finding that balance between a filter that filters out only the spam but never filters out what you want. It is definitely a full time job.
Rarst 2009-01-09 #

@emailspam Thanks to getting to my spam series and welcome to blog. I start to feel power of Twitter. :) Indeed balance is extremely important in spam filtering. But lots of solutions are promoted as silver bullet while it's area where cheap shortcuts can do a lot of harm.
emailspam 2009-01-10 #

There is no such thing as a silver bullet for sure. Email spam filter is like a baby you have to feed all the time, but most people don't have time for that, so they have to go to someone else who at least claims they do.
