Spam Filtering

Last updated October 2005

The Good

DNS Block Lists

DNS Block Lists (DNSBLs) are lists of IP addresses that have been identified as sources of spam. The lists are published through DNS, which allows for rapid queries and updates. There are dozens of DNSBLs in common use, and each one is typically maintained by a few (1-3) core members who have control over all additions and removals. Each list usually has a specialty or fixed set of rules for determining inclusion; for example, one list may focus on open relays, while another list may focus on machines that are distributing viruses.

DNSBLs are very efficient. With most spam-blocking techniques, each message must be transferred (increased bandwidth), stored (increased disk usage), and processed (increased CPU time) in order to determine whether or not it is spam. By contrast, a DNSBL query requires very little bandwidth, and messages are rejected before the contents are ever sent.
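To illustrate why a DNSBL query is so cheap, here is a minimal sketch of a single lookup. It assumes the usual DNSBL convention for IPv4: the octets of the connecting address are reversed, the list's zone name is appended, and the resulting name is looked up as an ordinary A record. An answer means "listed"; no answer means "not listed". The function names and the zone dnsbl.example.org are made up for illustration.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up for an IPv4 address in a DNSBL zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str, resolve=socket.gethostbyname) -> bool:
    """Return True if the zone answers for this address.

    `resolve` is injectable so the logic can be exercised without network
    access; by default it performs a real DNS lookup.
    """
    try:
        resolve(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        # No record, or the lookup failed: treat the address as not listed.
        return False
    return True
```

A mail server would call something like is_listed() as soon as a client connects, before any message content has been transferred. Treating lookup failures as "not listed" is a design choice of this sketch: it favors accepting mail when a list is unreachable.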

DNSBLs are also very effective. There is no limit to the number of lists that may be used, and each person is able to choose a set that fits their individual needs. Using only a couple of the less aggressive lists, it is possible to block a majority of spam with virtually no risk of false positives. Adding a few of the more aggressive lists will cut spam to a trickle.

[Note: DNSBLs are so effective that virtually all of the major list servers have been the targets of prolonged DDoS attacks by frustrated spammers.]

The primary drawback of DNSBLs is that you have to put trust in a group of strangers to do the right thing. If the DNSBL maintainers aren't continually updating the database, new spam-friendly IP addresses won't be added and old IP addresses that are no longer being used for spam won't be removed. Furthermore, the maintainers must be trusted not to add IP addresses to the list without sufficient reason.

It is important to find DNSBLs that are trustworthy, reliable, effective, and have acceptably low rates of false positives. Though everyone will have their own idea of what constitutes "acceptable", here are a few places to start:

Spamhaus
The combined Spamhaus list, sbl-xbl.spamhaus.org, is highly effective and has a very low rate of false positives. It is a combination of three lists: sbl.spamhaus.org, cbl.abuseat.org, and opm.blitzed.org.
SORBS
The "safe" SORBS aggregate list, safe.dnsbl.sorbs.net, is very comprehensive and very effective, and generates few false positives. The "unsafe" aggregate list, dnsbl.sorbs.net, will block noticeably more spam, but at a significantly increased risk of false positives. SORBS also provides access to the dozen or so individual lists that make up these aggregate lists.
Spamcop
bl.spamcop.net is a rather aggressive blacklist that catches a lot of spam at the risk of an occasional false positive.

Using all three of these DNSBLs will typically eliminate at least 90% of incoming spam.

[Note: Busy mail servers are advised to limit queries to three or four DNSBLs, and to use ones that are especially reliable.]
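Checking a connecting address against a small, fixed set of lists can be sketched as follows. The zone names are the ones given above; the function name, the max_zones cap, and the injectable resolve parameter are assumptions made for illustration, not part of any particular mail server.

```python
import socket

# Zone names as mentioned in the text above.
ZONES = ("sbl-xbl.spamhaus.org", "safe.dnsbl.sorbs.net", "bl.spamcop.net")


def listing_zones(ip, zones=ZONES, resolve=socket.gethostbyname, max_zones=4):
    """Return the zones that list this IPv4 address.

    At most `max_zones` lists are queried, in the order given.
    """
    reversed_ip = ".".join(reversed(ip.split(".")))
    hits = []
    for zone in zones[:max_zones]:
        try:
            resolve(f"{reversed_ip}.{zone}")
        except socket.gaierror:
            continue  # no answer from this list: not listed there
        hits.append(zone)
    return hits
```

A conservative policy would reject a connection only when a trusted list reports a hit; a more aggressive one would reject on a hit from any list.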

Checksum Filters

Checksum filters work by taking a checksum of certain parts of an email and querying a database for matches. Server-side checksum filters provide a good second line of defense behind DNSBLs, stopping mail from machines that have not yet been identified by DNSBLs. There are two primary categories of checksum filters: Antivirus filters and content filters.

Antivirus Filters

Antivirus filters are particularly useful at stopping fast-spreading viruses that can outpace DNSBLs during large outbreaks. They generally have a very low rate of false positives, since a message that does not contain a known virus will not be blocked. Legitimate messages will only be blocked if the sender has unwittingly managed to get a virus attached to the mail. An antivirus filter integrated into the mail server will be able to block viruses at the SMTP level, making it possible to immediately notify the sender that a message has been rejected. This mitigates the consequences of false positives.

[Note: ClamAV is an open-source (GPL) antivirus filter that can be run from the command line and/or integrated with a mail server.]

Content Checksum Filters

Content checksum filters take checksums of the textual content of email messages and query a database to see if similar messages have been reported by enough people to qualify as spam. Because spam is generally sent to thousands or millions of recipients using the same text, the first few people to get a copy can report it to the checksum database and protect future recipients.
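The report-and-lookup cycle can be sketched as below. This is a toy model, not how DCC, Razor, or Pyzor actually work: it assumes an exact SHA-256 hash over whitespace- and case-normalized text, an in-memory database, and a made-up reporting threshold.

```python
import hashlib
import re
from collections import Counter


def content_checksum(body: str) -> str:
    """Checksum of the message text after crude normalization."""
    normalized = re.sub(r"\s+", " ", body).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class ReportDatabase:
    """In-memory stand-in for a shared checksum database."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold  # reports needed before a text counts as spam
        self.reports = Counter()

    def report(self, body: str) -> None:
        self.reports[content_checksum(body)] += 1

    def is_spam(self, body: str) -> bool:
        return self.reports[content_checksum(body)] >= self.threshold
```

An exact hash like this one is defeated by changing a single word in the message, which is why real systems rely on fuzzier matching.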

Content checksum filters can occasionally generate a false positive if a sufficient number of people report a message as spam when in fact it is not. However, this is relatively rare, and for the most part it does not affect non-bulk mail (i.e. mail that has not been sent to thousands of other people). The primary drawback of content checksum filters is that they can be tricked by varying the contents of email enough, and the algorithms used to determine matches must be constantly evolving to keep the spam detection accurate while keeping the number of false positives low. (The random nonsense words often seen in spam are intended to trick checksum filters.)

[Note: See Distributed Checksum Clearinghouse (DCC), Razor, and Pyzor.]

Statistical Filters

Recent statistical filters (of which Bayesian filters are one example) can be very effective at blocking spam. These filters are an improvement over the previous generation of static filters, which consisted of a predefined set of spam triggers. By contrast, statistical filters learn what an individual user considers spam, and over time evolve to be highly effective personalized spam blockers.
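As a rough illustration of how such a filter learns, here is a minimal naive Bayes sketch. The class name, the tokenizer, and the smoothing constants are assumptions chosen for brevity; production filters use far more careful tokenization and probability combining.

```python
import math
import re
from collections import Counter


class TinyBayesFilter:
    def __init__(self):
        # Number of training messages of each class that contain a word.
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokens(text):
        return set(re.findall(r"[a-z0-9$']+", text.lower()))

    def train(self, text, label):
        """Learn from one message the user has labeled 'spam' or 'ham'."""
        self.message_counts[label] += 1
        self.word_counts[label].update(self.tokens(text))

    def spam_probability(self, text):
        n_spam = self.message_counts["spam"]
        n_ham = self.message_counts["ham"]
        # Start from the prior odds, then add evidence from each word present.
        log_odds = math.log((n_spam + 1) / (n_ham + 1))
        for word in self.tokens(text):
            # Add-one smoothing keeps unseen words from producing zeros.
            p_spam = (self.word_counts["spam"][word] + 1) / (n_spam + 2)
            p_ham = (self.word_counts["ham"][word] + 1) / (n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        log_odds = max(-700.0, min(700.0, log_odds))  # avoid overflow in exp
        return 1 / (1 + math.exp(-log_odds))
```

Because the counts come entirely from messages the user has labeled, two users with different mail will end up with different filters, which is the personalization described above.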

However, like the static filters that preceded them, statistical filters are not foolproof—they will occasionally generate a false positive, especially in the early stages before they've been properly trained. This presents a problem: Because statistical filters are generally run on the client, the sender will have no way of knowing if a message was marked as spam. So unless you periodically look over your spam (which somewhat defeats the purpose of a filter), you will inevitably miss some legitimate email.

[It is possible to run statistical filters on the server that operate at the SMTP level, but since this is a CPU-intensive operation it is not always possible in a mass virtual hosting environment. Furthermore, it requires significantly more effort to configure and train server-based filters since mail clients are not designed to work with them by default.]

[Note: DSPAM is a server-side statistical filter that claims 99.9%+ accuracy with low resource usage. SpamAssassin can also be configured to run on the server.]

The Bad

Email Tax

An email tax is often proposed as a solution to spam; for example, charging a penny per email. The idea is to make the cost small enough so as not to have a significant effect on the average user, but high enough to discourage spammers. Major problems include (1) how such small payments would be accepted, (2) how to account for the great disparity in income levels around the world, and (3) the fact that a vast amount of spam comes from ordinary users' machines that have been infected with viruses and turned into "zombies" for the spammers.

Resource Expenditure

This method would require the sending computer to perform some small computation before a message was accepted. In theory, the computation would be small enough not to burden a non-spammer, but would quickly bog down a machine sending out millions of messages. This is conceptually equivalent to the email tax, and has many of the same problems.

The Ugly

Sender Forgery Prevention

Sender Forgery Prevention aims to eliminate forged email return addresses. The purpose of this is not to directly combat spam, but rather to put an end to "joe jobs", where spammers make it look like email is coming from someone else, resulting in the recipients of spam blaming the wrong party.

SPF is the most mature implementation, and has the backing of AOL, but overall it has received a lukewarm reception due to several issues. A proposal by Microsoft called Sender ID is similar to and compatible with SPF, but is even more problematic because it is patent-encumbered and requires explicit licensing. Finally, Yahoo has proposed a solution called DomainKeys, but because it requires annual purchases from third-party certificate authorities, it is unlikely that DomainKeys will be widely adopted.

The general concept behind Sender Forgery Prevention seems sound, but the current implementations all have significant hurdles to overcome before any of them see widespread adoption. Not the least of these hurdles is going to be getting the larger email providers (AOL, Hotmail, Yahoo) to agree on a single method, as having several overlapping solutions all attempting to solve the same problem will undoubtedly hinder acceptance.
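To make the general concept concrete, here is a heavily simplified sketch of an SPF-style check. It assumes the domain's published record has already been fetched from DNS and is passed in as a string, and it understands only the "ip4:" mechanism; real SPF records can contain other mechanisms and qualifiers that this sketch ignores. The function names are made up for illustration.

```python
import ipaddress


def authorized_networks(spf_record: str):
    """Extract the ip4 networks from an SPF version 1 record."""
    terms = spf_record.split()
    if not terms or terms[0].lower() != "v=spf1":
        raise ValueError("not an SPF version 1 record")
    return [
        ipaddress.ip_network(term[4:], strict=False)
        for term in terms[1:]
        if term.lower().startswith("ip4:")
    ]


def sender_is_authorized(client_ip: str, spf_record: str) -> bool:
    """True if the connecting IP is one the sender's domain has published."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in authorized_networks(spf_record))
```

For example, with the record "v=spf1 ip4:192.0.2.0/24 -all", mail claiming to come from that domain would pass only if it arrived from an address in 192.0.2.0/24.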

Challenge/Response

Challenge/Response works on a whitelist basis: Whenever someone who is not on your whitelist sends you an email, an automated reply is sent back with instructions telling the sender how to get added to your whitelist.
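The basic flow can be sketched as follows. The class and method names are hypothetical, and actually sending the challenge email is left out; the sketch only tracks which messages are delivered and which are held.

```python
import secrets


class ChallengeResponseInbox:
    def __init__(self, whitelist=()):
        self.whitelist = {address.lower() for address in whitelist}
        self.pending = {}    # challenge token -> (sender, held message)
        self.delivered = []  # messages that reached the inbox

    def receive(self, sender, message):
        """Deliver mail from whitelisted senders; hold everything else.

        Returns None when the message is delivered, or the challenge token
        that would be mailed back to the sender.
        """
        sender = sender.lower()
        if sender in self.whitelist:
            self.delivered.append((sender, message))
            return None
        token = secrets.token_urlsafe(16)
        self.pending[token] = (sender, message)
        return token

    def confirm(self, token):
        """Called when a sender answers a challenge."""
        held = self.pending.pop(token, None)
        if held is None:
            return False
        sender, message = held
        self.whitelist.add(sender)
        self.delivered.append((sender, message))
        return True
```

Note that a held message is released only if the address it claims to come from can actually receive and answer the challenge.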

This will obviously block all spam that has been sent using a forged return address, but not all spam has a forged return address. If this ever became popular, spammers would begin writing software to automate the process of getting on the whitelist. Numerous workarounds for this have been proposed, but none are without significant problems. [For example, a visual challenge is often proposed because computer visual processing is not up to human levels. But this immediately shuts out blind users...]

Challenge/response also fails in situations where you need to receive mail from an entity that cannot (or will not bother to) reply to a challenge. Consider subscribing to a mailing list: You send off a subscription request, then the list sends back a verification request. But since the list's address is not on your whitelist, you send it a challenge to which it never replies, and you never get subscribed.

Or consider the following situation: You fill out a form on the web asking for some information to be mailed to you. Since you don't know in advance the email address that the sender will be using, you can't add the address to your whitelist. But if the sender does not feel like taking the time to jump through the hoops required to send a mail to you, he may just decide not to bother sending you the information after all.

Solutions to these problems usually involve having temporary email addresses that are configured to accept mail without issuing a challenge. For example, if your primary email address is "joe@job.com", and you have it configured to always issue a challenge to non-whitelisted parties, then you would subscribe to a mailing list using the address "joe-somelist@job.com", which was configured to never issue a challenge. This way, if "joe-somelist@job.com" started getting spammed it could simply be dumped, and you could re-subscribe using a fresh address. Variations on this technique, such as addresses with a timeout period, may be used as the situation warrants. See the TMDA client configuration document for examples.
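An address with a timeout period could be built along these lines. The address layout, the secret, and the function names below are hypothetical and are not taken from TMDA: the expiry time is embedded in the local part and protected with an HMAC so that it cannot be altered by the sender.

```python
import hashlib
import hmac

SECRET = b"change-me"  # hypothetical per-user secret key


def _mac(user, expires):
    digest = hmac.new(SECRET, f"{user}:{expires}".encode(), hashlib.sha256)
    return digest.hexdigest()[:12]


def dated_address(user, domain, now, lifetime_seconds):
    """Create an address that bypasses the challenge until it expires."""
    expires = int(now) + lifetime_seconds
    return f"{user}-dated-{expires}.{_mac(user, expires)}@{domain}"


def accepts_without_challenge(address, now):
    """True if the address is a valid, unexpired dated address."""
    local, _, _domain = address.partition("@")
    user, sep, tag = local.partition("-dated-")
    expires_text, dot, mac = tag.partition(".")
    if not (sep and dot and expires_text.isdigit()):
        return False  # an ordinary address: the normal challenge applies
    expires = int(expires_text)
    expected = _mac(user, expires)
    return hmac.compare_digest(mac.encode(), expected.encode()) and now <= expires
```

Mail to the plain address would still go through the usual challenge; only a correctly signed, unexpired dated address skips it.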

The overall problem with this method is that it requires a lot of work on the part of the users. Senders going through the verification process, receivers maintaining a whitelist, setting up temporary addresses... It's quite a hassle.

[Sample software: TMDA]

Greylisting

Greylisting is the practice of rejecting mail from unknown locations with a "try again later" message, then accepting delivery on the second attempt. The theory is that the software spammers are using will not retry delivery, but legitimate MTAs will. In practice, it turns out that a significant number of legitimate MTAs will not retry delivery (resulting in lost mail), and the delivery delays are often frustrating when people have come to expect email transmission to take a matter of seconds. Furthermore, it seems that if greylisting ever became popular, spammers would simply update their software to automate the second attempt.
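The mechanism can be sketched as below. The sketch makes two assumptions that go beyond the description above: an "unknown location" is identified by the combination of client IP, sender, and recipient, and a retry is accepted only after a minimum delay. The class name and reply strings are illustrative.

```python
class Greylist:
    def __init__(self, min_delay=300):
        self.min_delay = min_delay  # seconds a sender must wait before retrying
        self.first_seen = {}        # (ip, sender, recipient) -> time of first attempt

    def check(self, client_ip, sender, recipient, now):
        """Return an SMTP-style reply for this delivery attempt."""
        key = (client_ip, sender.lower(), recipient.lower())
        first = self.first_seen.setdefault(key, now)
        if now - first < self.min_delay:
            return "451 try again later"  # temporary failure: a real MTA retries
        return "250 ok"
```

The first attempt from any new combination always receives the temporary failure, so every message from a previously unseen sender is delayed by at least the sending server's retry interval.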

Port Blocking

Port blocking differs from the previously mentioned methods of spam control in that the blocking is performed at the sender's end. It requires the sender's ISP to block outbound connections to port 25, thus cutting off a potential spammer's access to open relays. Port blocking is especially common among dial-up ISPs, since it is easy and cheap for spammers to jump from one "throw-away" account to another.

A downside to ISPs blocking outbound port 25 is that it also blocks legitimate connections to remote servers. For example, suppose you had a personal domain with a hosting company, and you wanted to use that hosting company's mail servers (running on port 25) instead of the ISP's mail servers. If the ISP is blocking outbound port 25, this will not be possible. [Note: A good hosting provider will offer authenticated SMTP on a port other than 25, which gets around this issue.]

Another (possibly more problematic) issue is that of making remote port 25 diagnostics difficult. Consider the case of an administrator working from home on a remote server: If the ISP is blocking outbound connections to port 25, it would not be possible to telnet to that port to issue manual commands for debugging purposes. This could make remote mail server maintenance difficult or impossible in some cases.


Links

Software

The following is a list of open-source software packages that are in some way related to the fighting of spam. There is no particular order to this list.

Reading