Please do not use the data here!
They are no longer current and have been deprecated!
Instead see the SURBL site.


Judging e-mail messages containing spam-referenced web sites

Introduction

This is an approach to blocking, rejecting or tagging spams based on the web sites they reference. Most spams direct readers to some relatively common web sites. Many of those sites get reported to SpamCop and other services. SpamCop compiles a database of these "spamvertised" (spam-advertised) sites. That data is grabbed periodically and served up to a SpamAssassin plugin which Eric Kolve has developed. Initial tests appear successful: previously reported spam-referenced URIs in wild incoming messages are extracted and matched, and the messages are tagged. Other uses of the data at other layers in the mail delivery process are possible, such as in message body-aware MTAs like Postfix, or in user or server mail filters like procmail and others.

This approach of filtering mail based on URIs contained in the message body is in contrast to conventional Realtime Black Lists (RBLs), which block based on the IP address or name of mail servers used to send spam. A shortcoming of conventional RBLs is that the spam sources frequently jump to addresses all over the world and are therefore difficult to stop initially. Each time a new IP address is exploited, it takes some time for the RBLs to get updated to catch it. Spammers jump to many different addresses to take advantage of this lag, since they can send out thousands of spams through them before the RBLs update. Spammers have even developed virus-like trojan horses to send just a few messages from hijacked computers before ceasing operation. This type of "bursty" spam broadcast is very difficult to stop with a conventional RBL because it is highly distributed and ephemeral. The web sites advertised in spam, however, are necessarily relatively stable, so blocking based on URIs is a logical alternative. It may be the only remaining alternative until we all go to signed mail.

SpamCopURI SpamAssassin plug-in to tag messages with reported spam sites

Eric Kolve has written a SpamAssassin (SA) 2.63 plug-in called SpamCopURI, which tests message bodies to see if they contain domains that have been reported in Spamvertised sites to SpamCop.
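The basic idea of testing a message body for reported domains can be illustrated with a minimal sketch. This is not Eric's plug-in code (SpamCopURI is a SpamAssassin plug-in, not a Python program); the URI pattern and the function names below are hypothetical, and real spam uses many obfuscations that this simple pattern does not handle.

```python
import re
from urllib.parse import urlsplit

# Hypothetical pattern: http/https URIs up to the next whitespace, quote or
# angle bracket. An assumption for illustration, not the plug-in's own rule.
URI_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def extract_hosts(body):
    """Return the set of lower-cased host names found in URIs in the body."""
    hosts = set()
    for uri in URI_PATTERN.findall(body):
        host = urlsplit(uri).hostname  # hostname is returned lower-cased
        if host:
            hosts.add(host)
    return hosts


def contains_reported_host(body, reported):
    """True if any host named in the body is in the set of reported hosts."""
    return not extract_hosts(body).isdisjoint(reported)
```

A caller would pass the message body text and a set of reported host names obtained from whichever data source is in use.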

Initial versions of Eric's SA plug-in used the CGI-based web search feature described below to access our version of the SpamCop URI data. Later versions used our web directory tree of text data files instead. Subsequent intermediate revisions of SpamCopURI directly queried and cached the SpamCop Spamvertised site data, bypassing the data here. Eric has updated SpamCopURI to get the SpamCop URI data through a special RBL we created called SURBL, as an alternative to using its own local database. SURBL is summarized next. All the deprecated earlier methods are described below for historical reasons, and also because they can offer some other potentially useful glimpses into the data.

SURBL -- Spam URI Realtime Blocklist

Another use of the data here is the SURBL "Spam URI Realtime Blocklist". This unconventional RBL can be used to block spams based on reported spam message body domains. As with SpamCopURI, using SURBL through the urirhsbl command of the SpamAssassin 3.0 URIDNSBL plugin results in a very effective spam stopper. Spam detection rates of 40-60% are reported with near zero false positives. To reiterate, SURBL is not a conventional RBL. It must be used with a special SpamAssassin plugin or modified MTA or mail filter code in order to block spams based on domains found in their message bodies. Unlike conventional RBLs, SURBL does not block spams based on the source of the spam.
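For readers unfamiliar with name-based DNS lists, here is a rough sketch of how a client might check a message body domain. It assumes the usual convention for such lists: the domain is prepended to the list's zone name, and the name resolves only if the domain is listed. The zone name is a parameter because this page does not state it; "surbl.example" in the usage note is a placeholder, not the real zone. See the SURBL site for actual usage.

```python
import socket


def surbl_query_name(domain, zone):
    """Build the DNS name to look up: <domain>.<zone>, lower-cased."""
    return f"{domain.strip('.').lower()}.{zone.strip('.')}"


def is_listed(domain, zone):
    """Return True if the query name resolves, False if the lookup fails.

    Assumption: any successful resolution means "listed". Note that a DNS
    outage also makes the lookup fail, so False does not prove "not listed".
    """
    try:
        socket.gethostbyname(surbl_query_name(domain, zone))
        return True
    except OSError:
        return False
```

For example, surbl_query_name("Example.COM", "surbl.example") produces "example.com.surbl.example". Because the lookup is ordinary DNS, caching and expiration are handled by the resolver rather than by the client code.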

Rejecting spams in an MTA

Integrating this approach into an MTA that can see message bodies, like Postfix, would allow it to block messages containing spam sites and discard them or reject them back to the sending server. Milter development for using SURBL with sendmail is proceeding. One downside typically cited for this technique is that message bodies would then need to be processed on the server, thus incurring additional processing cost. But on many mail servers the MTA is followed by some additional mandatory anti-spam (header) processing, for example with SpamAssassin, so MTA integration doesn't necessarily add server load so much as shift it closer to the mail and spam source. Blocking at the MTA would then avoid the need for post-MTA processing by SA, etc. The upside is that if this approach could be made effective, for example through widespread use, then spam could decrease as it loses utility to spammers, which in turn would reduce the processing load. In other words, truly successful, widespread blocking of spam could reduce the incentive for spam in the first place since it would decrease its effectiveness. Less spam would mean less processing load to handle it.

On the other hand, in many installations every incoming message is touched by SpamAssassin, which also incurs a large server load. Using SURBL or similar spam body domain data inside an MTA may turn out to be less processor resource intensive than using SpamAssassin. This approach probably merits research and testing. Some strongly positive results using the SA plugins above suggest that MTA integration could likewise be highly successful in stopping spam.

Spam site database service

We have set up a simple flat text database with data periodically grabbed from SpamCop's Spamvertised Web Site page. In an ideal world we would convince them of the utility of this approach with our early efforts and they would give us more direct access to their data. In the meantime, we have this method of grabbing and using the data to prove the concept. Currently 24 hours' worth of data is stored, made unique by minute and URI. More details are available in the Notes section.
You can search for a URI (URL) or fragment with a query of the form:
http://spamcheck.freeapp.net/search-uri.cgi?your_search_term_here
You can search for a Fully Qualified Domain Name (FQDN) or fragment with a query of the form:
http://spamcheck.freeapp.net/search-fqdn.cgi?your_search_term_here
Searches are case insensitive and can match on partial strings. Expressions (e.g. wildcards) are not accepted in search terms, only fixed strings. A null (empty) result means no matches.
You can get a count of the number of times a pattern occurs in the URI (URL) database with a query of the form:
http://spamcheck.freeapp.net/count-uri.cgi?your_search_term_here
You can get a count of the number of times a pattern occurs in the Fully Qualified Domain Name (FQDN) database with a query of the form:
http://spamcheck.freeapp.net/count-fqdn.cgi?your_search_term_here
Regarding the counts, please note the caveats below about the source data. Actual matches from the source data will be under-counted due to the merge operation, which has one minute resolution to get around clock jitter. Quite a few matches that occur within the same minute will be discarded as a result. However at least one instance of every unique site will be included. Therefore any non-zero answer means the site is at least somewhat spammy according to the source reports. And larger numbers still indicate more reports, at least those recorded in different minutes.
You can get a percentage of the number of times a pattern occurs in the URI (URL) database with a query of the form:
http://spamcheck.freeapp.net/percent-uri.cgi?your_search_term_here
You can get a percentage of the number of times a pattern occurs in the Fully Qualified Domain Name (FQDN) database with a query of the form:
http://spamcheck.freeapp.net/percent-fqdn.cgi?your_search_term_here
Since the percentages are based on the counts above, they are similarly imperfect and similarly still somewhat useful as relative indicators of spamminess. The denominator is a dynamic count of the database size. Percentages are from 0 to 100 with 3 digits of decimal precision. Non-zero percentages that fall below the precision are problematic since they would appear to be zero, when in fact they are not. Aside from that remote possibility, if there are no matches the value returned by the web interface is 0.
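The search, count and percentage semantics described above can be sketched over an in-memory list of records. This is an illustration only, not the CGI source; the function names are hypothetical, and it assumes the denominator is simply the number of records.

```python
def search(records, term):
    """Case-insensitive fixed-string match; no wildcards or expressions."""
    term = term.lower()
    return [r for r in records if term in r.lower()]


def count(records, term):
    """Number of records containing the search term."""
    return len(search(records, term))


def percent(records, term):
    """Matches as a percentage of all records, to 3 decimal places."""
    if not records:
        return 0.0
    return round(100.0 * count(records, term) / len(records), 3)
```

The precision caveat is easy to reproduce: one match in a database of a million records is 0.0001 percent, which rounds to 0.0 at three decimal places even though the count is 1. A client that needs a strict yes or no answer is therefore better served by the count or search queries than by the percentage.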

Viewing the spam site database

See all of the most recent URIs.

See all of the most recent FQDNs.

Web directory tree of spam URI data

Since we felt that CGI access could be a performance bottleneck, we decided that later versions should use a directory tree of dynamically-updated static text files. The data would be in static files in a directory tree that would get new data added to them when it came in. There would be an expiration and pruning mechanism to purge old data. Eric's original design suggestion looked like this:
The way I think we can make this *really* efficient is if we go to flat text files and invert the domain. So what you would do is create a directory structure that looks something like the following:
  spamcheck.freeapp.net/
                       + com/index.html
                            + webuymed/index.html
                            + foobar/index.html
                                    + blah/index.html
You may not want to do the subdomains, but basically the idea is that you pre-generate all the results and push the matching to the user side. This may be a little heavier on the bandwidth side if we grab say all the *.com urls, but the side benefit is that we can ask that people request this service via caching proxy servers which will have no problem caching this data since it's no longer in the query string.

To query for say 'http://www.preempt.biz/wicdhvidcisdwbx/frx.html', I would just grab:

  http://spamcheck.freeapp.net/biz/preempt/www/
This would return all the urls for www.preempt.biz. If I just requested http://spamcheck.freeapp.net/biz/preempt, I would get all the *.preempt.biz spam urls.

This shouldn't be too expensive to generate, though you will need to have something prune the directories, which you could probably do by the modification time of the index.html file.

You would end up with a few hundred directories, but this would be very efficient and should scale to however much bandwidth you have to support this.

Well this is now implemented, and the live data can be browsed in the "domains" directory tree. We put the data into plain text index.txt files containing all the records of the current level and its children (branches and leaves below) with the exception that the top level domain or IP address has no such summary. Summaries start at the second level and go to three levels for domain names and all four levels for IP address URIs. In principle we can create summaries for any arbitrary levels, though we felt that the current arrangement was probably optimally useful for our purposes.
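The inverted-domain lookup in Eric's suggestion can be sketched as a small helper that turns a URI into the corresponding tree location. The function name is hypothetical and the base URL follows Eric's example rather than the actual location of the "domains" tree, which may differ. IP address URIs are not treated specially here.

```python
from urllib.parse import urlsplit

# Base taken from the example in the design suggestion; the live tree's
# actual prefix is an assumption.
BASE = "http://spamcheck.freeapp.net/"


def tree_url(uri, base=BASE):
    """Map a URI to its directory in the tree by reversing the host labels."""
    host = urlsplit(uri).hostname
    if not host:
        raise ValueError(f"no host name in URI: {uri!r}")
    labels = [label for label in host.split(".") if label]
    return base + "/".join(reversed(labels)) + "/"
```

With this helper, tree_url('http://www.preempt.biz/wicdhvidcisdwbx/frx.html') yields http://spamcheck.freeapp.net/biz/preempt/www/, matching the example above. Since the path and query of the original URI are dropped, the resulting URL is the same for every URI on that host, which is what makes it friendly to caching proxies.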

Like the history files that the CGIs use for data, the data in the domains directory tree are in flat text file databases, with one record per line, and with a timestamp and URI in each record separated by a tab, as is long-standing UN*X custom for simple applications. Specifics of the timestamp format are in the Notes section.
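A client reading these files might parse each record along the following lines. This is a sketch that assumes the timestamp format and GMT convention given in the Notes section; the function name is hypothetical.

```python
from datetime import datetime, timezone


def parse_record(line):
    """Split one 'YYYY-MM-DD HH:MM<TAB>URI' record into (datetime, uri)."""
    stamp, uri = line.rstrip("\n").split("\t", 1)
    # Timestamps are GMT (UTC), so attach the UTC zone explicitly.
    when = datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)
    return when, uri
```

A malformed line (no tab, or a timestamp in another format) raises ValueError, which a client would probably want to catch and skip.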

Rather than using index.html or enabling the DirectoryIndex web server directive for the index.txt files, we left everything visible as plain directories and files. The automated lookups from Eric's SA plug-in or other access methods don't need this visibility, but it can be handy for human browsing of the URI data, as you may have seen if you followed the link above.

All data in the directory tree are currently expired after 4 days, including records within each individual file and the tree structure itself. Other applications using this data may be designed to make finer-grained use of the timestamps to increase their time-relevance as necessary or desired. Eric observed that four days' worth of data seemed to adequately capture most of the relevant spam URIs. We may tune these time values later.
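A client that wants its own, possibly shorter, expiration window could filter records by timestamp. The sketch below assumes records have already been parsed into (datetime, URI) pairs with UTC timestamps; the names are hypothetical and this is not the service's own pruning code.

```python
from datetime import datetime, timedelta, timezone

# Four days matches the expiration described above; clients may pick less.
MAX_AGE = timedelta(days=4)


def prune(records, now, max_age=MAX_AGE):
    """Keep only (when, uri) records no older than max_age relative to now."""
    cutoff = now - max_age
    return [(when, uri) for when, uri in records if when >= cutoff]
```

Passing the current time in as an argument, rather than reading the clock inside the function, keeps the behavior reproducible when testing.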

Both tree and CGI versions of the data exist for now, though we may deprecate the CGIs later in favor of the much less server-CPU-intensive tree. We could also get rid of or disable the CGIs but keep the history database files for client applications to make use of on their own.

Downloading the code for the spam site data service

You can get a gzipped tar of the code for this spam data service as spam-uri-data-service-1.30.tgz. Here's the README which includes installation instructions and a description of the programs and files within.

Comments, open issues and future directions

  1. Whitelisting: The MTA or mail processing program probably should make use of a whitelist to prevent the blocking of messages that exclusively mention legitimate sites. All that should be necessary is a list of legitimate, non-spam-friendly domain names to match message-contained URIs against. Whitelists in general are pretty widely discussed and used in anti-spam efforts. Their use could be extended here. (Update: Eric's SA plug-in supports whitelisting and blacklisting of message body URIs.) (Additional update: as discussed on the SURBL site, since legitimate domains seldom get into SURBL, the now preferred method of accessing this data, whitelisting on the SURBL-using client side becomes far less necessary.)

  2. Scaling: The current proof of concept design results in pretty intensive web server hits. This probably would not scale well in wide use. (Pretty much all of the scaling issues have been addressed well by moving to an RBL, which is DNS based. The following concerns are made moot, but are included to show some of the ideas we went through while trying to re-invent the wheel in avoiding the use of an RBL. :-) Some possible solutions include:

    1. Caching of lookups on client site. Caching could help performance significantly since spam sites seem repeated often (until the randomizers catch on). (Update: Eric's earlier SA plug-in made use of client-side caching.) (Additional update: since Eric is now updating SpamCopURI to use SURBL, all the caching, distribution and expiration is efficiently handled by DNS. Leveraging of DNS mechanisms for this purpose is a strong advantage of RBLs.)

    2. Setting up local servers closer to each logical client network (Mooted by DNS use of SURBL. Data servers are now as near as the name server and local cache.)

    3. Volunteer server farm (Mooted by DNS use of SURBL. Data servers are now as near as the name server and local cache.)

    4. Use of a more efficient protocol than http. (On the other hand, use of the web protocol enables the leveraging of much already-existing content clustering and mirroring technology.) (Mooted by DNS use of SURBL. DNS is a very efficient distributed database.)

    5. Distributing the entire database to every client with periodic incremental updates (Mooted by DNS use of SURBL, which of course distributes the database in the form of a zone file.)

  3. This is a useful start to show what this approach can do. Even with the limitations, it clearly can be rather effective. The approach could be developed further from here. (And it has been.)
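The whitelisting idea in item 1 can be sketched as a check applied to extracted host names before any lookup. This is an illustration with hypothetical names, not the logic of Eric's plug-in; it assumes a whitelist of lower-case domain names and treats subdomains of a whitelisted domain as whitelisted too.

```python
def is_whitelisted(host, whitelist):
    """True if host equals a whitelisted domain or is a subdomain of one."""
    host = host.lower().rstrip(".")
    return any(host == domain or host.endswith("." + domain)
               for domain in whitelist)


def hosts_to_check(hosts, whitelist):
    """Drop whitelisted hosts so only the remainder is looked up."""
    return {host for host in hosts if not is_whitelisted(host, whitelist)}
```

Matching on a leading dot avoids treating a name like badexample.org as a subdomain of example.org.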

Notes

  1. This is an Experimental Service. It may be subject to delay, unavailability and errors. Use at your own risk. We will, however, make a reasonable, voluntary effort to keep it running and accurate. We also intend to publish all code openly for review and further development. If the concept looks promising we may scale it up with multiple servers, mirroring, round robin DNS load balancing, geographic clusters, etc.

  2. You may of course install the code for this server and run it locally. Some centralization and mirroring of server content may be useful in order to cut down on the load on SpamCop's servers. An alternate approach would be to serve up the entire database to clients and push the processing of the database onto the clients. That would probably maximally distribute the load, but it makes the client requirements more complex.

  3. Timestamps are GMT (UTC), not local time.

  4. Timestamps are in the format YYYY-MM-DD HH:MM, where YYYY is the four digit year, the first MM is the two digit month, DD is the two digit day, HH is the two digit hour (24 hour clock), and the second MM is the two digit minute. Other formats are probably possible, provided the components appear with all digits and in this order, so that times can be easily sorted numerically.

  5. Seconds are deliberately not used because the system timing accuracy may not be fine-grained enough to support their use.

  6. Results are sorted and uniqued based on each entry's URI or FQDN and its one minute resolution timestamp. Because of the uniquing operation, duplicate entries occurring within the same minute are not shown. This means the data cannot be used as a count of reports, only a confirmation that a particular pattern was reported.

  7. However every unique FQDN or URI will be present, so the data is still useful as a yes or no confirmation of prior reporting.

  8. Timestamp and URI or FQDN are separated by a tab, which is the default delimiter of the *NIX paste command.

  9. Sort order is most recent first.

  10. Even with the somewhat coarse one minute timestamp resolution, some jitter may be apparent in the source data, causing the calculated timestamps to occasionally wander around with slightly inconsistent results. For a general URL detection use, this jitter in the data is probably not a significant issue.

  11. The FQDN version of the data may be better for matching purposes since some of the deliberate entropy which is sometimes added is ignored, such as keywords, subdirectories, etc. But the host portion, usernames, etc. are still potentially problematic.

  12. Searches, counts and percentages of URIs and FQDNs won't match exactly since the URIs tend to be more unique. Being more generalized views of related source data means the FQDNs may be more generally useful. I would therefore probably recommend using the FQDN percentages for spam-ranking systems.

  13. Since the source data is dynamic, results will change over time. The same query may not return the same result a few minutes later.

  14. Additional algorithms can and probably should be applied to remove more of the variations that can be deliberately added to domain names while still producing a working URL. It would be nice to find a way to remove all but the registered domain name, but that seems a non-trivial task, particularly given subdomains, TLDs that are more than one dot-separated word, geographic variation in TLDs, etc. Techniques for extracting registered domains are probably extant. (Note that this is not saying "how do you look up a domain at a registrar." The problem is extracting the actual domain name from all the muck. Humans do it trivially easily. Machines less trivially so. Yet I'm sure it can be done.)

  15. Source for the csh shell script that grabs and processes the data is shown here. Being a shell script calling mostly compiled C programs, it should be a fairly efficient system, though I make no specific claims to that end. Improvement is often possible.

  16. Source data is folded to lower case immediately after it's gotten. This is due to the limitation that the sed here cannot seem to do case insensitive searches. This could affect homemade client side pattern matching, which should be made case insensitive.

  17. We are re-writing this code to work in a faster, more streamlined way, but we will keep this site up as a reminder of an older, but functional system.
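The sort and unique behavior in notes 6 through 9 can be illustrated with a short sketch. The actual processing is done by a csh script; this Python version, with a hypothetical function name, only mirrors the described semantics. It relies on the timestamp format in note 4 sorting correctly as plain text.

```python
def merge_records(records):
    """Unique (timestamp, uri) pairs, sorted most recent first.

    records is an iterable of (timestamp_string, uri) tuples, with
    timestamps in 'YYYY-MM-DD HH:MM' form.
    """
    unique = set(records)  # identical reports within one minute collapse
    return sorted(unique, reverse=True)
```

Two reports of the same URI within the same minute collapse into one record, which is why the data under-counts reports, while reports of that URI in different minutes are all kept.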

Version 1.11 by Jeff Chan on 4/6/04