Instead see the
SURBL site.
Judging e-mail messages containing spam-referenced web sites
Introduction
This is an approach to block, reject or tag
spams based on web sites they contain.
Most spams direct readers to some relatively common web sites.
Many of those sites get reported to
SpamCop
and other services.
SpamCop creates a database of these "Spamvertised" spam-advertised sites.
That data is grabbed periodically and served up to a
SpamAssassin plugin which Eric Kolve has developed.
Initial tests appear successful.
In incoming messages from the wild, previously reported
spam-referenced URIs are
successfully extracted and matched, and the messages are tagged.
Other
uses of the data at other layers in the mail delivery
process are possible, such as in message body-aware MTAs like Postfix
or user or server mail filters like procmail and others.
This approach of filtering mail based on URIs contained in the message body
is in contrast to conventional Realtime Black Lists (RBLs) which block based on
the IP address or name of mail servers used to send spam.
A shortcoming of conventional RBLs is that the spam sources frequently
jump to addresses all over the world
and are therefore difficult to stop initially.
Each time a new IP address is exploited,
it takes some time for the RBLs to get updated to catch them.
Spammers jump to many different addresses to take advantage
of this lag, since they can send out thousands of spams through them
before the RBLs update.
Spammers have even developed virus-like trojan horses to send just a few
messages from hijacked computers before ceasing operation.
This type of "bursty" spam broadcast is very difficult to stop with
a conventional RBL because it is highly distributed and ephemeral.
The web sites advertised in spam, however, are necessarily relatively stable,
so blocking based on URIs is a logical alternative.
It may be the only remaining alternative until we all
go to signed mail.
SpamCopURI SpamAssassin plug-in to tag messages with reported spam sites
Eric Kolve has written a
SpamAssassin
(SA) 2.63 plug-in called
SpamCopURI,
which tests message
bodies to see if they contain domains that have been reported
in Spamvertised sites to SpamCop.
Initial versions of Eric's SA plug-in used the CGI-based web search feature
described below to access our version of the SpamCop URI data.
Later versions used our web directory tree of text data files instead.
Subsequent intermediate revisions of SpamCopURI directly queried and cached
the SpamCop Spamvertised site data, bypassing the data here.
Eric has updated SpamCopURI to get the SpamCop URI data through a special RBL
we created called SURBL,
as an alternative to using its own local database.
SURBL is summarized next.
All the deprecated earlier methods are described below for historical reasons,
and also because they can offer some other
potentially useful glimpses into the data.
SURBL -- Spam URI Realtime Blocklist
Another use of the data here is the
SURBL
"Spam URI Realtime Blocklist".
This unconventional RBL can be used to block spams based on reported
spam message body domains.
As with SpamCopURI, using the SpamAssassin 3.0 plugin
URIDNSBL
and its urirhsbl command
with SURBL results in a very effective spam stopper.
Spam detection rates of 40-60% are reported with near zero false positives.
To reiterate, SURBL is not a conventional RBL. It must be used with
a special SpamAssassin plugin or modified MTA or mail filter code
in order to block spams based on domains found in their message bodies.
Unlike conventional RBLs,
SURBL does not block spams based on the source of the spam.
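As a rough illustration of how a client might use a URI blocklist served over DNS, here is a minimal Python sketch. It follows the usual DNS blocklist convention: the name to test is prepended to the list's zone, and an answer in 127.0.0.0/8 is read as "listed". The zone name `uribl.example`, the regular expression and all function names are assumptions made for this example; they are not taken from SpamCopURI or URIDNSBL, and reducing each host to its registered domain is left out.

```python
import re
import socket

# Hypothetical zone name, for illustration only; the real zone to query
# is documented on the SURBL site.
ZONE = "uribl.example"

# Deliberately simple pattern: captures the host part of http(s) URIs.
# It does not handle user@host forms or other obfuscations.
HOST_RE = re.compile(r"https?://([^/\s:?#<>]+)", re.IGNORECASE)


def hosts_in_body(body):
    """Return the sorted, lowercased hosts of all http(s) URIs in the text."""
    return sorted({m.group(1).lower() for m in HOST_RE.finditer(body)})


def query_name(domain, zone=ZONE):
    """Build the DNS name to look up for a message-body domain."""
    return f"{domain.rstrip('.')}.{zone}"


def is_listed(domain, zone=ZONE, resolve=socket.gethostbyname):
    """True if the lookup answers with an address in 127.0.0.0/8."""
    try:
        answer = resolve(query_name(domain, zone))
    except OSError:
        # Lookup failure (for example NXDOMAIN) is treated as "not listed".
        return False
    return answer.startswith("127.")
```

A filter built this way would call `hosts_in_body()` on the message text and then `is_listed()` on each result. The `resolve` parameter exists only so the lookup can be replaced in tests.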
Rejecting spams in an MTA
Integration of this approach into a message-body visible MTA like
Postfix would allow it to block spam-site-containing messages
and discard or reject them back to the sending server.
Milter development for using SURBL with sendmail is proceeding.
One downside typically cited for this technique is
that message bodies would then need to be processed on the server,
thus incurring additional processing cost.
But on many mail servers the MTA is followed by some additional
mandatory anti-spam (header) processing, for example with SpamAssassin,
so MTA integration doesn't necessarily add server load so much
as shift it closer to the mail and spam source.
Blocking at the MTA would therefore avoid
the need for post-MTA processing by SA, etc.
The upside is that if this approach could be made effective,
for example through widespread use,
then spam could decrease due to some loss of utility to spammers
with this in turn reducing the processing load.
In other words truly successful, widespread blocking of spam could
reduce the incentive for spam in the first place
since it would decrease its effectiveness.
Less spam would mean less processing load to handle it.
On the other hand,
in many installations every incoming message is touched by SpamAssassin,
which also incurs a large server load.
Using SURBL or similar spam body domain data
inside an MTA may turn out to be less processor resource
intensive than using SpamAssassin.
This approach probably merits research and testing.
Some strongly positive results using the SA plugins above suggest
that MTA integration could likewise be highly successful in stopping spam.
Spam site database service
We have set up a simple flat text database with data periodically grabbed from
SpamCop's
Spamvertised
Web Site page.
In an ideal world we would convince them of the utility of this
approach with our early efforts and they would give us more
direct access to their data.
In the meantime, we have this method of grabbing and using the data
to prove the concept.
Currently 24 hours' worth of data, unique by minute and URI, is stored.
More details are available in the Notes section.
- You can search for a URI (URL) or fragment
with a query of the form:
-
http://spamcheck.freeapp.net/search-uri.cgi?your_search_term_here
- You can search for a Fully Qualified Domain Name (FQDN) or fragment
with a query of the form:
-
http://spamcheck.freeapp.net/search-fqdn.cgi?your_search_term_here
Searches are case insensitive and can match on partial strings.
Expressions (e.g. wildcards) are not accepted in search terms,
only fixed strings.
A null (empty) result means no matches.
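For illustration, the search query form above can be assembled in a few lines of Python. This is only a sketch: the helper names are invented here, and percent-encoding the search term is an assumption, since the page does not say how the CGIs treat special characters. The second helper mimics the documented matching rules (case insensitive, fixed substring) on a local list of lines.

```python
from urllib.parse import quote

BASE = "http://spamcheck.freeapp.net"


def search_url(field, term):
    """Build a search query URL; field is 'uri' or 'fqdn'."""
    if field not in ("uri", "fqdn"):
        raise ValueError("field must be 'uri' or 'fqdn'")
    # Assumption: the term is percent-encoded before being sent.
    return f"{BASE}/search-{field}.cgi?{quote(term, safe='')}"


def local_matches(term, lines):
    """Case-insensitive fixed-string match, as described for the searches."""
    needle = term.lower()
    return [line for line in lines if needle in line.lower()]
```

An empty list from `local_matches()` corresponds to the null result described above.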
- You can get a count of the number of times a pattern occurs in the
URI (URL) database
with a query of the form:
-
http://spamcheck.freeapp.net/count-uri.cgi?your_search_term_here
- You can get a count of the number of times a pattern occurs in the
Fully Qualified Domain Name (FQDN) database
with a query of the form:
-
http://spamcheck.freeapp.net/count-fqdn.cgi?your_search_term_here
Regarding the counts, please note the caveats below about the source data.
Actual matches from the source data will be under-counted due to the
merge operation, which has one minute resolution to get around clock jitter.
Quite a few matches that occur within the same minute will be discarded
as a result.
However at least one instance of every unique site will be included.
Therefore any non-zero answer means the site is at least somewhat
spammy according to the source reports.
And larger numbers still indicate more reports, at least those
recorded in different minutes.
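The under-counting can be made concrete with a small Python sketch. It assumes each report is a (minute timestamp, URI) pair and models the merge as a plain uniquing step; the function names and sample data are invented for this example and are not the service's actual code.

```python
def merge(reports):
    """Unique the (minute, uri) pairs and sort them most recent first."""
    return sorted(set(reports), reverse=True)


def count(pattern, records):
    """Count records whose URI contains the pattern, ignoring case."""
    needle = pattern.lower()
    return sum(1 for _, uri in records if needle in uri.lower())


# Three reports of the same URI in one minute, then one more a minute later.
raw = [
    ("2004-04-06 10:15", "http://a.example/x"),
    ("2004-04-06 10:15", "http://a.example/x"),
    ("2004-04-06 10:15", "http://a.example/x"),
    ("2004-04-06 10:16", "http://a.example/x"),
    ("2004-04-06 10:16", "http://b.example/"),
]
merged = merge(raw)
```

Here four raw reports of `a.example` collapse to two merged records, so the count is low but still non-zero, and every unique site survives.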
- You can get a percentage of the number of times a pattern occurs in the
URI (URL) database
with a query of the form:
-
http://spamcheck.freeapp.net/percent-uri.cgi?your_search_term_here
- You can get a percentage of the number of times a pattern occurs in the
Fully Qualified Domain Name (FQDN) database
with a query of the form:
-
http://spamcheck.freeapp.net/percent-fqdn.cgi?your_search_term_here
Since the percentages are based on the counts above, they
are similarly imperfect and similarly still somewhat useful as
relative indicators of spamminess.
The denominator is a dynamic count of the database size.
Percentages are from 0 to 100 with 3 digits of decimal precision.
Non-zero percentages that fall below the precision are problematic
since they would appear to be zero, when in fact they are not.
Aside from that remote possibility, if there are no matches the
value returned by the web interface is 0.
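The precision caveat is easy to demonstrate. The Python sketch below assumes the percentage is simply 100 times the match count divided by the database size, printed with three decimal places; the exact output format of the CGIs is not documented here, so the formatting and the function names are assumptions.

```python
def percent(matches, total):
    """Percentage of records matching, formatted to three decimal places."""
    if total == 0:
        return "0.000"
    return f"{100 * matches / total:.3f}"


def hidden_by_precision(matches, total):
    """True when a non-zero match count is displayed as zero."""
    return matches > 0 and float(percent(matches, total)) == 0.0
```

With these definitions, one match in 400 records shows as 0.250, while one match in a million records shows as 0.000 even though the site was reported.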
Viewing the spam site database
See all of the
most recent URIs.
See all of the
most recent FQDNs.
Web directory tree of spam URI data
Since we felt that CGI access could be a performance bottleneck,
we decided that later versions should use a directory tree
of dynamically-updated static text files.
The data would be in static files in a directory
tree that would get new data added to them when it came in.
There would be an expiration and pruning mechanism to
purge old data.
Eric's original design suggestion looked like this:
The way I think we can make this *really* efficient is if we go
to flat text files and invert the domain. So what you would do
is create a directory structure that looks something like the
following:
spamcheck.freeapp.net/
+ com/index.html
+ webuymed/index.html
+ foobar/index.html
+ blah/index.html
You may not want to do the subdomains, but basically the idea
is that you pre-generate all the results and push the matching to
the user side. This may be a little heavier on the bandwidth
side if we grab say all the *.com urls, but the side benefit is
that we can ask that people request this service via caching
proxy servers which will have no problem caching this data since
it's no longer in the query string.
To query for say 'http://www.preempt.biz/wicdhvidcisdwbx/frx.html',
I would just grab:
http://spamcheck.freeapp.net/biz/preempt/www/
This would return all the urls for www.preempt.biz. If I just
requested http://spamcheck.freeapp.net/biz/preempt, I would get
all the *.preempt.biz spam urls.
This shouldn't be too expensive to generate, though you will need
to have something prune the directories, which you could probably
do by the modification time of the index.html file.
You would end up with a few hundred directories, but this would
be very efficient and should scale to however much bandwidth you
have to support this.
Well this is now implemented,
and the live data can be browsed in the
"domains" directory tree.
We put the data into
plain text index.txt files containing all the records of the
current level and its
children (branches and leaves below) with the exception that
the top level domain or IP address has no such summary.
Summaries start at the second level and go to three levels
for domain names and all four levels for IP address URIs.
In principle we can create summaries for any arbitrary levels,
though we felt that the current arrangement was probably optimally useful
for our purposes.
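The inverted-domain lookup path from Eric's suggestion can be sketched in Python as follows. The function name and the `levels` parameter are illustrative assumptions; only domain names are handled, since the page does not say how IP address URIs are laid out in the tree.

```python
from urllib.parse import urlsplit


def inverted_path(uri, levels=None):
    """Turn a URI into an inverted-domain directory path.

    levels, if given, keeps only that many labels counted from the
    top level domain downward.
    """
    host = urlsplit(uri).hostname
    if not host:
        raise ValueError(f"no host found in {uri!r}")
    labels = host.rstrip(".").split(".")
    labels.reverse()
    return "/".join(labels[:levels])
```

For the example URI in the quoted message, this gives `biz/preempt/www`, and `levels=2` gives `biz/preempt`, which would correspond to the summary covering all of `*.preempt.biz`.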
Like the history files that the CGIs use for data,
the data in the domains directory tree are in flat text file databases,
with one record per line,
and with a timestamp and URI in each record separated by a tab,
as is long-standing UN*X custom for simple applications.
Specifics of the timestamp format are in the Notes section.
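A client reading these files might parse each record along the following lines. This is a sketch under the stated format (tab-separated, timestamp first, `YYYY-MM-DD HH:MM` in GMT); the function name is invented for illustration.

```python
from datetime import datetime, timezone


def parse_record(line):
    """Split one 'timestamp<TAB>URI' record into (datetime, uri)."""
    stamp, uri = line.rstrip("\n").split("\t", 1)
    when = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
    # Timestamps are GMT (UTC), so attach that zone explicitly.
    return when.replace(tzinfo=timezone.utc), uri
```

Because the timestamp components run from most to least significant, the raw lines also sort correctly as plain text.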
Rather than using index.html or enabling the DirectoryIndex
web server directive for the index.txt files,
we left everything visible as plain directories and files.
The automated lookups from Eric's SA plug-in or other access methods
don't need this visibility,
but it can be handy for human browsing of the URI data,
as you may have seen if you followed the link above.
All data in the directory tree are currently expired after 4 days,
including records within each individual file and the tree structure itself.
Other applications using this data may be designed to make
finer-grained use of the timestamps to increase their time-relevance
as necessary or desired.
Eric observed that four days' worth of data seemed to adequately
capture most of the relevant spam URIs.
We may tune these time values later.
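A client that wants a shorter window than the server's own expiration could filter records by timestamp itself. The following Python sketch assumes the tab-separated record format described above; the function name, the default of four days and the sample data are illustrative only.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=4)  # mirrors the current server-side expiration


def unexpired(lines, now, max_age=MAX_AGE):
    """Keep only records whose timestamp is within max_age of now."""
    cutoff = now - max_age
    kept = []
    for line in lines:
        stamp, _, _uri = line.rstrip("\n").partition("\t")
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        when = when.replace(tzinfo=timezone.utc)  # timestamps are GMT
        if when >= cutoff:
            kept.append(line)
    return kept


sample = [
    "2004-04-06 10:15\thttp://a.example/x",
    "2004-04-01 09:00\thttp://b.example/",
]
now = datetime(2004, 4, 6, 12, 0, tzinfo=timezone.utc)
```

Passing a smaller `max_age`, such as one hour, gives the finer-grained time relevance mentioned above without any change on the server.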
Both tree and CGI versions of the data exist for now, though
we may deprecate the CGIs later in favor of the much less
server-CPU-intensive tree.
We could also get rid of or disable the CGIs but keep the history database
files for client applications to make use of on their own.
Downloading the code for the spam site data service
You can get a gzipped tar of the code for this spam data service
as spam-uri-data-service-1.30.tgz.
Here's the
README which includes installation instructions
and a description of the programs and files within.
Comments, open issues and future directions
- Whitelisting: The MTA or mail processing program
probably should make use of a whitelist to prevent
the blocking of messages that exclusively mention legitimate sites.
All that should be necessary is a list of legitimate, non-spam-friendly
domain names to match message-contained URIs against.
Whitelists in general are pretty widely discussed and used in anti-spam
efforts. Their use could be extended here.
(Update: Eric's SA plug-in supports whitelisting
and blacklisting of message body URIs.)
(Additional update: as discussed on the SURBL site,
since legitimate domains seldom get into SURBL, the
now preferred method of accessing this data, whitelisting
on the SURBL-using client side becomes far less necessary.)
- Scaling: The current proof of concept design results in
pretty intensive web server hits.
This probably would not scale well under wide use.
(Pretty much all of the scaling issues have been addressed well by
moving to an RBL, which is DNS based. The following concerns are
made moot, but included to show some of the ideas we went through
while trying to re-invent the wheel in avoiding the use of an RBL. :-)
Some possible solutions include:
- Caching of lookups on the client side.
Caching could help performance significantly since spam sites seem to be repeated
often (until the randomizers catch on).
(Update: Eric's earlier SA plug-in made use of client-side caching.)
(Additional update: since Eric is now updating SpamCopURI
to use SURBL, all the caching, distribution and expiration is
efficiently handled by DNS. Leveraging of DNS mechanisms for this
purpose is a strong advantage of RBLs.)
- Setting up local servers closer to each logical client network
(Mooted by DNS use of SURBL. Data servers are now as near as the
name server and local cache.)
- Volunteer server farm
(Mooted by DNS use of SURBL. Data servers are now as near as the
name server and local cache.)
- Use of a more efficient protocol than http.
(On the other hand, use of the web protocol enables the leveraging
of much already-existing content clustering and mirroring technology.)
(Mooted by DNS use of SURBL. DNS is a very efficient distributed database.)
- Distributing the entire database to every client with
periodic incremental updates
(Mooted by DNS use of SURBL,
which of course distributes the database in the form of a zone file.)
- This is a useful start to show what this approach can do.
Even with the limitations, it clearly can be rather effective.
The approach could be developed further from here.
(And it has been.)
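The whitelisting idea from the first item above could look roughly like this in Python. It is a sketch only: the function names are invented, the whitelist is assumed to be a plain set of domain names, and a host is treated as whitelisted when it equals a listed domain or is a subdomain of one. This is not how Eric's plug-in implements it.

```python
def is_whitelisted(host, whitelist):
    """True if host equals a whitelisted domain or is a subdomain of one."""
    host = host.lower().rstrip(".")
    return any(host == d or host.endswith("." + d) for d in whitelist)


def should_tag(hosts, reported, whitelist):
    """Tag a message if any of its hosts is reported and not whitelisted."""
    return any(
        h.lower() in reported and not is_whitelisted(h, whitelist)
        for h in hosts
    )
```

Matching on a leading dot keeps a name such as `notexample.com` from being treated as a subdomain of `example.com`.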
Notes
- This is an Experimental Service.
It may be subject to delay, unavailability and errors.
Use at your own risk.
We will, however, make a reasonable,
voluntary effort to keep it running and accurate.
We also intend to publish all code openly for review and further development.
If the concept looks promising we may scale it up with multiple servers,
mirroring, round robin DNS load balancing, geographic clusters, etc.
- You may of course install the code for this server
and run it locally.
Some centralization and mirroring of server content may be useful
in order to cut down on the load on SpamCop's servers.
An alternate approach would be to serve up the entire database to
clients and push the processing of the database onto the clients.
That would probably maximally distribute the load,
but it makes the client requirements more complex.
- Timestamps are GMT (UTC), not local time.
- Timestamps are in the format YYYY-MM-DD HH:MM where
YYYY is the four digit year,
the first MM is the two digit month,
DD is the two digit day,
HH is the two digit hour (24 hour clock),
the second MM is the two digit minute.
Other formats are probably possible,
provided the components appear with all digits and in this order,
so that times can be easily sorted numerically.
- Seconds are deliberately not used because the system timing
accuracy may not be fine-grained enough to support their use.
- Results are sorted and uniqued based on each entry's URI or FQDN
and its one minute resolution timestamp.
Because of the uniquing operation, duplicate entries occurring
within the same minute are not shown. This means the data
cannot be used as a count of reports, only a confirmation that
a particular pattern was reported.
- However every unique FQDN or URI will be present, so the data
is still useful as a yes or no confirmation of prior reporting.
- Timestamp and URI or FQDN are separated by a tab, as the default from the
*NIX paste command.
- Sort order is most recent first.
- Even with the somewhat coarse one minute timestamp resolution,
some jitter may be apparent in the source data, causing the calculated
timestamps to occasionally wander around with slightly inconsistent results.
For general URL detection, this jitter in the data is probably
not a significant issue.
-
The FQDN version of the data may be better for matching purposes
since some of the deliberate entropy which is sometimes added is
ignored, such as keywords, subdirectories, etc.
But the host portion, usernames, etc. are still potentially
problematic.
-
Searches, counts and percentages of URIs and FQDNs
won't match exactly since the URIs tend to be more unique.
Because the FQDNs are a more generalized view of the same source data,
they may be more generally useful.
I would therefore probably recommend using the FQDN percentages for
spam-ranking systems.
-
Since the source data is dynamic, results will change over time.
The same query may not return the same result a few minutes later.
-
Additional algorithms can and probably should be applied to remove
more of the variations that can be deliberately
added to domain names while still producing a working URL.
It would be nice to find a way to remove all but the registered domain
name, but that seems a non-trivial task, particularly given subdomains,
TLDs that are more than one dot-separated word,
geographic variation in TLDs, etc.
Techniques for extracting registered domains are probably extant.
(Note that this is not saying "how do you look up a domain at a registrar."
The problem is extracting the actual domain name from all the muck.
Humans do it trivially easily. Machines less trivially so.
Yet I'm sure it can be done.)
- Source for the csh shell script that grabs and processes the
data is
shown here.
Being a shell script calling mostly compiled C programs, it should be
a fairly efficient system, though I make no specific claims to that end.
Improvement is often possible.
- Source data is folded to lower case immediately after it is fetched.
This is due to the limitation that the sed here
does not seem to support case insensitive searches.
This could affect homemade client side pattern matching,
which should be made case insensitive.
- We are re-writing this code to work in a faster, more streamlined
way, but we will keep this site up as a reminder of an older, but
functional system.
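The registered-domain problem raised in the notes above can be approximated with a suffix table. The Python sketch below is a deliberately naive illustration: the three two-level suffixes are just examples, the function name is invented, and a realistic version would need a much fuller table of suffixes (publicly maintained suffix lists exist for this purpose).

```python
# Illustrative only: a tiny sample of TLDs that are more than one
# dot-separated word. A real table would be far larger.
TWO_LEVEL_SUFFIXES = {"co.uk", "com.au", "co.jp"}


def registered_domain(host):
    """Reduce a host name to its (approximate) registered domain."""
    labels = host.lower().rstrip(".").split(".")
    if len(labels) < 2:
        return ".".join(labels)
    # Keep one extra label when the last two labels form a known suffix.
    keep = 3 if ".".join(labels[-2:]) in TWO_LEVEL_SUFFIXES else 2
    return ".".join(labels[-keep:])
```

Lowercasing the host first also fits the note above about folding source data to lower case before matching.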
Version 1.11 by Jeff Chan on 4/6/04