Stopping spambots with hashes and honeypots
Ned Batchelder
Created 21 January 2007

Spam sucks. Any site which allows unauthenticated users to submit forms will have a problem with spamming software (spambots) submitting junk content. A common technique to prevent spambots is CAPTCHA, which requires people to perform a task (usually noisy text recognition) that would be extremely difficult to do with software. But CAPTCHAs annoy users, and are becoming more difficult even for people to get right.

Rather than stopping bots by having people identify themselves, we can stop the bots by making it difficult for them to make a successful post, or by having them inadvertently identify themselves as bots. This removes the burden from people, and leaves the comment form free of visible anti-spam measures.

This technique is how I prevent spambots on this site. It works. The method described here doesn't look at the content at all. It can be augmented with content-based prevention such as Akismet, but I find it works very well all by itself.

Know thy enemy

By watching how spammers fail to create spam on my site, there seem to be three different types of spam creators: playback spambots, form-filling spambots, and humans.

Playback bots

These are bots which have recorded POST data which they replay back to the form submission URL. A person visits the form the first time, and records the form data. Certain fields are marked as slots to be filled in with randomized spam later, but the structure of the form is played back verbatim each time. This includes the names of the fields, and the contents of hidden fields. These bots don't even bother looking at the form as served by the site, but blindly post their canned data to the submission URL. Using unusual field names to avoid these bots will only work for a week or so, when they will then record the new field name, and begin posting with it.

A playback bot can be stopped by varying the hidden data on the form so that it will not be valid forever.
A timestamp is a simple way to do this, making it possible to detect when old data is being replayed. The timestamp can be made tamper-proof by hashing it with a secret and including the hash in the hidden data of the form. Replaying can be further hindered by including the client's IP address in the hash, so that data can't even be immediately replayed across an army of spambots.

Form-filling bots

These bots read the form served by the site, and mechanically fill data into the fields. I don't know if they understand common field names (email, name, subject) or not. On my site, I've observed bots that look at the type of the field, and fill in data based on the type. Single-line edit controls (type=text) get name, email, and subject, while textareas get the comment body. Some bots will fill the same data into all the fields of the same type, while others will enter (for example) different first names into each of the single-line fields.

Form-filling bots can be stopped by including editable fields on the form that are invisible to people. These fields are called honeypots and are validated when the form data is posted. If they contain any text, then the submitter must be a bot, and the submission is rejected.

Using randomized obscured field names, and strict validation can also stop these bots. If the email field must have an @-sign, and the name field must not, and the bot can't tell which field is email and which is name, then the chances it will make a successful post have been greatly reduced.

Humans

These are actual people using your form. There's nothing you can do to stop them, other than to remove the incentive. They want link traffic. Use the rel="nofollow" attribute on all links, and be clear that you are doing it.

Building the bot-proof form

The comment form has four key components: timestamp, spinner, field names, and honeypots.

The timestamp is simply the number of seconds since some fixed point in time.
For example, the PHP function time() follows the Unix convention of returning seconds since 1/1/1970.

The spinner is a hidden field used for a few things: it hashes together a number of values that prevent tampering and replays, and is used to obscure field names. The spinner is an MD5 hash of:
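A minimal sketch of computing such a spinner, assuming its ingredients are the ones named earlier for the tamper-proof hash: the timestamp, the client's IP address, and a server-side secret. The function name, the `SECRET` value, the separator, and the example IP address are all hypothetical, and Python is used purely for illustration.

```python
import hashlib
import time

SECRET = "change-me"  # hypothetical server-side secret; never sent to the client


def make_spinner(timestamp: int, client_ip: str, secret: str = SECRET) -> str:
    """Hash the timestamp, client IP, and secret into a 32-character hex string."""
    material = f"{timestamp}:{client_ip}:{secret}"
    return hashlib.md5(material.encode("utf-8")).hexdigest()


timestamp = int(time.time())  # seconds since 1/1/1970, like PHP's time()
spinner = make_spinner(timestamp, "203.0.113.7")  # example address only
```

MD5 is used here only because the article names it; a keyed hash such as HMAC-SHA256 from Python's `hmac` module would likely be a sturdier choice today.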
The field names on the form are all randomized. They are hashes of the real field name, the spinner, and a secret. The spinner gets a fixed field name, but all other fields on the form, including the submission buttons, use hashed field names.

Honeypot fields are invisible fields on the form. Invisible is different than hidden. Hidden is a type of field that is not displayed for editing. Bots understand hidden fields, because hidden fields often carry identifying information that has to be returned intact. Invisible fields are ordinary editable fields that have been made invisible in the browser.

The invisibility of the honeypot fields is a key way that bots reveal themselves. Because bots do not process the entirety of the HTML, CSS, and Javascript in the form, and because they do not build a visual representation of the page, and because they do not perceive the form as people do, they cannot distinguish invisible fields from visible ones. They will put data into honeypot fields because they don't know any better.

The form is built as usual, including:
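A rough sketch of such a form, assuming MD5 for the field-name hashes and a single honeypot hidden with display:none. The helper names, the `SECRET` value, the real field names, and the leading "f" on each hashed name are illustrative assumptions, not the author's code.

```python
import hashlib
import html

SECRET = "change-me"  # hypothetical server-side secret
REAL_NAMES = ("timestamp", "name", "email", "honeypot", "comment", "submit")


def field_name(real_name: str, spinner: str, secret: str = SECRET) -> str:
    """Obscure a real field name by hashing it with the spinner and the secret."""
    digest = hashlib.md5(f"{real_name}:{spinner}:{secret}".encode("utf-8")).hexdigest()
    return "f" + digest  # leading letter is an assumption, so names never start with a digit


def render_form(spinner: str, timestamp: int) -> str:
    """Build a comment form in which only the spinner keeps a fixed field name."""
    names = {real: field_name(real, spinner) for real in REAL_NAMES}
    rows = [
        f'<input type="hidden" name="spinner" value="{html.escape(spinner)}">',
        f'<input type="hidden" name="{names["timestamp"]}" value="{timestamp}">',
        f'<input type="text" name="{names["name"]}">',
        f'<input type="text" name="{names["email"]}">',
        # Honeypot: an ordinary editable field, made invisible to people with CSS.
        f'<input type="text" name="{names["honeypot"]}" style="display:none">',
        f'<textarea name="{names["comment"]}"></textarea>',
        f'<input type="submit" name="{names["submit"]}" value="Post">',
    ]
    return '<form method="post">\n' + "\n".join(rows) + "\n</form>"
```

Because the spinner changes with every timestamp and client, the hashed names change too, so a recorded set of field names stops being useful to a playback bot.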
Processing the post data

When the form is posted back to the server, a number of checks are made to determine if the form is valid. If any validation fails, the submission is rejected.

First the spinner field is read, and is used to hash all of the real field names into their hashed counterparts so that we can find data on the form.

The timestamp is checked. If it is too far in the past, or if it is in the future, the form is invalid. Of course a missing or non-integer timestamp is also a deal-breaker.

The value of the spinner is checked. The same hash that created it in the first place is re-computed to see that the spinner hasn't been tampered with. (Note that this check isn't actually necessary, since if the spinner had been modified, it wouldn't have successfully hashed the timestamp field name and the timestamp verification would already have failed, but the extra check is harmless and reassuring.)

Check the honeypots. If any of them have any text in them, the submission is rejected.

Validate all the rest of the data as usual, for example, name, email, website, and so on.

At this point, if all of the validation succeeded, you know that you have a post from a human. You can also apply content-based spam prevention, but I have not found it to be necessary.

Making honeypots invisible

This is the essence of catching the bots. The idea here is to do something to keep the honeypot fields from being visible (or tempting) to people, but that bots won't be able to pick up on. There are lots of possibilities. As you can see from looking at my comment form, I've simply added a style attribute that sets display:none, but there are lots of other ideas:
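However the honeypots are hidden, the server-side checks stay the same. The validation order described under "Processing the post data" might be sketched as below. This assumes the spinner hashes the timestamp, client IP, and a secret, and that field names are hashed from the real name, spinner, and secret; `SECRET`, `MAX_AGE_SECONDS`, the honeypot name, and the helper functions are hypothetical.

```python
import hashlib
import time

SECRET = "change-me"       # hypothetical server-side secret
MAX_AGE_SECONDS = 60 * 60  # hypothetical limit; the article does not state one
HONEYPOTS = ("honeypot",)  # hypothetical real name of the invisible field


def md5_hex(*parts):
    """MD5 of the parts joined with ':' (the separator is an assumption)."""
    return hashlib.md5(":".join(parts).encode("utf-8")).hexdigest()


def field_name(real_name, spinner):
    return "f" + md5_hex(real_name, spinner, SECRET)


def validate(post, client_ip, now=None):
    """Return the de-obscured fields if every check passes, otherwise None."""
    now = int(time.time()) if now is None else now
    spinner = post.get("spinner", "")

    def get(real_name):
        # 1. The spinner maps each real field name to its hashed counterpart.
        return post.get(field_name(real_name, spinner), "")

    # 2. Timestamp: must be an integer, not in the future, and not too old.
    try:
        timestamp = int(get("timestamp"))
    except ValueError:
        return None
    if timestamp > now or now - timestamp > MAX_AGE_SECONDS:
        return None

    # 3. Spinner: recompute the hash to confirm it was not tampered with.
    if spinner != md5_hex(str(timestamp), client_ip, SECRET):
        return None

    # 4. Honeypots: any text at all means the submitter is a bot.
    if any(get(name) for name in HONEYPOTS):
        return None

    # 5. Ordinary validation of the remaining fields.
    fields = {name: get(name) for name in ("name", "email", "comment")}
    if "@" not in fields["email"] or "@" in fields["name"]:
        return None
    return fields
```

A caller would reject the submission whenever `validate` returns None, and otherwise store the returned fields.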
Criticisms

Let me address a few common criticisms.

Defeatability

In theory, it is possible for a spambot to defeat any of these measures. But in practice, bots are very stupid, and the simplest trick will confuse them. Spam prevention doesn't have to make it theoretically impossible to post spam, it just has to make it more difficult than most of the interesting forms on the internet. Spammers don't make software that can post to any form, they make software that can post to many forms. A relevant joke:
In any case, yes, spammers may eventually write spambots sophisticated enough to navigate honeypots properly. If and when they do, we can switch back to CAPTCHAs. In the meantime, honeypots work really well, and there are lots of ways to make them invisible we haven't even needed to use yet.

Accessibility

Users that don't use CSS or Javascript will be exposed to all of the honeypot fields. A simple solution is to label the fields so that these users will leave them untouched. As long as no text is entered into them, the form will submit just fine.

It works

This technique has been keeping spam off my site for a year now, and works really well. I have not had problems with false positives, as Akismet has had. I have not had problems with false negatives, as keyword-based filtering has had. Spambots may get more sophisticated, but their software complexity will have to increase orders of magnitude before they can break this method.

See also
Comments
Hi Ned, a great recollection of anti-spam tactics, thank you.
One thing that you didn't mention and that I consider valid is checking that a minimum interval of time passes between the request of the page containing the form and its submission. Humans just won't (or shouldn't) spend less than, say, 15 seconds reading or browsing a page before submitting content via a form, while bots don't need that time at all.
So checking a 'form-generated-at' timestamp vs. a 'form-submitted-at' one and rejecting the post when those are too close makes a good bot detection method. What do you think?
Mort, that is an excellent idea. An entire area I haven't explored here is verifications like yours that confirm the typical human behavior: read the posting, load the comment form, type a comment, then post.
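Mort's check could be sketched as follows, assuming both times are Unix timestamps in seconds and that the 'form-generated-at' value is one the server can trust (for example, the tamper-proof timestamp from the article). The function name and the default threshold are illustrative.

```python
MIN_SECONDS = 15  # the interval suggested in the comment; tune it for your site


def submitted_too_fast(form_generated_at: int, form_submitted_at: int,
                       min_seconds: int = MIN_SECONDS) -> bool:
    """True when the form came back sooner than a person could plausibly manage."""
    return form_submitted_at - form_generated_at < min_seconds
```

For example, `submitted_too_fast(100, 105)` is true, while `submitted_too_fast(100, 130)` is false.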
Very good idea, one more that could help with human spam/bad comments in general. Add a question/combo box asking what the blog post was about. For example the choices could be: timestamps, preventing blog spam, or equilateral triangles. A human would actually need to read some of the post to get it right, and a spam bot would only have a 1 in 3 chance.
It's not as bad as a CAPTCHA either, since it actually relates to the content they are commenting on, and it would require human spammers to at least read some of the post, slowing down the rate of spam.
I was recently highly depressed by spambots. I unveiled a new, custom-written set of forum software on my site, and enabled posting without registration for a period. Within hours of this happening, the spambots appeared and started doing their thing -- on a form that they'd never seen before, even bypassing a hash system similar to one documented here.
I ended up turning off unregistered user posting, but I was really, really irritated by the experience.
Good food for thought in this post!
Here's my own logic: If we can make it (near?) impossible for human spammers to succeed, we can happily forget about spambots, as the bot problem would be solved. So perhaps we need to focus on making sure that humans who enter comments do so out of legitimate motivation.
The multiple choice method mentioned above is a strong pointer in that direction.
A nice twist to mort's delay method would be to not let the user know how much delay is required in error messages. If a user knows it's 15 seconds he's got something systematic to work with (in a way we might not want). Also, I'd suggest randomly varying the delay time for each request.
How would you integrate these techniques with something like the xml validation used with this (or variations of this) technique: http://peter.mapledesign.co.uk/weblog/archives/category/questsformprocessor/
JF: My goal here is to avoid any effort by the people I want to encourage: commenters. If we have to use those techniques, then your idea is interesting, and there are many other semantic-based CAPTCHAs that can be explored.
Ray: I fear that there is no way to truly separate good people from bad people. If a human spammer wants to post a comment, they will be able to pass whatever Turing test we pose for them. Content-based filters like Akismet can help here, but that will be a very tough nut to crack.
Chris: I haven't used an existing forms package for this site, so I don't know what would be involved to adapt it for these techniques. I imagine the field hashing and unhashing could be performed as a last step and first step wrapped around a standard forms library, but I don't really know.
I had only a couple forms to validate on the Musi-Cal website. I used SpamBayes to filter the submissions. This had the added advantage of flagging otherwise valid submissions where users had mistyped something (bad date, misspelled city, etc).
Skip
I was interested in putting this through 508 validation. Somewhat ironically, at first look it fails because INPUT elements are required to have an alt attribute or a LABEL, not because you can't see the fields. Very interesting, Ned.
One strategy is to have the form processing program (1) capture the IP, then (2) try to send a confirmation email to the poster, and
(3) if the email address is invalid, switch to displaying a "hit the back button and check your email address" message; this stops bots and people who give a fictitious email address when they post...
For sites that require the users to have JavaScript enabled, a good way of stopping spambots could be to require that they have a full JS implementation, by presenting a challenge in the form of a javascript function that should be run in order to get some number, which should be included in the response.
The function should be randomized, e.g. using different constants, loops, math operations, so it would require the spambot to evaluate it every time.
The spambots based on e.g. firefox+greasemonkey would make it relatively easy to break this protection, but even then the spammers would need more resource (JS is not exactly fast) for spamming than they do now. Or so I hope.
While I like your methods, they might cause trouble to disabled people and to those with strange browsers.
In my opinion Javascript is best disabled - I use the Firefox NoScript plugin for this.
Also I think that your approach only works if the robot's screen is different from the user. I'm really tempted to try a script in Autohotkey (www.autohotkey.com) that does the following:
1. Search the site for the comment section "add a comment:",then "name:", "email:","www:".
2. Then go back to "name:" and move the mouse pointer 300 pixels to the right and click.
3. Enter name, TAB, email, TAB, TAB,Spam text,TAB,Enter
4. Reset DSL-Line for new IP
5. Reset Cookies
6. Reload your page
7. Wait 10 seconds
8. Goto #1
I wouldn't be surprised if the comment list would fill up fast.
As bots don't know JavaScript another method would be to use captcha validation only if JavaScript is disabled.
Hello,
First of all, congratulations for the strategy, I am using something similar, but quite simpler, and now will improve it.
One question: could you please explain what is the purpose of the PHPSESSID hidden field? I doubt that you are using it, because it would mean one cannot post if session cookies are disabled or if the session has expired. So what is the benefit of this field?
Thank you.
----
P.S. Well well well -
"You took a long time entering this post. Please preview it and submit it again."
This actually means that everyone who posts a comment after spending some time thoroughly reading your article will have to submit twice. Why not move the slider more to the usability side? After all, websites are made FOR the people, not AGAINST spam bots.
In addition, please make the page position itself on the submitted comment or on the unsubmitted form after posting.
Best regards and good luck!
Dimo
Thanks for the article. I've been protecting my contact form with a fairly crude method that adds a random number to a hidden field, and passes the same number in a cookie. I assumed the spambots wouldn't use cookies so would be filtered out. Until this week it worked fine, but now I've started to receive spam e-mails again. I'm going to replace the random number with a hashed version, and I might try a honeypot field too.
Outstanding article. I have a (widely distributed) stand-alone and WordPress contact form that uses the techniques you describe as well as a few others. I call them all spam traps collectively, but I do like the names you've given them. :)
Interesting article, I'd never thought to use a honeypot technique (I came via the WSG list by the way)... I've had, like everyone, battles with spammers both on my and my clients sites and have avoided captchas like the plague. Thanks for posting this, greatly appreciated.
Another way to hide text fields is to just surround them with comment tags. On my guestbook form page I set a cookie using javascript then retrieve the cookie during form processing in the cgi script. If the cookie is not found you get an error message stating javascript and cookies both need to be enabled to post. Setting it with javascript will eliminate most of the non human posters not to mention most bots can't store cookies.
All great ideas to keep spambot form submissions from going through, but how to stop their endless attempts from spoofed and randomized IP addresses that can bring a server down?
This is a great article, thank you for the insight!
I still don't get it. I'd like to use the hidden fields via css. Where can I find more info on exactly how to do this? There has to be more to it than display:none ? How does the form not get submitted? We're bombarded with form spam to the point that I had to remove the form. Thanks for the article!
Thanks, very nice introduction. Meanwhile, I'd like to share my solution, a smart textual captcha:
Advanced Textual Confirmation
http://bbantispam.com/atc/
It works very well, is easy to install, and has drawn no complaints from visitors.
Two tricks I use on my site's contact form - they should work on a comment form too.
1. when the form page is served a token is created together with a timestamp - these are stored in a text file. When a submission is received the system checks:
a) does the token exist
b) was the submission too slow or quick for a human
2. I reject submissions containing the string "http://" and ask the user to remove those from the web addresses.
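These two tricks might look roughly like this. The sketch keeps tokens in an in-memory dict rather than the text file the comment describes, and the time bounds, function names, and single-use behaviour are assumptions added for illustration.

```python
import secrets
import time

MIN_SECONDS = 5      # hypothetical "too quick for a human" bound
MAX_SECONDS = 3600   # hypothetical "too slow" bound
_issued = {}         # token -> issue time; the comment stores these in a text file


def issue_token(now=None):
    """Create a random token when the form page is served and remember when."""
    token = secrets.token_hex(16)
    _issued[token] = time.time() if now is None else now
    return token


def accept_submission(token, text, now=None):
    """Apply checks 1a, 1b, and 2 from the comment to one submission."""
    now = time.time() if now is None else now
    issued_at = _issued.pop(token, None)  # assumption: each token works only once
    if issued_at is None:
        return False  # 1a: the token does not exist
    if not MIN_SECONDS <= now - issued_at <= MAX_SECONDS:
        return False  # 1b: too quick or too slow for a human
    return "http://" not in text  # 2: reject text containing "http://"
```

A real handler would show the "please remove http://" message for the last case instead of silently refusing.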
I've been having to deal with these problems a lot recently (and having to make 508 compliant sites makes captchas out of the question), and I have come to similar conclusions on my own. Actually, nearly identical. Strange. Anyhow, my forms now create a hash of each 'real' field name, pass the decryption key in the session to the submission script, decrypt the field names, check that the invisible honey pot fields, which either contain a random string or nothing, are not molested between the form and the submission script, replace all field keywords (i.e. email, comments, etc.) with the HTML numeric character equivalents (i.e. &#80; for P; not quite sure if bots scan for keywords near fields though), scan for common spam words and HTML, and only then, if all of these conditions are met, will the form submit.
Another thought I had was using javascript requiring very simple human interaction - focusing or clicking on the page would pull all of the form fields from a linked (but dynamically generated) javascript file using 'document.write' to output the code to the page. That would require that not only the spambot be capable of running javascript, but also that it would have to download all linked .js files to the page, and actually perform an interaction with the page to write out the randomly generated form fields.
One thing you are doing that's an extra check is checking the timestamp value.
"The timestamp is checked. If it is too far in the past, or if it is in the future, the form is invalid. Of course a missing or non-integer timestamp is also a deal-breaker."
If you have hashed the timestamp in the main hash, if someone screws around with the hidden timestamp field, it will never match the created hash.
So while you can still check how long it has been since that hash was generated, it would be impossible for someone to generate a future hash without knowing your secret/hash parts.
Excellent idea!
Things I noticed on this blog are that honeypots of this blog do not seem to have:
- the value initialized.
- the id.
So, if they have id='foo' and value='bar', writing a spambot to detect honeypots on this site would be even more difficult.
Here's a honeypot/spamtrap CGI program I use on some pages:
----
#!/bin/bash
#
# Anyone who requests this CGI gets added to the "Deny from" line in
# two files, .htaccess and .htaccess_recent.
#
# First put this CGI in your robots.txt file to prevent legitimate spiders from
# finding it. You should wait about a day since many spiders (such as Google)
# only request robots.txt once per day.
# Next, put empty links to it in various HTML pages to trick
# bad spiders into tripping it.
# After you've verified that it's working correctly, you can change
# the htaccess file below to point to a real .htaccess file that blocks
# sites.
#
# Use a cron job to run spamtrap_rotate_lists.sh to periodically move .htaccess_recent over .htaccess and recreate
# it as an empty htaccess file. This prevents .htaccess from filling up
# with old addresses.
# You must edit spamtrap_rotate_lists.sh to set the correct directory
# and file names.
#
file="htaccess" #change this to the real one.
function error() {
    echo "File error adding entry to $1."
    exit
}
echo "Content-type: text/html"
echo
echo ""
# -F matches the address literally, so its dots are not treated as regex wildcards
if grep -qF "$REMOTE_ADDR" "${file}"
then
    echo "Already in the list."
    exit
fi
sed -i "s/Deny from.*/& ${REMOTE_ADDR}/" "${file}" 2>&1 || error "${file}"
sed -i "s/Deny from.*/& ${REMOTE_ADDR}/" "${file}_recent" 2>&1 || error "${file}_recent"
echo "Your address $REMOTE_ADDR will now be blocked from this site. This is a trap for automated spiders that do not honor the robots.txt file. Email \"webmaster\" at this site for details."
----
file rotation script (run monthly as a cron job):
----
#!/bin/bash
dir=/usr/lib/cgi-bin/spamtrap
file=${dir}/htaccess
file_recent=${file}_recent
file_default=${file}_default
mv ${file_recent} ${file}
cp ${file_default} ${file_recent}
----
htaccess_default:
----
# htaccess_default
# 88.151.114.* is webbot.org (webbot.ru)
# you could add more known spamming domains if you want:
Deny from 88.151.114.*
reCAPTCHA (http://recaptcha.net/) is an interesting project on this topic. From a developer's point of view, the best part is that you don't have to (re)invent your own CAPTCHA for your site; you can use it as a service in many ways:
http://recaptcha.net/resources.html
I've just created a new way to use its functionality:
http://code.google.com/p/mailhide-tag/
It is a JSP tag which helps developers to hide mail address from spambots.
Ned,
I think the methods you suggest are valid and for most, will work outstandingly.
Like with captcha, the main point is that we don't want to create traps into which existing bots fall, but traps which are completely unavoidable, even if you know the drill. That's a bit higher mathematics.
The problem is that IF the spammer is interested enough in your site, he/she WILL write a script to defeat all these ordinary methods. The point of captcha is to force the user to do something that a computer just is not able to repeat, even if someone tries to teach it to.
Of course, the quest for a perfect captcha is still on. Google has done quite good, with almost 100% accuracy.
We just released a free (BSD license) blogging app for Django that implements honeypot spam prevention based on these ideas. See this post for details: http://blog.blogcosm.com/2007/12/06/developers-we-just-released-blogmaker-free-blogging-app-django/
Thanks for an interesting post.
I am currently doing battle with these things and have them beat for now with custom CAPTCHAS but...
Based on the info here I will have to take further measures in the future.
As my studio manages a number of sites I want my next solution to be comprehensive and more robust.
Very interesting post. Enjoyed thoroughly.
It's obvious in retrospect, but something that tripped me up is that this strategy won't work with cached pages. The page with the comment form has to be generated for each request with the proper timestamp and IP address. If the page is cached the spinner is not correct for the current user and the form submission looks like spam.
I have used a fairly simple technique to stop the bots. I have not had a single bot registration since implementing it.
After discovering that the bots are only registering at my site to promote a web site URL for search engine ranking, I have modified the 'register new user' script to barf an error message and fail the registration if the new user submits a web site URL. So far I have only had 2 people advise me of the problem when registering so I just update their web site URL manually. Very simple and very effective.
Very useful information, Ned.
I'm developing an alternative technique called Pictcha.
It is a protection in a form of an image retrieved from UTYP engine, which can be embedded inside Web Forms, and which will filter out various spams in a more user friendly way than the well known captchas.
http://nthinking.net/miss/pictcha.html (simple JavaScript implementation)
http://nthinking.net/miss/pictcha-sample.php (server side PHP implementation)
, there is also a PHP lib for server
And as a bonus, it is learning. Thus it recycles the tremendous waste of concentration which is the conventional Form Validation by text recognition.
You may check the lab page to get more details about it:
http://nthinking.net/miss/lab.html
I agree completely with Kent Johnson. There are now already some sites that don't work correctly when, for example, Opera is restoring the last session.
I also find this "submitting twice" thing annoying, as Dimo said. But perhaps not too annoying. At least this is better than sites that outright discard your comments when you submit, when you just took one hour to write your comment and have not saved it somewhere.
In all, I see this (and the current status of email spam) as that the spammers have already won. I believe the spammers' true intention is to destroy the web and email as a viable means of communications, and we have, hopefully only reluctantly, in fact helped them achieve their real goals.
I find rel="nofollow" to be somewhat rude to the commenters.
I have an idea I haven't tried yet: use rel="nofollow" for all new comments by default and remove it when the comment is marked as non-spam. With all that being clearly indicated.
Testing ...
Thank you very much for this article, I guess no one was taking this dimension into enough consideration yet.
We in Web-APP, will absolutely apply some of your tips.
Thanks
On
http://www.web-app.net
In the pursuit of blocking bots, we need to remember people with vision disabilities. (Well, at least if you write web sites for large businesses and governments, you do.) Visually impaired people may be using a screenreader which will "see" all the hidden traps you put out for bots. And if you don't use <label/> and <fieldsets/> tags and accessibility techniques like that, you're making life harder for some people. (And if you do improve accessibility, then the bots will find it easier to spam you, but is the purpose of your forms to solicit data from people or to block bots?) For anyone who's interested, Google on "section 508 compliance" or "accessible design."
One thing I did as an experiment that worked out very well is count the number of "http://" used in a single post. If more than a set amount (I use 3) then the post is marked as spam. Spammers seem to like to use an excessive amount of links. Maybe better to compare the number of links to the size of the post. I ended up using the bots by replacing the potential spam with a random message like "this site rocks!". Like I said, this overly simple technique worked out very well :)
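The link-counting check is small enough to sketch directly. The threshold of 3 comes from the comment; the function names and counting "https://" as well as "http://" are assumptions added here.

```python
import re

MAX_LINKS = 3  # the threshold mentioned in the comment
LINK_RE = re.compile(r"https?://", re.IGNORECASE)  # also counting https is an assumption


def count_links(text: str) -> int:
    """Count how many URL prefixes appear in the post."""
    return len(LINK_RE.findall(text))


def looks_like_link_spam(text: str, max_links: int = MAX_LINKS) -> bool:
    """Flag a post that contains more links than the allowed amount."""
    return count_links(text) > max_links
```

Comparing `count_links(text)` to `len(text)` would be one way to try the links-per-size idea the comment raises.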
Any chance you could post/send some of the key code for this? I've got a little hand-coded site and I am being inundated with spambots, despite my own feeble attempts to screen the posted content.
Captcha works great.
One person mentioned odd browsers.
In particular, I LIKE text-based browsers BECAUSE they don't display graphics, unreadable colour schemes, or have an expensive scripting engine.
It's just too bad that the (latest) HTML standards now pretty much require the DOM.
Time-based analysis may fail in cases where the form does things like time out, and the user is forced to copy&paste their comment. (I have had to do that for one site)