Yesterday one of the Bronco team wrote an interesting post suggesting that Google’s crawler may follow a 301 redirect on a robots.txt file, even when the target is on a separate domain!

At the time I first double-checked that our own bots don’t do anything quite so stupid, then said I thought it unlikely but would happily test it. Dave suggested a wager, oddly enough one I never took him up on, and I’m glad I didn’t!

How we crawl a robots.txt file

The need for speed is paramount when crawling a site. A bot is taking up server resources, and you want it to complete its required action in as short a time as possible. If your bot follows REP (what’s REP? See post notes for details), its first action should be to download the available robots.txt file. On average these files are 2-4kb in size, so they are very small and take no time to download. A file sent with a 404, however, is closer to 22-40kb, assuming it also sends the associated HTML, which is a much larger size. Given that the majority of sites do not have a robots.txt file, this means that if you are not careful your robot will spend more time downloading a useless file than anything else. The method we use is simply to ask initially for the first packets only: if the return is a status 200 we proceed to download the file; anything else and the status is stored and the file is ignored.
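To make the idea concrete, here is a minimal sketch of a "status first, body only on 200" fetch. It is not our production bot code; the function names `should_download` and `fetch_robots` are made up for illustration, and it assumes plain HTTP on the standard library's `http.client`, which does not follow redirects on its own.

```python
import http.client


def should_download(status):
    """Illustrative policy: only a plain 200 counts as a usable robots.txt."""
    return status == 200


def fetch_robots(host, port=80, timeout=10):
    """Return (status, body). body is None unless the status is 200."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", "/robots.txt")
        # getresponse() parses the status line and headers only;
        # the body is not read until we call read().
        resp = conn.getresponse()
        if should_download(resp.status):
            return resp.status, resp.read().decode("utf-8", errors="replace")
        # Any other status: keep the code, skip the body entirely.
        return resp.status, None
    finally:
        conn.close()
```

Closing the connection without calling `read()` avoids pulling down the rest of a large error page, although a few bytes of it may already be sitting in the socket buffer.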

Is the way you do your crawler the correct way, Tim?

There is no “official” recommendation within the RFC governing REP that covers how you should treat status codes and which you should follow. Following only status 200 is by far the most efficient method, but it comes at a cost, as you could be ignoring the file! It also doesn’t totally protect against downloading 404 pages, as some servers send out a status 200, not a 404, when a page cannot be found.
A draft proposal did suggest that other status codes should be followed, including 3xx codes for documents moved temporarily or permanently, but it did not explicitly mention dealing with cross-domain redirects.
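As a rough illustration, the status handling recommended in that draft could be written as a small lookup like the one below. This is my reading of the draft rather than a quote from it: the function name and the action labels are invented, and mapping 5xx to "defer" is an assumption based on the draft's wording about temporary failures.

```python
def robots_policy(status):
    """Map an HTTP status for /robots.txt to a crawler action (sketch only)."""
    if 200 <= status < 300:
        return "parse"            # success: read and obey the file
    if 300 <= status < 400:
        return "follow_redirect"  # draft recommends following redirects
    if status in (401, 403):
        return "disallow_all"     # access restricted: treat site as off limits
    if status == 404:
        return "allow_all"        # no file: no restrictions
    if 500 <= status < 600:
        return "defer"            # assumption: 5xx counts as a temporary failure
    return "unspecified"          # anything else is not covered


for code in (200, 301, 403, 404, 503, 666):
    print(code, robots_policy(code))
```

Note that nothing in this mapping says whether a redirect may cross to another domain, which is exactly the gap this post is about.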
I have started to make changes to our own bots (see post notes for details).

How does Google deal with a cross-domain 301 of a robots.txt file?

It reads the file, at least according to Webmaster Tools. In Bronco’s follow-up post they show Google Webmaster Tools accepting the bit.ly/robots.txt file. Nice. Should we be alarmed? Potentially, though only if you’re allowing your users custom URLs at root level on your domain with dots in them. So if you’re running a URL shortener then yes, perhaps it is something to check.
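If you write your own bot and want to avoid obeying a robots.txt that has been redirected to somebody else's domain, one possible check is to compare hosts before following the `Location` header. This is only a sketch; the function name `redirect_leaves_host` and the example domains are hypothetical.

```python
from urllib.parse import urljoin, urlsplit


def redirect_leaves_host(request_url, location):
    """True if following `location` would land on a different hostname."""
    # Location may be relative, so resolve it against the requested URL first.
    target = urljoin(request_url, location)
    # .hostname is lower-cased by urlsplit, so the comparison ignores case.
    return urlsplit(request_url).hostname != urlsplit(target).hostname


print(redirect_leaves_host("http://domain-a.example/robots.txt",
                           "http://domain-b.example/robots.txt"))
print(redirect_leaves_host("http://domain-a.example/robots.txt",
                           "/files/robots.txt"))
```

The first call crosses domains and the second stays on the same host. Whether a subdomain such as www should count as the "same" site is a design choice this sketch does not make for you.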

Did you do your own tests?

Yes, I had already done some tests last night which backed up what they found. Here is how I tested.

Experiment 1 – Cross-domain 301, oh please say this doesn’t work!
We created two domains, Domain A and Domain B, with the robots.txt on Domain A sending a 301 to a file on Domain B. That robots.txt disallowed access to the /test/ folder. A test folder containing an index file was put on both domains, and each domain was given a root index linking to the other domain and to each of the test files.

If Google read and obeyed the redirected robots.txt, then Domain A should have 1 indexed page and Domain B 2 when finished.
With a monitor attached to the logs, doing reverse DNS lookups to spot Google IPs so we could watch the interaction, some links were thrown at Domain A.
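For anyone wanting to build a similar log monitor, the reverse DNS check might look something like this. It is a sketch, not the monitor we actually ran: the function name is made up, and the extra forward lookup (resolving the hostname back to the IP) is the confirmation step Google documents for verifying Googlebot, added here so a spoofed PTR record is not enough to pass.

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_googlebot(ip, reverse=socket.gethostbyaddr, forward=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the domain, then forward-confirm it."""
    try:
        host = reverse(ip)[0]          # PTR lookup: IP to hostname
    except OSError:                    # covers socket.herror and socket.gaierror
        return False
    if not host.lower().endswith(GOOGLE_SUFFIXES):
        return False
    try:
        return ip in forward(host)[2]  # hostname must resolve back to the same IP
    except OSError:
        return False
```

The `reverse` and `forward` parameters exist only so the lookups can be swapped for fakes when testing; in normal use you would call `is_googlebot(ip)` with the IP taken from each log line.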

Result: Domain A 1 page indexed, Domain B 2 pages indexed

In Webmaster Tools – status 200

Experiment 2 – Let’s give Google the benefit of the doubt
OK, so maybe they have indeed adopted the 1997 draft and are therefore obeying redirects. Surely it will ignore a status 666, right?
Fresh domain. This time our robots.txt file will be in the correct location but will be sent with an HTTP status of 666.
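If you want to reproduce this kind of test locally, a valid robots.txt served with an arbitrary status code can be mocked up in a few lines. This is a sketch and not the setup used on the test domains: the names `make_server` and `probe`, the reason phrase "Test Status" and the loopback address are all assumptions for illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

ROBOTS_BODY = b"User-agent: *\nDisallow: /test/\n"


def make_server(status, port=0):
    """Build a server that sends a valid robots.txt with the given status."""
    class RobotsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/robots.txt":
                self.send_error(404)
                return
            # Non-standard codes such as 666 need an explicit reason phrase.
            self.send_response(status, "Test Status")
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(ROBOTS_BODY)))
            self.end_headers()
            self.wfile.write(ROBOTS_BODY)

        def log_message(self, format, *args):
            pass  # keep the console quiet

    return HTTPServer(("127.0.0.1", port), RobotsHandler)


def probe(status):
    """Start the server on a free port, fetch /robots.txt once, return (status, body)."""
    server = make_server(status)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
        conn.request("GET", "/robots.txt")
        resp = conn.getresponse()
        result = (resp.status, resp.read())
        conn.close()
        return result
    finally:
        server.shutdown()
        server.server_close()
```

To leave it running for a real crawler you would call something like `make_server(666, port=8000).serve_forever()`; changing the status argument gives the 503 and 404 variants used in the later experiments.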

Result: Domain A indexed 1 page

In Webmaster Tools – status 200

Experiment 3 – giving you a robots.txt file regardless
OK, so what if we tell you our server is broken, i.e. a 5xx error, but we give you a correct robots.txt file?
Fresh domain, correct location, but the headers sent are HTTP 503 – Service Unavailable. We are telling it we are not available; the server is buggered, in effect.

Result: Domain A indexed 1 page

In Webmaster Tools – status 200

Tim – If you think about this, it actually supports the belief that Google have programmed in the ability to follow 3xx, as otherwise it would have returned a 404, or a 200 and a blank file, for the 3xx.

Experiment 4 – I’m not here even though I’m here
Final test: send HTTP status 404 but also a valid robots.txt file. What are you going to do, Google?

Result: Domain A indexed 2 pages

In Webmaster Tools – status 404

Only in the final test did Google behave as if it was paying the blindest bit of notice to HTTP status codes. Can we assume 404 is hard-coded and it will accept anything else?

Why should you care?

While the potential for abuse is small unless you run something akin to a URL shortener, what happens when your site is producing an intermittent 500 error? From playing with status codes, it would seem that Google, if shown an invalid or unreachable robots.txt, will continue to use the old file. Could this be a potential for abuse? What about a sneaky Google-only 301 redirect from your robots.txt, set up by a very mischievous hacker? Food for thought, and I’m glad I didn’t take that bet with DaveN.

 
