Googlebot’s Javascript Interpreter: A Diagnostic

Warning: This is very old. I began writing this several months ago and just never published after some back and forth w/ Matt Cutts. Take with a grain of salt.

Over the past two weeks, multiple respected bloggers in the search community have commented on the increasing abilities of Googlebot, especially following Google’s announcement that it can now handle some forms of AJAX. I have, admittedly, long believed that we overestimate what Google and Googlebot are capable of, so I wanted to run a proper experiment to determine the exact capabilities of Googlebot in reading and interpreting Javascript.

The Question

How sophisticated is Googlebot’s Javascript interpretation and, more specifically, which Javascript functions can Google accurately interpret?

The Functions and Features Tested

  • Simple Variables: Can Google understand simple variable assignment such as "var foo = 'test content'; document.write(foo);"
  • Simple Variable Concatenation: Can Google interpret "var foo = 'test content'; foo += ' more '; document.write(foo);"
  • Simple document.write()
  • Simple element.innerHTML assignment
  • Dummy Variables: We added this test to make sure Google only indexes data that is printed to the page, and not every string randomly stored in a variable. (The test snippets are sketched just after this list.)
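
For illustration, the inline versions of the tests looked roughly like the sketch below. This is a reconstruction, not the exact test code; the variable names, strings, and the target element ID are placeholders.

```
// Reconstruction of the test snippets (illustrative placeholders, not the exact test code)

// 1. Simple variable assignment, printed to the page
var foo = 'test content one';
document.write(foo);

// 2. Simple concatenation, printed to the page
var bar = 'test content two';
bar += ' and more';
document.write(bar);

// 3. Writing into the DOM via innerHTML
//    (assumes an element such as <div id="target"></div> exists in the page)
document.getElementById('target').innerHTML = 'test content three';

// 4. Dummy variable: assigned but never written to the page,
//    so it should NOT show up in Google's index
var dummy = 'test content four that is never printed';
```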

The Methods Tested

  • Inline: We tested javascript embedded directly in the page
  • Included: We tested javascript in a simple external include
  • Included behind Robots.txt: We tested javascript in an external include blocked by Robots.txt (the three setups are sketched below)
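
Concretely, the three delivery methods amount to something like the hypothetical markup below; the paths and filenames are placeholders, not the actual test URLs.

```
<!-- 1. Inline: the script lives in the page itself -->
<script>
  document.write('inline test content');
</script>

<!-- 2. Included: the script is pulled from a normal, crawlable file -->
<script src="/js/test-open.js"></script>

<!-- 3. Included behind Robots.txt: same include, but the directory is disallowed, -->
<!--    e.g. robots.txt contains "Disallow: /js-blocked/" -->
<script src="/js-blocked/test-blocked.js"></script>
```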

The Results

|                   | Inline | Include | Blocked |
|-------------------|--------|---------|---------|
| Variables         | Yes    | Yes     | Yes?    |
| Concatenation     | Yes    | Yes     | Yes?    |
| document.write()  | Yes    | Yes     | Yes?    |
| element.innerHTML | Yes    | Yes     | Yes?    |
| Dummy             | No     | No      | No      |
| Total             | 5/5    | 5/5     | 5/5     |

Hold Your Breath

Everyone is probably now staring at the “Blocked” column, which indicates that Googlebot can and will interpret Javascript hidden behind a robots.txt exclusion. I took the time to verify this and check it with multiple sources before finally reaching out to the source of truth. Was Googlebot really ignoring robots.txt when considering Javascript includes?

In short, no.

Best Practices for Robots.txt and Javascript

First, let me state that it is likely that Google will at some point (if they don’t already) use blocked .JS and .CSS files as a negative signal. While there are legitimate reasons to block these files, there is no easy way for Google to verify that the contents of a page are not greatly modified by the blocked files. So, be careful.

That being said, Matt was kind enough to respond to my findings in great detail, and he pointed out several things one should consider when blocking .JS files, which ultimately explained the false positives in my analysis:

  1. Give Your Robots.txt a Head Start: This makes a lot of sense, but most webmasters (myself included) handle the new content and robots.txt at the same time.

    “In an ideal world, you’d wait 12 hours just to be completely safe. Essentially, any time you make a new directory and block it at the same time, there’s a race condition where it’s possible we would fetch the test.js before we saw it was blocked in robots.txt. That’s what happened here.” – Matt Cutts

    It is certainly untenable for Googlebot to re-fetch the robots.txt before every new file it downloads from your site, so giving it that head start can make a big difference.

  2. User-Agent Directives can Override One Another: This one was new to me, but it does make sense. If you begin with a generic “User-Agent: *” section and follow it with a specific “User-Agent: Googlebot” section, the latter overrides the former for Googlebot; it does not append to it. (A short robots.txt example appears after this list.)

    “If you disallow user-agent: * and then have a disallow user-agent: Googlebot, the more specific Googlebot section overrides the more general section–it doesn’t supplement it.” – Matt Cutts

  3. Robots.txt is only Respected Up to 500,000 Characters: I know this is a pretty big number, but if you have a lot of unique URLs to block, it can get messy. This is particularly frustrating with the Google Webmaster Tools Robots.txt checker, which only analyzes the first 100,000.
  4. To Be Certain, Use the X-Robots-Tag: There is a great writeup here on how to use the HTTP header X-Robots-Tag to indicate to Google that any file or filetype should not be indexed. Because this header is sent along with the file, Googlebot can respect it in real time. (A minimal header example appears after this list.)
  5. .JS Files can Be Slow to Clear from the Index: As is the case with any lower-priority crawled document, .JS files can take a while to clear from Google’s index if for some reason Google finds the blocked .JS.

    The crawl team said that once a .js file has been fetched, it can be cached in our indexing process for a while. – Matt Cutts

    If anything, that is an understatement: the .JS indexed two weeks ago is still present on pages that were indexed before Googlebot recognized the exclusion. I believe, though, that you can always use the emergency removal tool if this happens.
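
To make point 2 concrete, here is a hypothetical robots.txt illustrating the override behavior. Googlebot follows only its own section, so anything blocked only in the generic section stays crawlable for Googlebot unless it is repeated:

```
# Hypothetical robots.txt (illustrative paths)
User-agent: *
Disallow: /scripts/
Disallow: /private/

# Googlebot follows ONLY this section; the rules above do not apply to it.
# To keep /scripts/ blocked for Googlebot as well, it must be repeated here.
User-agent: Googlebot
Disallow: /private/
Disallow: /scripts/
```

And for point 4, a minimal sketch of serving the X-Robots-Tag header for .js files. This assumes an Apache server with mod_headers enabled; other servers have their own equivalents:

```
# Hypothetical Apache snippet (requires mod_headers)
<FilesMatch "\.js$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>
```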

Re-Running the Test

Of course, after hearing back from Matt, I needed to re-run the blocked .JS test to confirm. Sure enough, now that the .JS file was behind a previously-established blocked directory, Googlebot respected the disallow. (Also, just to be careful, I tested it on a separate domain with which Matt was not familiar, so I can assure you there was no trickery involved).

Takeaways

  1. On Javascript: Google is actually interpreting the Javascript it spiders. It is not merely trying to extract strings of text and it does appear to be nuanced enough to know what text is and is not added to the Document Object Model. This is impressive.
  2. On Experimenting: Confirm, retest, ask, retest, confirm, confirm, write, confirm, revise, confirm, publish.
  3. On SEO: Learn new shit every day.
