Today at work we were discussing Search Engine Optimisation and whether or not search engines actually index content a long way down the page. I devised the “kottke matrix test-tube babies” test, which originated from my recalling a crazy post on Jason Kottke’s blog about the 2nd and 3rd Matrix movies. That post had a ridiculous number of comments (457) and a lot of content, which I suspected would exceed any reasonable content threshold a search engine may have implemented. I took a search term from the final comment (test-tube babies) plus a couple of terms to narrow the search (kottke matrix), and fired away on the big search engines.
The results of the test: Google and Live are the only two that return the page I was targeting. It’s interesting to see, however, that Yahoo finds the Science page, which contains the term “test tube” but not “babies”. That page is 180KB in size and “test tube” appears approximately halfway down, whereas the Matrix Revolutions page is 646KB and “test-tube” appears at the very bottom.
What’s the moral of this story? Google and Live.com index at least the first 646KB of a page, Yahoo indexes (at a minimum) around 100KB, and Ask isn’t on the radar.
The whole concept of the nofollow attribute is definitely interesting. Its effectiveness is questionable, and as far as comment spam goes, it hasn’t helped (thank God for Akismet). As far as its use in the day-to-day operation of the web goes, it is somewhat unspoken about. I have always thought that links that use nofollow should be indicated as such, so the user can tell whether the publisher of the website actually “believes” in that link. It is one thing to link somewhere, but another thing to say I actually respect the linked website enough that I will give it my vote. It’s somewhat like democracy really: it’s one thing to complain about the current administration or government, but you don’t really have that right unless you actually vote.
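Surfacing those links is straightforward, since nofollow lives in the `rel` attribute of an anchor. A small Python sketch of the idea (the `NofollowFinder` class and the sample snippet are hypothetical; a browser extension or stylesheet could do the same job client-side):

```python
from html.parser import HTMLParser

class NofollowFinder(HTMLParser):
    """Collect the href of every link whose rel attribute includes nofollow."""
    def __init__(self):
        super().__init__()
        self.nofollow_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        # rel is a space-separated list of tokens, so split before checking.
        rel = (attrs.get("rel") or "").lower().split()
        if "nofollow" in rel and "href" in attrs:
            self.nofollow_links.append(attrs["href"])

snippet = (
    '<p><a href="http://example.com/spam" rel="nofollow">dodgy</a> '
    '<a href="http://example.com/good">endorsed</a></p>'
)
finder = NofollowFinder()
finder.feed(snippet)
print(finder.nofollow_links)  # only the nofollow link is collected
```

Once you can pick those links out, “indicating” them to the reader is just a matter of styling them differently from links the publisher actually vouches for.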
Anyway, there is a pretty broad assumption that search engines respect nofollow completely – never using that link in the PageRank algorithm. Duncan Riley from TechCrunch suggests that it has pretty much solved the problem of people gaming Wikipedia, which is feasible, but who is to say that Google actually ignores those links? Wikipedia’s content is generally very good, and I think it is just far too good for any ranking algorithm to ignore.
We know there are 2.345 million variables (this number may be wrong) that Google uses to rank websites, so why wouldn’t it use links from Wikipedia as another indicator of relevance? The obvious problem this approach encounters is verifying the integrity of the links. This could be done relatively accurately by time: if the link has been there for a month, then include it in the index.
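The age-based filter above can be sketched in a few lines. This is purely illustrative of the idea, assuming a hypothetical crawler that records when each outbound link was first observed on a page:

```python
from datetime import date, timedelta

def trusted_links(first_seen: dict[str, date], today: date,
                  min_age: timedelta = timedelta(days=30)) -> list[str]:
    """Keep only links that have survived on the page for at least min_age."""
    return [url for url, seen in first_seen.items()
            if today - seen >= min_age]

# Hypothetical crawl data: first-observed dates for two outbound links.
first_seen = {
    "http://example.com/old": date(2007, 1, 1),
    "http://example.com/new": date(2007, 2, 20),
}
print(trusted_links(first_seen, today=date(2007, 3, 1)))
```

A spammed link that gets cleaned up within the month never makes it into the index, while an edit that survives a month of Wikipedia’s scrutiny probably deserves its vote.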
I would love to hear your thoughts on this.