14th November 2007

Google Indexing Versus Caching

By : Nidhi Gupta

The crucial difference between indexing and caching is that indexing means making something searchable, while caching means reprinting content. Google’s library scanning program makes books searchable in Google Print, but it does not reprint them.

Indexing the words on a web page isn’t much different from indexing the words on a printed page. If you publish a site, Google reads the whole site into its index and then lets you find things in it. Generally, people who publish sites know this and want Google to do it.

Google Indexing

Google’s index and its cache are two different things, and it’s absolutely critical that they not be confused.

When any search engine visits a web page, it effectively makes a copy of that page, which is stored in the index. But the index literally breaks the page apart. It stores where words were located, whether they were in bold, what other words they were near, whether they were in a hyperlink, and so on.

Nothing in the index is anything you as a human being could read. The index is sometimes described as a “big book of the web,” but it isn’t one, really. It’s more like a giant spreadsheet: all the words of a page sit in one row, each word in its own column, with the next page in the row below, and so on.
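To make the spreadsheet picture concrete, here is a minimal sketch of a positional index in Python. It is an illustration only, not how Google or any other search engine actually stores its data: the function names `build_index` and `search`, the example URLs, and the choice to record only word positions (leaving out bold text, hyperlinks and the rest) are all assumptions made for the example.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each word to the pages and positions where it occurs.

    `pages` is a dict of {url: text}. The result has the shape
    {word: {url: [position, ...]}}: it records where words sit,
    but it does not keep any page in a readable form.
    """
    index = defaultdict(dict)
    for url, text in pages.items():
        # Lowercase the text and split it into simple word tokens.
        words = re.findall(r"[a-z0-9']+", text.lower())
        for position, word in enumerate(words):
            index[word].setdefault(url, []).append(position)
    return dict(index)


def search(index, word):
    """Return the sorted URLs of the pages that contain `word`."""
    return sorted(index.get(word.lower(), {}))


# Hypothetical pages, used only to exercise the sketch.
pages = {
    "example.com/a": "Google indexes the web",
    "example.com/b": "The cache keeps a copy of the web page",
}
index = build_index(pages)
print(search(index, "web"))
print(index["the"])
```

Looking up a word tells you which pages contain it and where, which is all a search needs. Nothing in `index` can be read back as a page.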

Google Cache

Aside from the index, Google, Yahoo, MSN and Ask Jeeves also make “cached” copies of pages available. You can see a copy of the exact page the search engine spidered. These cached pages are kept separate from the index. They are useful when a page is down, or when a copyright holder wants to see whether someone has stolen their content and cloaked it to feed to a spider. But the legality of showing such cached pages is also in question. To date, no one has challenged them in court. The reason seems to be that Google, which mainstreamed cached copies, lets site owners opt out of caching if they want.

All major search engines also let you opt out of being in their indexes, which is a completely different thing and another reason why the index shouldn’t be confused with the cache. To take Google as an example, you can:

  • Have your page listed in the index (available to be found through searches) and have your page available as a cached copy
  • Have your page listed in the index but not cached
  • Have your page NOT listed in the index and thus also not cached.
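The article does not say how a site owner makes these choices. One widely used mechanism is the robots meta tag, whose `noindex` and `noarchive` values Google honors. The sketch below maps the three options above to the tag a page would carry. The helper name `robots_meta` and its two parameters are hypothetical, chosen only for this illustration.

```python
def robots_meta(indexed, cached):
    """Return the robots meta tag for a page, or "" if none is needed.

    indexed: the page should be findable through searches.
    cached:  the search engine may show a cached copy of the page.
    """
    if not indexed:
        # A page left out of the index is not offered as a cached copy either.
        return '<meta name="robots" content="noindex">'
    if not cached:
        # Stay searchable, but ask the engine not to show a cached copy.
        return '<meta name="robots" content="noarchive">'
    # Indexed and cached is the default behavior, so no tag is required.
    return ""


print(robots_meta(indexed=True, cached=False))
```

The tag goes in the page’s `<head>`. Keeping the two settings separate mirrors the point of the list: being in the index and being available from the cache are separate decisions.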

The ability to opt out of the index is another reason why we really haven’t had a major search engine sued over web search indexing. In addition, site owners generally want to be indexed, so they can get traffic. In fact, the reason so many are upset over the current indexing update at Google is that they feel the changes are causing them to lose traffic. But whether this type of indexing (as opposed to caching) is LEGAL still hasn’t really been tested.

So indexing and caching are NOT the same. Dave writes:

“Google clearly does not have the right to make a copy of the book and republish it without the permission of or compensation to the copyright owner. The publishers appear to be on the right side of this one, and while I’m not a lawyer, I can’t imagine that they won’t prevail in court.”

Here’s the thing. Google is NOT, repeat NOT, republishing copies of books that it scans out of libraries. This is a fundamental mistake that many people seem to be making.

Google is scanning books into an index, just as it spiders web pages and adds them to its index. It is making the books searchable by doing this, but that process does not republish the books in a way you can read.

Think about it in web search terms. You can find a matching book, but there’s NO hyperlink to click on that will take you to an online version of the book itself. There’s just a snippet — maybe — of the text surrounding the words matching what you looked for.
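As a rough illustration of what “a snippet of the text surrounding the words” means, the sketch below returns a few words on either side of the first match. It is a toy under stated assumptions, not Google Print’s actual method: the function name `snippet`, the `radius` parameter and the sample sentence are all made up for the example.

```python
def snippet(text, query, radius=3):
    """Return up to `radius` words on each side of the first match of `query`.

    Returns None when the query word does not appear in the text.
    """
    words = text.split()
    for i, word in enumerate(words):
        # Ignore surrounding punctuation and letter case when comparing.
        if word.strip(".,;:!?\"'").lower() == query.lower():
            start = max(0, i - radius)
            end = min(len(words), i + radius + 1)
            return " ".join(words[start:end])
    return None


sample = "the quick brown fox jumps over the lazy dog"
print(snippet(sample, "fox", radius=2))
```

The searcher sees enough context to judge whether the book is relevant, but never the whole text.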

Want the actual book? Google Print won’t give it to you. Instead, you have to go someplace and buy it or find it in a library. Google Print merely tells you the book may be what you’re looking for.

The only exception to this is if a publisher OPTS IN. Not opt-out. For books that are in copyright, Google will display some of the actual book only if the publisher chooses. The exact amount is left up to the publisher.

So, to recap: indexing means making a book (or web page) searchable, while caching means making a page (or a book) viewable online without having to go to the source material (the book or the page).


This entry was posted on Wednesday, November 14th, 2007 at 1:49 am and is filed under SEO/Search Engine News.
