Google’s Crawl

By: stefano

Google crawls the Web at varying depths and on more than one schedule. The so-called deep crawl occurs roughly once a month. This extensive reconnaissance of Web content requires more than a week to complete and an undisclosed length of time after completion to build the results into the index. For this reason, it can take up to six weeks for a new page to appear in Google. Brand-new sites at new domain addresses that have never been crawled before might not even be indexed at first.

If Google relied entirely on the deep crawl, its index would quickly become outdated in the rapidly shifting Web. To stay current, Google launches various supplemental fresh crawls that skim the Web more shallowly and frequently than the deep crawl. These supplementary spiders do not update the entire index, but they freshen it by updating the content of some sites. Google does not divulge its fresh-crawling schedules or targets, but Webmasters can get an indication of the crawl’s frequency through close observation.
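One way to do that observing is to count spider visits in your own server logs. The sketch below is only an illustration: it assumes your server writes the common "combined" access log format and that the spider identifies itself with a user agent containing "Googlebot". The function name, the sample log lines, and the IP addresses are made up for the example.

```python
import re
from collections import Counter

# Assumed layout: combined log format, e.g.
# IP - - [10/Mar/2004:06:25:01 +0000] "GET / HTTP/1.1" 200 5120 "referrer" "user agent"
LOG_LINE = re.compile(
    r'\[(?P<ts>[^\]]+)\] "(?:GET|HEAD) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_hits(lines):
    """Count requests per (day, path) whose user agent mentions Googlebot."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match or "Googlebot" not in match.group("agent"):
            continue
        # Keep only the date part of the timestamp, e.g. "10/Mar/2004".
        day = match.group("ts").split(":")[0]
        hits[(day, match.group("path"))] += 1
    return hits

# Hypothetical sample data; real logs would be read from a file.
sample = [
    '192.0.2.10 - - [10/Mar/2004:06:25:01 +0000] "GET / HTTP/1.1" 200 5120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.10 - - [11/Mar/2004:07:10:44 +0000] "GET / HTTP/1.1" 200 5133 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.99 - - [11/Mar/2004:08:00:00 +0000] "GET /inner.html HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
]

hits = googlebot_hits(sample)
for (day, path), count in sorted(hits.items()):
    print(day, path, count)
```

Run over a few weeks of logs, a tally like this shows which pages the spider touches and how often. A user-agent string can be faked, so treat the numbers as an indication rather than proof.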

Google has no obligation to touch any particular URL with a fresh crawl. Sites can increase their chance of being crawled often, however, by changing their content and adding pages frequently. Remember the shallowness aspect of the fresh crawl; Google might dip into the home page of your site (the front page, or index page) but not dive into a deep exploration of the site’s inner pages.

(More than once I’ve observed a new index page of my site in Google within a day of my updating it, while a new inner page added at the same time was missing.) But Google’s spider can compare previous crawl results with the current crawl, and if it learns from the top navigation page that new content is added regularly, it might start crawling the entire site during its frequent visits.
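To picture the difference between a shallow visit and a full one, here is a toy sketch of a breadth-first walk over a site's links with an optional depth limit. It is not Google's algorithm, which is not public. The `crawl` function, the `max_depth` parameter, and the little `site` map are all invented for illustration.

```python
from collections import deque

def crawl(links, start, max_depth=None):
    """Walk a link graph breadth-first, optionally stopping max_depth clicks from start."""
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        # At the depth limit, record the page but do not follow its links.
        if max_depth is not None and depth >= max_depth:
            continue
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append((target, depth + 1))
    return order

# Hypothetical site: the home page links to two sections, each with one inner page.
site = {
    "/": ["/news", "/products"],
    "/news": ["/news/2004-03"],
    "/products": ["/products/widget"],
}

shallow = crawl(site, "/", max_depth=0)  # a quick dip: home page only
deep = crawl(site, "/")                  # no limit: every reachable page

print(shallow)
print(deep)
```

With the limit at zero the walk stops at the home page, which matches the situation described above: the updated index page is seen, while the new inner page waits for a more thorough visit.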

The deep crawl is more automatic and mindlessly thorough than the fresh crawl. Chances are good that in a deep crawl cycle, any URL already in the main index will be reassessed down to its last page. However, Google does not necessarily include every page of a site. As usual, the reasons and formulas involved in excluding certain pages are not divulged.

The main fact to remember is that Google applies PageRank considerations to every single page, not just to domains and top pages. If a specific page is important to you and is not appearing in Google search results, your task is to apply every networking and optimization tactic described in Chapter 3 to that page. You may also manually submit that specific page to Google.

The terms deep crawl and fresh crawl are widely used in the online marketing community to distinguish between the thorough spidering of the Web that Google launches approximately monthly and various intermediate crawls run at Google’s discretion.

Google itself acknowledges both levels of spider activity, but is secretive about exact schedules, crawl depths, and formulas by which the company chooses crawl targets. To a large extent, targets are determined by automatic processes built into the spider’s programming, but humans at Google also direct the spider to specific destinations for various reasons.

Two factors work against the index remaining unchanged for long. First, the frequency of fresh crawls keeps the index evolving in a state that Google-watchers call everflux. Second, some time is required to put crawl results into the index on Google’s thousands of servers. The irregular heaving and churning of the index that results from these two factors is called the Google dance.
