Learn how Google discovers, crawls, and serves web pages
When you sit down at your computer and do a Google search, you're almost instantly presented with a list of results from all over the web. How does Google find web pages matching your query, and determine the order of search results?
In the simplest terms, you could think of searching the web as looking in a very large book with an impressive index telling you exactly where everything is located. When you perform a Google search, our programs check our index to determine the most relevant search results to be returned ("served") to you.
The three key processes in delivering search results to you are:
- Crawling: Does Google know about your site? Can we find it?
- Indexing: Can Google index your site?
- Serving: Does the site have good and useful content that is relevant to the user's search?
Crawling
Crawling is the process by which Googlebot discovers new and updated pages to be added to the Google index.
We use a huge set of computers to fetch (or "crawl") billions of pages on the web. The program that does the fetching is called Googlebot (also known as a robot, bot, or spider). Googlebot uses an algorithmic process: computer programs determine which sites to crawl, how often, and how many pages to fetch from each site.
Google's crawl process begins with a list of web page URLs, generated from previous crawl processes and augmented with Sitemap data provided by webmasters. As Googlebot visits each of these websites, it detects links on each page and adds them to its list of pages to crawl. New sites, changes to existing sites, and dead links are noted and used to update the Google index.
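For reference, a Sitemap is a small XML file listing the URLs on your site that you'd like Googlebot to know about. A minimal example in the standard sitemaps.org format might look like the following (the URL and date are placeholders):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <!-- One <url> entry per page you want to list -->
        <loc>http://www.example.com/</loc>
        <lastmod>2012-06-01</lastmod>
      </url>
    </urlset>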
Google doesn't accept payment to crawl a site more frequently, and we keep the search side of our business separate from our revenue-generating AdWords service.
Indexing
Googlebot processes each of the pages it crawls in order to compile a massive index of all the words it sees and their location on each page. In addition, we process information included in key content tags and attributes, such as Title tags and ALT attributes. Googlebot can process many, but not all, content types. For example, we cannot process the content of some rich media files or dynamic pages.
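As a simple illustration (not an exhaustive list of what Googlebot reads), the Title tag and ALT attribute mentioned above look like this in a page's HTML; the file name and wording are hypothetical:

    <html>
      <head>
        <!-- The Title tag describes the page as a whole -->
        <title>Fresh raspberry recipes</title>
      </head>
      <body>
        <!-- The ALT attribute describes an image in plain text -->
        <img src="raspberry-tart.jpg" alt="Photo of a raspberry tart">
      </body>
    </html>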
Serving results
When a user enters a query, our machines search the index for matching pages and return the results we believe are the most relevant to the user. Relevancy is determined by over 200 factors, one of which is the PageRank for a given page. PageRank is a measure of the importance of a page based on the incoming links from other pages. In simple terms, each link to a page on your site from another site adds to your site's PageRank. Not all links are equal: Google works hard to improve the user experience by identifying spam links and other practices that negatively impact search results. The best types of links are those that are given based on the quality of your content.
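For intuition only, the sketch below implements the classic published PageRank iteration, in which a page's score is a damped sum of the scores of the pages linking to it, split across each linking page's outgoing links. This is not Google's production ranking, which uses many more signals than links alone, and the three-page link graph and damping factor are illustrative assumptions:

    # Minimal PageRank sketch, for intuition only; not Google's production algorithm.
    def pagerank(links, damping=0.85, iterations=50):
        """links maps each page to the list of pages it links to."""
        pages = set(links) | {p for targets in links.values() for p in targets}
        rank = {page: 1.0 / len(pages) for page in pages}
        for _ in range(iterations):
            # Every page keeps a small base score, plus shares from its inbound links.
            new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
            for page, targets in links.items():
                if targets:
                    share = damping * rank[page] / len(targets)
                    for target in targets:
                        new_rank[target] += share
            rank = new_rank
        return rank

    # Hypothetical three-page web: two pages link to "home", so it scores highest.
    example = {"home": ["about"], "about": ["home"], "blog": ["home"]}
    print(pagerank(example))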
In order for your site to rank well in search results pages, it's important to make sure that Google can crawl and index your site correctly. Our Webmaster Guidelines outline some best practices that can help you avoid common pitfalls and improve your site's ranking.
Google's Did you mean and Google Autocomplete features are designed to help users save time by displaying related terms, common misspellings, and popular queries. Like our google.com search results, the keywords used by these features are automatically generated by our web crawlers and search algorithms. We display these predictions only when we think they might save the user time. If a site ranks well for a keyword, it's because we've algorithmically determined that its content is more relevant to the user's query.
Googlebot
Googlebot is Google's web crawling bot (sometimes also called a "spider"). Crawling is the process by which Googlebot discovers new and updated pages to be added to the Google index.
We use a huge set of computers to fetch (or "crawl") billions of pages on the web. Googlebot uses an algorithmic process: computer programs determine which sites to crawl, how often, and how many pages to fetch from each site.
Googlebot's crawl process begins with a list of webpage URLs, generated from previous crawl processes and augmented with Sitemap data provided by webmasters. As Googlebot visits each of these websites, it detects links (SRC and HREF) on each page and adds them to its list of pages to crawl. New sites, changes to existing sites, and dead links are noted and used to update the Google index.
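As a rough illustration of what detecting SRC and HREF links means, the sketch below collects those attribute values from a single fetched page using Python's standard html.parser module. The URL is a placeholder, and a real crawler does far more (resolving relative URLs, deduplicating, respecting robots.txt, scheduling revisits):

    # Rough sketch of link discovery: collect HREF and SRC values from one page.
    from html.parser import HTMLParser
    from urllib.request import urlopen

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            # attrs is a list of (name, value) pairs for the tag's attributes.
            for name, value in attrs:
                if name in ("href", "src") and value:
                    self.links.append(value)

    page = urlopen("http://www.example.com/").read().decode("utf-8", errors="replace")
    collector = LinkCollector()
    collector.feed(page)
    print(collector.links)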
For webmasters: Googlebot and your site
How Googlebot accesses your site
For most sites, Googlebot shouldn't access your site more than once every few seconds on average. However, due to network delays, it's possible that the rate will appear to be slightly higher over short periods. In general, Googlebot should download only one copy of each page at a time. If you see that Googlebot is downloading a page multiple times, it's probably because the crawler was stopped and restarted.
Googlebot was designed to be distributed on several machines to improve performance and scale as the web grows. Also, to cut down on bandwidth usage, we run many crawlers on machines located near the sites they're indexing in the network. Therefore, your logs may show visits from several machines at google.com, all with the user-agent Googlebot. Our goal is to crawl as many pages from your site as we can on each visit without overwhelming your server's bandwidth. Request a change in the crawl rate.
Blocking Googlebot from content on your site
It's almost impossible to keep a web server secret by not publishing links to it. As soon as someone follows a link from your "secret" server to another web server, your "secret" URL may appear in the referrer tag and can be stored and published by the other web server in its referrer log. Similarly, the web has many outdated and broken links. Whenever someone publishes an incorrect link to your site or fails to update links to reflect changes in your server, Googlebot will try to crawl that incorrect URL on your site.
If you want to prevent Googlebot from crawling content on your site, you have a number of options, including using robots.txt to block access to files and directories on your server.
Once you've created your robots.txt file, there may be a small delay before Googlebot discovers your changes. If Googlebot is still crawling content you've blocked in robots.txt, check that the robots.txt is in the correct location. It must be in the top directory of the server (e.g., www.myhost.com/robots.txt); placing the file in a subdirectory won't have any effect.
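For example, a robots.txt file at the root of your host that blocks Googlebot from one directory (the host and directory names here are placeholders) could look like this:

    # http://www.example.com/robots.txt -- must live at the top level of the host
    User-agent: Googlebot
    Disallow: /private/

    # All other crawlers may fetch everything
    User-agent: *
    Disallow: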
If you just want to prevent the "file not found" error messages in your web server log, you can create an empty file named robots.txt. If you want to prevent Googlebot from following any links on a page of your site, you can use the nofollow meta tag. To prevent Googlebot from following an individual link, add the rel="nofollow" attribute to the link itself.
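In HTML, those two options (page-level and link-level) look like the following; the URL and link text are placeholders:

    <!-- Page-level: ask crawlers not to follow any links on this page -->
    <meta name="robots" content="nofollow">

    <!-- Link-level: ask crawlers not to follow this one link -->
    <a href="http://www.example.com/signup" rel="nofollow">Sign up</a>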
Here are some additional tips:
- Test that your robots.txt is working as expected. The Test robots.txt tool on the Blocked URLs page (under Health) lets you see exactly how Googlebot will interpret the contents of your robots.txt file. The Google user-agent is (appropriately enough) Googlebot. For a scripted sanity check, see the sketch after this list.
- The Fetch as Google tool in Webmaster Tools helps you understand exactly how your site appears to Googlebot. This can be very useful when troubleshooting problems with your site's content or discoverability in search results.
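If you want to complement the Test robots.txt tool with a quick local check, Python's standard urllib.robotparser applies the same basic robots.txt rules; it is not guaranteed to match Googlebot's interpretation in every edge case, and the URLs below are placeholders:

    # Quick local check of robots.txt rules; treat the Webmaster Tools test as
    # authoritative, since Googlebot's own parser may differ in edge cases.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")
    rp.read()

    print(rp.can_fetch("Googlebot", "http://www.example.com/private/page.html"))
    print(rp.can_fetch("Googlebot", "http://www.example.com/index.html"))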
Making sure your site is crawlable
Googlebot discovers sites by following links from page to page. The Crawl errors page in Webmaster Tools lists any problems Googlebot found when crawling your site. We recommend reviewing these crawl errors regularly to identify any problems with your site.
If you're running an AJAX application with content that you'd like to appear in search results, we recommend reviewing our proposal on making AJAX-based content crawlable and indexable.
If your robots.txt file is working as expected but your site isn't getting traffic, there are several possible reasons why your content is not performing well in search.
Problems with spammers and other user-agents
The IP addresses used by Googlebot change from time to time. The best way to identify accesses by Googlebot is to use the user-agent (Googlebot). You can verify that a bot accessing your server really is Googlebot by using a reverse DNS lookup.
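A minimal sketch of that reverse-then-forward DNS check, using Python's standard socket module: the reverse lookup of the visiting IP should return a googlebot.com (or google.com) host name, and a forward lookup of that name should return the same IP. The IP address shown is only an example of the kind you'd take from your own logs:

    # Verify a claimed Googlebot visit with a reverse DNS lookup plus a
    # forward lookup to confirm the host name maps back to the same IP.
    import socket

    def is_googlebot(ip_address):
        try:
            host_name = socket.gethostbyaddr(ip_address)[0]
        except socket.herror:
            return False
        if not host_name.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            return socket.gethostbyname(host_name) == ip_address
        except socket.gaierror:
            return False

    print(is_googlebot("66.249.66.1"))  # example IP taken from a server log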
Googlebot and all respectable search engine bots will respect the directives in robots.txt, but some nogoodniks and spammers do not. Report spam to Google.
Google has several other user-agents, including Feedfetcher (user-agent Feedfetcher-Google). Since Feedfetcher requests come from explicit action by human users who have added the feeds to their Google home page or to Google Reader, and not from automated crawlers, Feedfetcher does not follow robots.txt guidelines. You can prevent Feedfetcher from crawling your site by configuring your server to serve a 404, 410, or other error status message to user-agent Feedfetcher-Google. More information about Feedfetcher.
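One way to do that on an Apache server with mod_rewrite enabled (a common setup, but check your own server's documentation for the equivalent) is a rule like the following, which answers any request whose user-agent contains Feedfetcher-Google with a 410 Gone status:

    # Apache example (e.g. in .htaccess), assuming mod_rewrite is enabled.
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} Feedfetcher-Google
    RewriteRule .* - [G,L]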
