The Business Forum   

"It is impossible for ideas to compete in the marketplace if no forum for
  their presentation is provided or available."           Thomas Mann, 1896

The Business Forum Journal

 

Using XML Sitemaps

By Bruce Clay

 White Paper

Synopsis:

An XML Sitemap provides a bulk list of Web page URLs to the search engines, and is a file created according to a standard protocol. Webmasters can use XML Sitemaps to proactively alert search engines to the full depth of their Web sites. The site stands to gain by increasing the number of Web pages indexed, which in turn can lead to improved rankings for more keywords in the search engine results pages.

This paper covers the purpose and definition of XML Sitemaps, detailed instructions for creating and submitting them to the search engines, and information on the various types of specialized Sitemap files.

Purpose of an XML Sitemap

Making sure the search engines have a Web site in their indexes is the first essential step of any search engine optimization project. If the search engines don’t even know a Web site exists, then it’s futile to try to optimize it. That would be like trying to paint a picture without paint, or throw a dinner party without food. Job number one is to make sure the search engine spiders come to the Web site, crawl its pages and include them in the search engine’s index. And not just a few pages — having all of a site’s indexable pages included in the search engine’s index positively affects search engine rankings. Not only does the increased content allow for more potential matches to search queries, but also the search engines generally see more depth of content as a sign of greater expertise, and award higher search engine placement accordingly.

The usual way that search engine spiders find a Web site is by following an inbound link coming from another site. Once the spider arrives at the new Web page, it may continue exploring the other pages in the site by following the site's internal navigation links. (This assumes that nothing blocks the search engines from crawling the site, such as an incorrect disallow command in the site's robots.txt file.) However, waiting passively for the search engines to find the site is not the fastest or most reliable method to get indexed. What's needed is a better way, a way that proactively feeds the information to the search engine spiders and invites them to come crawling.

The XML Sitemap was developed to answer this need. XML Sitemaps give webmasters an easy way to inform the search engines about their Web site pages. Google, Yahoo! and Bing, among other search engines, have all agreed to a common protocol for building an XML Sitemap, which is a convenience for webmasters. The official site containing the full protocol is http://sitemaps.org/

By enabling webmasters to alert search engines to their Web pages, XML Sitemaps can help those pages be indexed sooner and more completely.

The search engines do not guarantee they will index every link, but providing this information to them greatly improves a site’s chances. For SEO purposes, it is essential that site owners build a Sitemap and keep it updated. XML Sitemaps are especially valuable for:

  • New Web sites

  • Sites with few or no inbound links

  • Sites that launch a redesign or large-scale update

  • Sites that have only a small percentage of their pages indexed

  • Sites that have a large number of pages archived or not well linked together

  • Sites that cannot be easily crawled because they have dynamic content, a heavy use of Flash or AJAX, or poor internal navigation

XML Sitemaps Defined

The following excerpt from http://sitemaps.org/ explains what an XML Sitemap is in a nutshell: "In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site) so that search engines can more intelligently crawl the site."

An XML Sitemap is merely a list of the pages on a Web site, with some optional metadata about them that helps spiders crawl the site more intelligently. It is in XML (Extensible Markup Language) format, which is code rather than text: unreadable by most humans, but very efficient for search engine spiders.

An XML Sitemap can contain literally thousands of page URLs (up to 50,000 according to Google’s guidelines). Its purpose is to give the search engine a complete picture of the Web site so that it can be crawled more fully. The best use of an XML Sitemap is to include all of the pages the webmaster wants to be indexed on the site.


Note: In addition to the general Sitemap discussed so far, Google has defined some specialized Sitemaps for mapping links to specific types of content such as news, video, and geospatial content.

These will be covered later under the heading "Specialized Sitemaps."

XML Sitemap vs. HTML Site Map

XML Sitemaps should not be confused with traditional HTML site maps. Often sites have a "Site map" link to a page that looks like a Web site table of contents (see http://www.apple.com/sitemap/ for an example). People can use this kind of site map to see how a site is organized and locate hard-to-find pages, which is a function XML Sitemaps cannot offer.

An HTML site map:

  • Is primarily intended for the site’s human visitors, not spiders

  • Organizes the site’s contents into categories, rather than just as an unstructured list

  • Contains no more than 100 links (per Google's "Design and Content Guidelines" at http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=35769)

  • Contains links to important site pages only, due to the limited number allowed

  • Uses anchor text for links (as opposed to straight URLs)

  • Can pass PageRank (i.e., link popularity) through the links, as opposed to an XML Sitemap which passes none

  • Reinforces site themes by the way links are categorized, or for larger sites, by the way links are divided between multiple HTML site map pages

Because of these differences, a Web site ideally should have an HTML site map as well as an XML Sitemap. Though an HTML site map is primarily for users, it also benefits the site’s search engine optimization effort when search engine spiders crawl a site’s HTML site map page. Spiders can follow the links and index the important site pages, if any have been missed. But that’s not where the real SEO value lies.

The real SEO advantage of having an HTML site map is in the link anchor text and organization. An HTML site map shows a search engine spider what a site is really about. A search engine tries to discover every Web site’s themes and keywords so it can deliver relevant results to the people searching for those things. Links on a site map can contain the keywords that identify exactly what each page is about. A link such as “Hot Air Balloon Supplies” communicates a lot better than a URL path alone could.

So to summarize, a Web site should have both — an XML Sitemap for full site coverage, and an HTML site map for establishing themes — to provide search engines with the strongest, most complete understanding of the site.

Building an XML Sitemap

Since only search engine spiders see XML Sitemaps, the code can be extremely efficient. Formatting issues like font size and color are not a concern. The point is to feed the spiders a batch of page links as smoothly as possible, so they will crawl as much as possible through the site.

The way the XML code is written matters. To ensure that the Sitemap can be easily read by the majority of search engines, it should conform to the accepted Sitemap protocol at http://sitemaps.org/.

Google provides easy instructions on how to create a Sitemap based on the Sitemap protocol. There are two options: create it manually (explained at https://www.google.com/support/webmasters/bin/answer.py?answer=34657) or use an automatic Sitemap generator.


Creating a Sitemap Manually

The format for creating a Sitemap file is not complicated. It is a text file saved with a .xml extension. After that's done, filling in the list of URLs and optional metadata follows a repetitive structure. Figure 2 shows a simple example that has only one page URL.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2009-03-21</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>

 

Two lines of code must appear once at the top of the file. They are:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

One line of code must appear at the very bottom of the file. This code is as follows:

</urlset>

Between the top and bottom code are all of the URL entries. Each URL must contain, at a minimum, the following three lines:

<url>

<loc>http://www.yourdomain.com/</loc>

</url>

The <loc> tag identifies the URL where the page is located. In addition, for every URL the XML Sitemap can provide three optional tags as follows:

<lastmod>    The last modified date for this page
<changefreq> How often the page changes (e.g., hourly, daily, monthly, never)
<priority>   How important the page is, from 0 (lowest) to 1 (highest)

The example in Figure 2 shows a URL entry with all required and optional tags included. The search engines may consult the additional tags when deciding how often to spider a site, so using them provides another way to potentially improve the site's spiderability.
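For webmasters comfortable with a little scripting, this repetitive structure lends itself to automation. The Python sketch below (the function name and sample entries are illustrative, not part of the protocol) assembles a Sitemap string from a list of page records:

```python
from xml.sax.saxutils import escape

def build_sitemap(entries):
    """Build an XML Sitemap string from a list of entry dicts.

    Each dict needs a 'loc' key; 'lastmod', 'changefreq', and
    'priority' are optional, per the sitemaps.org protocol.
    """
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for entry in entries:
        lines.append('<url>')
        # Escape &, <, > so URLs with query strings stay valid XML.
        lines.append('<loc>%s</loc>' % escape(entry['loc']))
        for tag in ('lastmod', 'changefreq', 'priority'):
            if tag in entry:
                lines.append('<%s>%s</%s>' % (tag, escape(str(entry[tag])), tag))
        lines.append('</url>')
    lines.append('</urlset>')
    return '\n'.join(lines)

sitemap = build_sitemap([
    {'loc': 'http://www.example.com/', 'lastmod': '2009-03-21',
     'changefreq': 'monthly', 'priority': 0.8},
])
print(sitemap)
```

The same loop scales to thousands of entries read from a database or a crawl log, which is essentially what the Sitemap generators described below do.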

Auto-Generating an XML Sitemap

For large sites or for site owners who don’t want to type out all of their page entries manually, there is an easier way. Many Sitemap generators are available that will “spider” a Web site and then build the XML file automatically. Two services that are well respected are:

GSiteCrawler is available for free download from http://gsitecrawler.com/ and is widely used (for sites operating on Windows servers only).

Google Sitemap Generator is provided by Google. Officially still in beta, this script requires that the Web server have Python installed.

For more details, see https://www.google.com/support/webmasters/bin/answer.py?answer=34634&cbid=110jzemo1voyn&src=cb&lev=answer

Many additional third-party Sitemap generators can be found at
http://code.google.com/p/sitemap-generators/wiki/SitemapGenerators.

No matter which Sitemap generator is selected, it's important to set up the tool carefully. Proper settings will keep any pages that the site owner does not want indexed out of the Sitemap, and communicate the appropriate information in the optional tags regarding how frequently the page changes (<changefreq>) and how important the page is (<priority>).

For an explanation of these optional tags, see the previous section, "Creating a Sitemap Manually."

Guidelines for Building an XML Sitemap

Certain guidelines must be followed when building an XML Sitemap. The list below is not exhaustive; see Google's helpful page at http://www.google.com/support/webmasters/bin/answer.py?answer=34654&cbid=-591fmsq0cxzj&src=cb&lev=answer for additional information.

Size restrictions: A Sitemap file cannot have more than 50,000 URLs and cannot be larger than 10MB in size when uncompressed. Very large Web sites that would exceed these restrictions should divide their pages into smaller Sitemap files. The smaller files should be linked to from a single Sitemap that serves as a Sitemap index. Be aware that the maximum number of Sitemap files that any one Web site can have is 1,000. Recommendations for submitting either all of the Sitemap files or just the index file to the search engines are explained in the "Submitting an XML Sitemap" section.
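A Sitemap index follows the same XML conventions as a regular Sitemap, but uses <sitemapindex> and <sitemap> tags in place of <urlset> and <url>, per the sitemaps.org protocol. A minimal sketch (the file names below are hypothetical):

```
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.yourdomain.com/sitemap1.xml</loc>
    <lastmod>2009-03-21</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.yourdomain.com/sitemap2.xml</loc>
  </sitemap>
</sitemapindex>
```

Each <loc> here points to one of the smaller Sitemap files rather than to a Web page.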

For more information on multiple Sitemaps, see http://www.google.com/support/webmasters/bin/answer.py?answer=35654

URL syntax: Each URL must be a fully qualified link, which means it must spell out the complete Web site address that should be crawled. In addition, URLs listed in the Sitemap must be consistent in how they refer to the site path. For example, there should not be URLs starting with "http://www.yourdomain.com/" and others beginning with "http://yourdomain.com/".

No image URLs: Direct image URLs should not be included in a Sitemap, since the search engines index the HTML page that the image appears on, not the image itself.

No session IDs: If the site's URLs include session IDs (a type of parameter that a content management system may append to a URL when delivering a requested page), they should be stripped out for the XML Sitemap.

Must be readable: The Sitemap must be readable by the Web server where the Sitemap is located, and may contain only ASCII characters. If the XML Sitemap contains upper ASCII characters, certain control codes, or special characters (such as * or {}), the file will generate an error and not be read successfully.
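A rough pre-flight check for the readability requirements above can be scripted. This Python sketch (a convenience check only, not a substitute for the search engines' own validation) flags non-ASCII characters and malformed XML:

```python
import xml.etree.ElementTree as ET

def check_sitemap(xml_text):
    """Rough pre-flight check for a Sitemap: ASCII only, well-formed XML.

    Returns a list of problem descriptions; an empty list means it passed.
    """
    problems = []
    try:
        xml_text.encode('ascii')
    except UnicodeEncodeError:
        problems.append('contains non-ASCII characters')
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        problems.append('not well-formed XML: %s' % exc)
    return problems

good = ('<?xml version="1.0" encoding="UTF-8"?>'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        '<url><loc>http://www.example.com/</loc></url></urlset>')
print(check_sitemap(good))  # → []
```

In practice the file would be read from disk before checking; the inline string here just keeps the example self-contained.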

Submitting an XML Sitemap

Once the Sitemap has been created, it should be saved in the root directory of the Web site (for example, http://www.yourdomain.com/sitemap.xml). The next step is to invite the search engines to come and spider it. Since most Web sites are updated frequently with new pages and/or changed content, it is advisable to submit an updated XML Sitemap file regularly, as often as necessary to keep up with the frequency of updates.

There are three ways to point search engines to an XML Sitemap: robots.txt, HTTP request, or direct submission to the search engines. These may be used concurrently, if desired, to increase the chances that the search engine spiders will find the Sitemap promptly. The following sections explain each method in detail.

Method 1: Robots.txt

Every Web site should place a text file named “robots.txt” in its root directory. Search engines begin crawling a Web site by reading the robots.txt file, because it gives search engines specific instructions regarding which sections of the site they should disallow, or not include, in their index. Webmasters can also direct spiders to the site’s XML Sitemap within this file.

A robots.txt file must follow rigid syntax rules in order to be compliant with the search engine spiders trying to read it. The command to specify an XML Sitemap is simply Sitemap: followed by the URL, which would look like this:

Sitemap: http://www.yourdomain.com/sitemap.xml

Note that the search engine Ask.com requires the robots.txt method for locating an XML Sitemap file. The other major search engines (Google, Yahoo! and Microsoft Bing) all support this method in addition to the ones listed below.

To read the complete instructions for writing robots.txt files, go to http://www.robotstxt.org
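Putting the pieces together, a minimal robots.txt that blocks one directory from crawling and announces the Sitemap might look like this (the /private/ path is hypothetical):

```
User-agent: *
Disallow: /private/

Sitemap: http://www.yourdomain.com/sitemap.xml
```

The Sitemap: line is independent of any User-agent block, so it can appear anywhere in the file.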

Method 2: HTTP Request

Another way to submit an XML Sitemap to a search engine is through an HTTP request. This is a technical solution that uses wget, curl, or another mechanism, and can be set up as an automated job that generates and submits an updated Sitemap file on a regular basis.

The HTTP request would be issued in the following formats for each of the major search engines.

Google:

http://www.google.com/webmasters/tools/ping?sitemap=http://www.yourdomain.com/sitemap.xml

Yahoo!:

http://siteexplorer.search.yahoo.com/webmasters/tools/ping?sitemap=http://www.yourdomain.com/sitemap.xml

Microsoft Bing:

http://www.bing.com/webmaster/webmasters/tools/ping?sitemap=http://www.yourdomain.com/sitemap.xml

The URL following the equal sign (=) in each request should identify the Sitemap file location. For sites that have multiple Sitemaps connected with a Sitemap index file (as discussed in the "Guidelines for Building an XML Sitemap" section), only the Sitemap index file should be sent in the request.

A successful request returns an HTTP 200 response code, which indicates only that the Sitemap information was received, not that it was validated in any way.
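As a sketch of how such an automated job might be built, the following Python snippet constructs and issues a ping request using the Google endpoint listed above (the Sitemap location is URL-encoded before being appended after the equal sign):

```python
import urllib.parse
import urllib.request

# Google's ping endpoint as listed in this paper; the other engines
# follow the same ?sitemap= pattern with their own hostnames.
PING_ENDPOINT = 'http://www.google.com/webmasters/tools/ping'

def build_ping_url(endpoint, sitemap_url):
    """Build the HTTP-request URL, URL-encoding the Sitemap location."""
    return endpoint + '?sitemap=' + urllib.parse.quote(sitemap_url, safe='')

def ping(endpoint, sitemap_url):
    """Issue the request; a 200 status means received, not validated."""
    with urllib.request.urlopen(build_ping_url(endpoint, sitemap_url)) as resp:
        return resp.status

print(build_ping_url(PING_ENDPOINT, 'http://www.yourdomain.com/sitemap.xml'))
```

A cron job could call ping() after each regeneration of the Sitemap file; checking for a 200 response confirms receipt only, as noted above.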

For more information, see http://www.sitemaps.org/protocol.php#submit_ping

Method 3: Direct Submission

Google, Yahoo! and Bing offer ways for a site owner to submit an XML Sitemap to them directly. This is a proactive approach that can be used in addition to putting it in a robots.txt file or an HTTP request.

Google: Submit a Sitemap to Google using Google Webmaster Tools at http://www.google.com/webmasters/tools/. Google Webmaster Tools is an incredibly valuable source of information that can help diagnose potential problems and provide a glimpse into the way Google views a Web site. The Google interface shows when Google last downloaded the Sitemap and any errors that may have occurred. Webmasters can validate their site, and also view information such as Web crawl errors (including pages that were not found or timed out), statistics (crawl rate, top search queries, etc.), and external and internal links. There are also other useful tools like the robots.txt analyzer.

Yahoo!: Submitting an XML Sitemap feed to Yahoo! simply requires entering the Sitemap's URL through Yahoo! Site Explorer at http://siteexplorer.search.yahoo.com/submit. Yahoo! made Sitemaps a little more confusing by introducing its own version that uses text files named urllist.txt. Many of the Sitemap generators also build a urllist.txt file simultaneously with the XML Sitemap feed. However, since Yahoo! also recognizes the Sitemaps protocol, it's enough to just provide a standard XML Sitemap and avoid having to update two files.
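For reference, a urllist.txt file is simply a plain-text list of fully qualified URLs, one per line, with no XML markup (the paths below are hypothetical):

```
http://www.yourdomain.com/
http://www.yourdomain.com/products.html
http://www.yourdomain.com/contact.html
```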

Bing: Microsoft has launched its own set of webmaster tools called Bing Webmaster Center at http://www.bing.com/webmaster. This site is similar to Google's, but not as robust. It allows webmasters to add an XML Sitemap feed and, after their site is validated, view information about their site. The information is currently limited to showing any blocked pages, top links, and robots.txt validation.

Specialized Sitemaps

Google now offers specialized XML Sitemaps for Video, Mobile, News, and Code Search (and more may be added in the future). These specialized XML Sitemaps allow a site owner to tell Google about particular types of content on their site: news articles, videos, pages designed for mobile devices, and publicly accessible source code on their Web site.

In turn, these special content pages may have a better chance of inclusion in Google’s specialized, vertical search engines.

Building a Video Sitemap

A Video Sitemap is useful for Web sites that contain videos. Submitting a Video Sitemap encourages Google to spider the site’s video content and hopefully rank it in video search results.

This format adheres to the standard Sitemap protocol, but includes additional video-specific tags. Once created, a Video Sitemap should be submitted to Google directly (see instructions under "Method 3: Direct Submission" above).

For an explanation of the special video tags and syntax required for a Video Sitemap, see Google’s article “Creating a Video Sitemap” http://www.google.com/support/webmasters/bin/answer.py?answer=80472


Building a Mobile Sitemap

Web sites that have pages developed specifically for mobile users can benefit from submitting a Mobile Sitemap to Google. This type of Sitemap uses the standard Sitemap protocol, but includes a specific <mobile> tag and an additional namespace requirement.
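As a sketch of the format, a Mobile Sitemap adds a mobile namespace declaration and an empty mobile tag inside each URL entry; confirm the exact namespace and tag names against Google's documentation linked below (the example URL is hypothetical):

```
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:mobile="http://www.google.com/schemas/sitemap-mobile/1.0">
  <url>
    <loc>http://mobile.yourdomain.com/article.html</loc>
    <mobile:mobile/>
  </url>
</urlset>
```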

Google’s article “Creating Mobile Sitemaps” http://www.google.com/support/webmasters/bin/answer.py?answer=34648 gives all of the particulars and technical details related to building a Mobile Sitemap.

Building a News Sitemap

By definition, “news” requires immediate distribution in order to be effective. Any Web site that develops a large amount of news-type content needs search engines to index fresh content as quickly as possible. A News Sitemap provides a proactive way for such a site to have more control over what is submitted to Google News, since it can hasten the search engine’s discovery of new pages.

A News Sitemap communicates the individual URLs of news articles together with their publication date and keywords. It requires a second namespace in addition to the schema requirements of the standard Sitemaps protocol.
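A sketch of a News Sitemap entry is shown below, based on the protocol as documented at the time of this writing; the URL and keyword values are hypothetical, and the current tag set should be confirmed against Google's documentation linked below:

```
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>http://www.yourdomain.com/news/local-story.html</loc>
    <news:news>
      <news:publication_date>2009-03-21</news:publication_date>
      <news:keywords>business, seo</news:keywords>
    </news:news>
  </url>
</urlset>
```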

To read full details on creating and submitting a News Sitemap to Google, start with the “News Sitemaps: Overview” article http://www.google.com/support/news_pub/bin/answer.py?answer=75717 and follow the links from there.

Building a Code Search Sitemap

The last type of specialized Google Sitemap is called a Code Search Sitemap. People sometimes search for publicly accessible source code using Google Code Search (http://www.google.com/codesearch). Submitting a Sitemap geared for this vertical engine may be appropriate for sites that want to be found for that type of content.

Detailed instructions for setting up a Code Search Sitemap can be found in Google’s article “About Code Search Sitemaps” http://www.google.com/support/webmasters/bin/answer.py?answer=75225  and the subsequent pages.
 

Bruce M. Clay is a Founding and Charter Member of The Business Forum. He has operated as an executive with several high-technology businesses, and comes from a long career as a technical executive with leading Silicon Valley firms, and since 1996 in the Internet Business Consulting arena. Bruce holds a BS in Math and Computer Science and an MBA from Pepperdine University, has had many articles published, has been a speaker at over 100 sessions including Search Engine Strategies, WebmasterWorld, ad:Tech, Search Marketing Expo, and many more, and has been quoted in the Wall Street Journal, USA Today, PC Week, Wired Magazine, Smart Money, several books, and many other publications. He has also been featured on many podcasts and WebmasterRadio shows, as well as appearing on the NHK one-hour TV special "Google's Deep Impact". He has personally authored many advanced search engine optimization tools that are available from the company Web sites. Bruce Clay is on the Board of Directors of SEMPO (the Search Engine Marketing Professionals Organization). In 1996 he founded Bruce Clay, Inc., which today is a leading provider of Internet marketing solutions around the world with offices located internationally in Los Angeles (headquarters), New York, Milan, Tokyo, New Delhi, Sao Paulo and Sydney.


Visit the Author's Web Site

http://www.bruceclay.com




 


 



© Copyright The Business Forum Institute 1982 - 2010  All rights reserved.