Sitemap Generator Obeys Noindex, Nofollow, Canonical and Robots

This desktop sitemap generator tool can scan websites, with optional support for obeying the robots.txt file, noindex and nofollow in meta tags, and nofollow in link tags.

Remember: To see all options in A1 Sitemap Generator you will have to switch off easy mode.

Sitemap Generator and Webmaster Crawl Filters

The website crawler in A1 Sitemap Generator has many tools and options to ensure it can scan complex websites. These include full support for the robots.txt file, noindex and nofollow in meta tags, and nofollow in link tags.

Tip: Downloading robots.txt will often make webservers and analytics software identify you as a robot website crawler.

[Screenshot: crawler options for robots.txt, noindex and nofollow]

In connection with these options, you can also control when they are applied:
  • Disable Scan website | Crawler options | Apply "webmaster" and "output" filters after website scan stops.
  • Enable Create sitemap | Document options | Remove URLs excluded by "webmaster" and "list" filters.

This is often useful if you intend to pause and resume crawls.

Note: You can also check the detected "state" flags related to webmaster filters: select the desired URL in Analyze website and view all the information in Page data.

[Screenshot: crawl filter state flags in Page data]


HTML Code for Canonical, NoIndex and NoFollow

  • Canonical:
    <link rel="canonical" href="http://www.example.com/list.php?sort=az" />
    Useful in cases where two different URLs serve the same content. Consider reading about duplicate URLs, as there may be better solutions than canonical instructions, e.g. redirects.

  • NoFollow:
    • <a href="http://www.example.com/" rel="nofollow">bad link</a>
    • <meta name="robots" content="nofollow" />

  • NoIndex:
    <meta name="robots" content="noindex" />
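These directives can also be combined in a single meta tag. For example, to keep a page out of the search index and stop crawlers from following its links:

    <meta name="robots" content="noindex, nofollow" />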



Include and Exclude List and Analysis Filters

You can read more about analysis and output filters in the A1 Sitemap Generator online help system.


Match Behavior and Wildcards Support in Robots.txt

The match behavior in the website crawler used by A1 Sitemap Generator is similar to that of most search engines.

Support for wildcard symbols in robots.txt file:
  • Standard: Matches from the beginning of the URL up to the length of the filter, i.e. a prefix match.
    gre will match: greyfox, greenfox and green/fox.
  • Wildcard *: Matches any sequence of characters until the rest of the filter can match again.
    gr*fox will match: greyfox, grayfox, growl-fox and green/fox.
    Tip: Wildcard filters in robots.txt are often configured incorrectly and are a common source of crawling problems. See the example below.
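Below is a minimal robots.txt sketch illustrating both match types; the paths are hypothetical examples:

    User-agent: *
    # Standard prefix match: blocks /private, /private/ and /private-files
    Disallow: /private
    # Wildcard match: blocks e.g. /list.php?sort=az and /listing/sort=date
    Disallow: /list*sort=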

The crawler in our sitemap generator tool will obey the following user agent IDs in the robots.txt file:
  • Exact match against user agent selected in: General options | Internet crawler | User agent ID.
  • User-agent: A1 Sitemap Generator if the product name appears in the above-mentioned HTTP user agent string.
  • User-agent: miggibot if the crawler engine name appears in the above-mentioned HTTP user agent string.
  • User-agent: *.
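As an example, a robots.txt file could target this crawler by its engine name while keeping separate rules for all other crawlers; the disallowed paths are hypothetical:

    # Rules applied when the user agent string contains "miggibot"
    User-agent: miggibot
    Disallow: /drafts/

    # Fallback rules for all other crawlers
    User-agent: *
    Disallow: /private/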

All disallow instructions found in robots.txt are internally converted into both analysis and output filters in A1 Sitemap Generator.