Sitemap Generator Obeys Noindex, Nofollow, Canonical and Robots
The desktop sitemap generator tool can scan websites. There is optional support for obeying the robots.txt file, noindex and nofollow in meta tags, and nofollow in link tags.
Sitemap Generator and Webmaster Crawl Filters
The website crawler in A1 Sitemap Generator has many tools and options to ensure it can scan complex websites. These include complete support for the robots.txt file, noindex and nofollow in meta tags, and nofollow in link tags.
Note that scanning websites will often make webservers and analytics software identify you as a website crawler robot.
In connection with these options, you can also control how they are applied:
- Disable Scan website | Crawler options | Apply "webmaster" and "output filters" after website scan stops.
- Enable Create sitemap | Document options | Remove URLs excluded by "webmaster" and "list" filters.
This is often useful if you intend to pause and resume crawls.
You can also check the detected "state" flags related to the webmaster filters: select the desired URL in Analyze website and view all the information in Page data.
HTML Code for Canonical, NoIndex and NoFollow
<link rel="canonical" href="http://www.example.com/list.php?sort=az" />
This is useful in cases where two different URLs return the same content. Consider reading about duplicate URLs, as there may be better solutions than using canonical instructions, e.g. redirects.
<a href="http://www.example.com/" rel="nofollow">bad link</a>
<meta name="robots" content="nofollow" />
<meta name="robots" content="noindex" />
Include and Exclude List and Analysis Filters
You can read more about include and exclude list and analysis filters in the online help for A1 Sitemap Generator.
Match Behavior and Wildcards Support in Robots.txt
The match behavior in the website crawler used by A1 Sitemap Generator is similar to that of most search engines, including support for wildcard symbols in robots.txt:
- Standard: matches from the beginning of the URL up to the length of the filter. gre will match: greyfox, greenfox and green/fox.
- Wildcard *: matches any characters until another match becomes possible. gr*fox will match: greyfox, grayfox, growl-fox and green/fox.
Tip: Wildcard filters in robots.txt are often incorrectly configured and a common source of crawling problems.
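As a rough illustration, the two matching rules above can be sketched in Python. This is a simplified sketch of the described behavior, not the actual implementation used by A1 Sitemap Generator:

```python
import re

def robots_rule_matches(rule: str, path: str) -> bool:
    """Check whether a robots.txt rule matches a URL path.

    Plain rules match as a prefix of the path; '*' matches any run of
    characters (including '/'), as in the examples above.
    """
    # Escape regex metacharacters in the rule, then turn '*' into '.*'.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    # Anchor at the start only: a rule matches any path it prefixes.
    return re.match(pattern, path) is not None

# Examples from the text:
robots_rule_matches("gre", "greyfox")       # prefix match
robots_rule_matches("gr*fox", "green/fox")  # '*' spans '/'
```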
The crawler in our sitemap generator tool will obey the following user agent IDs in robots.txt:
- Exact match against user agent selected in: General options | Internet crawler | User agent ID.
- User-agent: A1 Sitemap Generator if the product name is found in the above-mentioned HTTP user agent string.
- User-agent: miggibot if the crawler engine name is found in the above-mentioned HTTP user agent string.
- User-agent: *.
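For example, in a hypothetical robots.txt like the one below (paths are made up for illustration), a crawler matching one of the IDs above would use the miggibot section rather than the generic fallback section:

```
# Section applied when the crawler engine name matches:
User-agent: miggibot
Disallow: /private/

# Fallback section for all other crawlers:
User-agent: *
Disallow: /
```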
All disallow instructions found in robots.txt are internally converted into exclude filters in A1 Sitemap Generator.
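A minimal sketch of this idea in Python might look like the following: collect the Disallow rules from the robots.txt group that matches the crawler's user agent, falling back to the "User-agent: *" group. This is a simplified illustration under those assumptions, not the actual conversion logic in A1 Sitemap Generator:

```python
def disallow_rules(robots_txt: str, user_agent: str) -> list[str]:
    """Collect Disallow rules that apply to user_agent.

    Rules from a specifically matching User-agent group take
    precedence; the 'User-agent: *' group is used as fallback.
    Simplified: ignores Allow lines and other edge cases.
    """
    groups = {"specific": [], "wildcard": []}
    current = None  # which group the following rules belong to
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if value == "*":
                current = "wildcard"
            elif value and value.lower() in user_agent.lower():
                current = "specific"
            else:
                current = None
        elif field == "disallow" and current and value:
            groups[current].append(value)
    # Use the specific group if it produced any rules, else fallback.
    return groups["specific"] or groups["wildcard"]
```

The returned list could then serve as the URL exclude filters applied during the crawl.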