Microsys
        

Obey Noindex, Nofollow, Canonical and Robots.txt in Website Analyzer

Website Analyzer and Webmaster Crawl Filters

The website crawler in A1 Website Analyzer has many tools and options to ensure it can scan complex websites. These include complete support for the robots.txt file, noindex and nofollow in meta tags, and nofollow in link tags.

Tip: Downloading robots.txt will often make web servers and analytics software identify you as a website crawler robot.


In connection with these, you can also control how they get applied:
  • Disable Scan website | Crawler options | Apply "webmaster" and "list" filters after website scan.
  • Enable Create sitemap | Document options | Remove URLs excluded by "webmaster" and "list" filters.

Changing these two options is often useful if you intend to pause and resume crawls.

Note: You can also check the detected "state" flags that are related to webmaster filters. Just select the desired URL in Analyze website and view all information in Page data.



HTML Code for Canonical, NoIndex and NoFollow

  • Canonical:
    <link rel="canonical" href="http://www.example.com/list.php?sort=az" />
    Useful in cases where two different URLs return the same content. Consider reading about duplicate URLs, as there may be better solutions than canonical instructions, e.g. redirects.

  • NoFollow:
    • <a href="http://www.example.com/" rel="nofollow">bad link</a>
    • <meta name="robots" content="nofollow" />

  • NoIndex:
    <meta name="robots" content="noindex" />
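
As an illustration of how a crawler might detect the tags listed above, here is a minimal Python sketch built on the standard library `html.parser` module. The class name `WebmasterTagParser` and the helper `scan` are hypothetical names chosen for this example. The sketch is not how A1 Website Analyzer is implemented; it only shows the general idea.

```python
from html.parser import HTMLParser


class WebmasterTagParser(HTMLParser):
    """Collects canonical, noindex and nofollow hints from one HTML page."""

    def __init__(self):
        super().__init__()
        self.canonical = None      # href of <link rel="canonical">, if any
        self.noindex = False       # set by <meta name="robots" content="noindex">
        self.nofollow = False      # page-level nofollow from the robots meta tag
        self.nofollow_links = []   # hrefs of <a rel="nofollow"> links

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        attr = {name: (value or "") for name, value in attrs}
        rel_values = attr.get("rel", "").lower().split()
        if tag == "link" and "canonical" in rel_values:
            self.canonical = attr.get("href")
        elif tag == "meta" and attr.get("name", "").lower() == "robots":
            content = attr.get("content", "")
            directives = {d.strip().lower() for d in content.split(",")}
            self.noindex = self.noindex or "noindex" in directives
            self.nofollow = self.nofollow or "nofollow" in directives
        elif tag == "a" and "nofollow" in rel_values:
            self.nofollow_links.append(attr.get("href"))


def scan(html):
    """Parse an HTML string and return the populated parser object."""
    parser = WebmasterTagParser()
    parser.feed(html)
    parser.close()
    return parser
```

Calling `scan(page_source)` returns an object whose `canonical`, `noindex`, `nofollow` and `nofollow_links` attributes roughly correspond to the kind of "state" information a crawler could record per URL.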



Include and Exclude List and Analysis Filters

You can read more in our online sitemap generator help about analysis and output filters.


Match Behavior and Wildcards Support in Robots.txt

The match behavior in the website crawling engine used by our sitemap generator is similar to most search engines.

Support for wildcard symbols in robots.txt file:

  • Standard: Matches from the beginning of the URL path for the length of the filter (a prefix match).
    gre will match: greyfox, greenfox and green/fox.
  • Wildcard *: Matches any sequence of characters until the rest of the filter can match.
    gr*fox will match: greyfox, grayfox, growl-fox and green/fox.
    Tip: Wildcard filters in robots.txt are often incorrectly configured and a common source of crawling problems.
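
The two match rules above can be sketched in a few lines of Python. This is an illustration of the described behavior only, not the code used by the crawler; the function names `rule_to_regex` and `rule_matches` are hypothetical. The sample strings follow the examples in the list above. Other pattern symbols that some search engines support are not covered, since they are not described here.

```python
import re


def rule_to_regex(rule):
    """Translate a robots.txt style filter into a compiled regular expression."""
    # Escape each literal piece, then join the pieces with ".*" so that
    # every "*" in the filter matches any run of characters.
    pieces = [re.escape(piece) for piece in rule.split("*")]
    return re.compile(".*".join(pieces))


def rule_matches(rule, path):
    """Return True if the filter matches the path from its beginning."""
    # re.match anchors only at the start, which gives the "match from
    # beginning to length of filter" prefix behavior.
    return rule_to_regex(rule).match(path) is not None


for sample in ("greyfox", "grayfox", "growl-fox", "green/fox"):
    print(sample, rule_matches("gre", sample), rule_matches("gr*fox", sample))
```

Because the match is anchored only at the start, a filter without a wildcard behaves as a plain prefix test, which is why gre matches green/fox but not grayfox.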

The sitemap generator matches against the following user-agent strings in robots.txt:
  • Exact match against the user agent selected in: General options | Internet crawler | User agent ID.
  • User-agent: A1 Website Analyzer, if the product name is part of the user agent string mentioned above.
  • User-agent: *.
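
To make the list above concrete, the following Python sketch collects the Disallow rules from every robots.txt group whose User-agent line is one of the three candidates. It rests on assumptions that are not stated on this page: that rules from all matching groups are combined rather than only the most specific group being used, and that names are compared exactly. The function names and the sample user agent string in the usage note are hypothetical.

```python
PRODUCT_NAME = "A1 Website Analyzer"


def applicable_user_agent_names(configured_user_agent):
    """List the User-agent values the crawler would look for in robots.txt."""
    names = [configured_user_agent]
    # The product name is only considered if it is part of the configured string.
    if PRODUCT_NAME in configured_user_agent and PRODUCT_NAME not in names:
        names.append(PRODUCT_NAME)
    names.append("*")
    return names


def disallow_rules_for(robots_txt, configured_user_agent):
    """Collect Disallow values from all groups that apply (assumed: a union)."""
    wanted = set(applicable_user_agent_names(configured_user_agent))
    rules = []
    group_agents = []      # User-agent values of the group being read
    seen_rule = False      # True once the current group has a non User-agent line
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if seen_rule:                          # a new group starts here
                group_agents, seen_rule = [], False
            group_agents.append(value)
        else:
            seen_rule = True
            if field == "disallow" and value and wanted.intersection(group_agents):
                rules.append(value)
    return rules
```

For example, with a configured user agent of "Mozilla/5.0 (compatible; A1 Website Analyzer)", the sketch would read the groups for that exact string, for A1 Website Analyzer, and for *, and ignore groups addressed to other crawlers.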

All disallow instructions found in robots.txt are internally converted into both analysis and output filters in A1 Website Analyzer.


 © Copyright 1997-2010 Microsys