The website crawler in A1 Website Analyzer has many tools and options to ensure it can scan complex websites. These include complete support for the robots.txt file, noindex and nofollow in meta tags, and nofollow in link tags.
Tip: Downloading robots.txt will often make webservers and analytics software identify you as a website crawler robot.
In connection with these, you can also control how they get applied:
- Disable Scan website | Crawler options | Apply "webmaster" and "list" filters after website scan.
- Enable Create sitemap | Document options | Remove URLs excluded by "webmaster" and "list" filters.
This is often useful if you intend to pause and resume crawls.
Note: You can also check the detected "state" flags that are related to webmaster filters. Just select the desired URL in Analyze website and view all information in Page data.
The match behavior in the website crawling engine used by our sitemap generator is similar to that of most search engines.
Support for wildcard symbols in the robots.txt file:
- Standard: Match from beginning to length of filter.
  gre will match: greyfox, greenfox and green/fox.
- Wildcard *: Match any character until another match becomes possible.
  gr*fox will match: greyfox, grayfox, growl-fox and green/fox.
Tip: Wildcard filters in robots.txt are often incorrectly configured and a source of crawling problems.
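The two matching rules above can be tried out with a small Python sketch. This is only an illustration of the described behavior, not the code used inside A1 Website Analyzer. The function name `robots_pattern_matches` is hypothetical, and the sketch assumes that `*` may match any run of characters, including none.

```python
import re


def robots_pattern_matches(pattern: str, path: str) -> bool:
    """Return True if `path` matches the robots.txt style `pattern`.

    Assumptions of this sketch: the pattern is compared from the start
    of the path (prefix match), and each '*' matches zero or more of
    any character.
    """
    # Escape the literal pieces and join them with '.*' for each wildcard.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    # re.match only anchors at the beginning, which gives the prefix behavior.
    return re.match(regex, path, flags=re.DOTALL) is not None


if __name__ == "__main__":
    for candidate in ("greyfox", "greenfox", "green/fox"):
        print("gre", candidate, robots_pattern_matches("gre", candidate))
    for candidate in ("greyfox", "grayfox", "growl-fox", "green/fox"):
        print("gr*fox", candidate, robots_pattern_matches("gr*fox", candidate))
```

Testing a filter this way before adding it to robots.txt can help catch the configuration mistakes mentioned in the tip above.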
The sitemap generator matches against the following user-agent strings in robots.txt:
- Exact match against user agent selected in: General options | Internet crawler | User agent ID.
- User-agent: A1 Website Analyzer if product name is in above mentioned user agent string.
- User-agent: *.
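As a rough illustration of the list above, the following Python sketch picks the disallow rules for a crawler from a robots.txt file. It is not the actual implementation. It assumes that the three user-agent checks are tried in the order listed, that the first group found wins, and that names are compared without regard to case. The function names and the user agent string in the usage example are hypothetical.

```python
PRODUCT_NAME = "A1 Website Analyzer"


def parse_groups(robots_txt: str) -> dict[str, list[str]]:
    """Map each lower-cased User-agent name to its Disallow paths."""
    groups: dict[str, list[str]] = {}
    current: list[str] = []  # user agents of the group being read
    in_rules = False         # True once the group has rule lines
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (s.strip() for s in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current, in_rules = [], False
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value:  # empty Disallow blocks nothing
                for agent in current:
                    groups[agent].append(value)
    return groups


def select_disallows(robots_txt: str, user_agent_id: str) -> list[str]:
    """Pick Disallow paths: exact user agent, then product name, then '*'."""
    groups = parse_groups(robots_txt)
    candidates = [user_agent_id.lower()]
    if PRODUCT_NAME.lower() in user_agent_id.lower():
        candidates.append(PRODUCT_NAME.lower())
    candidates.append("*")
    for name in candidates:
        if name in groups:
            return groups[name]
    return []


if __name__ == "__main__":
    sample = """
User-agent: *
Disallow: /private/

User-agent: A1 Website Analyzer
Disallow: /tmp/
"""
    # Hypothetical user agent string, used only for this example.
    print(select_disallows(sample, "Mozilla/5.0 (compatible; A1 Website Analyzer)"))
    print(select_disallows(sample, "SomeOtherBot"))
```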
All disallow instructions found in robots.txt are internally converted into both analysis and output filters in A1 Website Analyzer.