Microsys

Resume Scan of Large Websites on Unstable Webservers

Resume Scan to Fix URLs with Error Response Codes

There can be many reasons to resume a website scan after it has stopped. One is webservers that become overloaded and respond with error codes for URLs, typically:
  • 503 : Service Temporarily Unavailable
  • 500 : Internal Server Error
  • -4 : CommError

An easy way to determine whether any errors remain anywhere in the website is to view the directory summary for the root directory / domain.

[Screenshot: website directory summary]

Then use the resume functionality in A1 Sitemap Generator to recrawl these URLs. Repeat this until no errors are left.

[Screenshot: resume website scan]
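
The "recrawl until no errors are left" idea can be sketched in a few lines of Python. This is only an illustration of the principle, not how A1 Sitemap Generator is implemented: the function name `recrawl_until_clean`, the `fetch` callback and the `max_rounds` limit are all assumptions made for the example.

```python
from typing import Callable, Dict

# Response codes the article lists as typical for overloaded servers.
# -4 is the crawler's own "CommError" code, not an HTTP status.
RETRYABLE_CODES = {503, 500, -4}


def recrawl_until_clean(
    results: Dict[str, int],
    fetch: Callable[[str], int],
    max_rounds: int = 5,
) -> Dict[str, int]:
    """Re-request only the URLs whose last response code was an error.

    `results` maps URL -> last response code. `fetch` is a hypothetical
    callback that requests one URL and returns its response code.
    """
    results = dict(results)  # work on a copy, leave the input untouched
    for _ in range(max_rounds):
        failed = [url for url, code in results.items() if code in RETRYABLE_CODES]
        if not failed:
            break  # no errors left, nothing more to resume
        for url in failed:
            results[url] = fetch(url)
    return results
```

URLs that already responded without error are never requested again, which is the point of resuming instead of starting a new scan.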

While the website scan is running you can at any time:
  1. Pause and stop the website scan.
  2. Save your project and create sitemaps.
  3. Resume and continue your website scan.

How to use website scan and crawl resume functionality:
  • Pausing a scan is the same as stopping a scan. If the scan is stopped/paused or the internet connection drops, you can resume.
  • To resume a scan, press the Resume button so it appears in its pushed-down state. Then click the Start scan button.


You can force a recrawl of certain URLs by changing their state flags in Website analysis before resuming the website scan:

[Screenshot: resume crawl state flags]


Pausing Website Scan Removes URLs

By default, when a website scan finishes for whatever reason (e.g. because it was paused), all URLs that are excluded by webmaster filters (such as robots.txt) and by output filters are normally removed.

This behavior is not always wanted, since it means the website crawler will spend time rediscovering URLs it has already tested. Instead, you can do the following:

  • Disable Scan website | Crawler options | Apply "webmaster" and "list" filters after website scan.
  • Enable Create sitemap | Document options | Remove URLs excluded by "webmaster" and "list" filters.

If filtered URLs are not removed from the website scan results, you can view their filtering state flags:

[Screenshot: resume crawl filter flags]
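
As a rough illustration of keeping filtered URLs but marking them with a flag, the sketch below uses Python's standard `urllib.robotparser` to flag URLs that a robots.txt file excludes instead of dropping them. The function name `flag_excluded` and the example URLs are assumptions for this example; the sketch does not reflect how A1 Sitemap Generator stores its flags.

```python
from typing import Dict, Iterable, List
from urllib.robotparser import RobotFileParser


def flag_excluded(
    urls: Iterable[str],
    robots_lines: List[str],
    user_agent: str = "*",
) -> Dict[str, bool]:
    """Map each URL to True if robots.txt excludes it, False otherwise.

    The URLs stay in the result set, so a resumed crawl does not have
    to rediscover them. Only the flag records that they are filtered.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse robots.txt content given as lines
    return {url: not parser.can_fetch(user_agent, url) for url in urls}


# Hypothetical usage with a small in-memory robots.txt:
flags = flag_excluded(
    ["http://example.com/", "http://example.com/private/a.html"],
    ["User-agent: *", "Disallow: /private/"],
)
```

The excluded URLs can then be left out when the sitemap is written, which corresponds to applying the filters at sitemap creation instead of after the scan.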


Resume Website Scan Analyzes URLs Again

To understand how the website crawl works, note that URLs pass through these states:
  • Found URLs: These are URLs that have been resolved and tested.
  • Analyzed content URLs: The content of these URLs (pages) has been analyzed.
  • Analyzed references URLs: The links found in the content of these URLs (pages) have been resolved.
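
The three states can be modeled as simple flags per URL. The sketch below is a simplified model for illustration only; the names `UrlState`, `needs_reanalysis` and `pages_to_reanalyze` are assumptions and not part of A1 Sitemap Generator.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class UrlState:
    found: bool = False                # URL has been resolved and tested
    analyzed_content: bool = False     # page content has been analyzed
    analyzed_references: bool = False  # links found in the page have been resolved


def needs_reanalysis(state: UrlState) -> bool:
    """A found page is unfinished until all of its links have been resolved."""
    return state.found and not state.analyzed_references


def pages_to_reanalyze(states: Dict[str, UrlState]) -> List[str]:
    """Return the URLs a resumed scan would have to analyze again."""
    return [url for url, state in states.items() if needs_reanalysis(state)]
```

In this model a page whose content was analyzed but whose links were not all resolved still counts as unfinished, so the less often a scan stops in that intermediate state, the less work a resume repeats.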

Because of these states, any page whose links have not all been resolved must be analyzed again when the scan is resumed. To avoid this problem, the website crawler can use HEAD requests to quickly resolve links to verified URLs. While this causes additional (albeit light) requests to the server during the crawl, it reduces wasted work to almost zero when using the resume functionality.

To configure this, disable: Scan website > Crawler engine > Default to GET for page requests
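
For readers unfamiliar with HEAD requests: a HEAD request asks the server for the response status and headers only, without the page body, which is why it is a light way to verify a link. The sketch below shows the general technique with Python's standard `urllib`. The function names are assumptions, and returning -4 on connection failure simply mirrors the CommError code mentioned earlier in this article; this is not the crawler's actual implementation.

```python
import urllib.error
import urllib.request


def build_head_request(url: str) -> urllib.request.Request:
    """Create a request that asks for headers only (no response body)."""
    return urllib.request.Request(url, method="HEAD")


def resolve_status(url: str, timeout: float = 10.0) -> int:
    """Return the response code for a URL using a HEAD request.

    Returns -4 on connection problems (an assumption made here to
    mirror the CommError code listed in the article).
    """
    try:
        with urllib.request.urlopen(build_head_request(url), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 500 or 503 from an overloaded server
    except OSError:
        return -4  # URLError and timeouts are OSError subclasses
```

Because no page content is downloaded, each check costs the server very little compared with a full GET request.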


Find More URLs in Website Scan

If you have trouble getting all URLs included in the generated sitemaps, it is important to first follow the recommendations above. The reason is that URLs with error response codes are not crawled for links. By solving that problem, you will usually also end up with more URLs.

You can find more suggestions and tips in our website crawl help article.

© Copyright 1997-2010 Microsys