Microsys

Scan Large Websites and Create XML Sitemaps

Crawler speed and webserver performance - Forums and portals

Webserver speed is important if you use database-generated content. Forums, portals and CMS-based websites often trigger many SQL queries per HTTP request.

Tips on how to improve your webserver performance when scanning:
  • If database queries are the primary bottleneck, consider upgrading the webserver.
  • Make sure the database is not capped (e.g. through its license) at a maximum number of simultaneous users/requests.
  • If pages read from or write to shared files/resources, one request may stall other connections that require access to the same resources.
  • Check your webserver logs and scan your website at times with low bandwidth usage.
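
As a rough illustration of the last tip, the sketch below totals the bytes served per hour of day from an access log, so the quietest hours can be picked for scanning. It assumes the log uses the Common Log Format (or the Combined format that extends it); the function names and the sample file name are hypothetical, and other log formats would need a different pattern.

```python
import re
from collections import Counter

# Assumed format (Common Log Format), e.g.:
# 1.2.3.4 - - [10/Oct/2010:13:55:36 +0200] "GET / HTTP/1.1" 200 2326
LOG_LINE = re.compile(
    r'\[\d{2}/\w{3}/\d{4}:(\d{2}):\d{2}:\d{2} [^\]]*\] "[^"]*" \d{3} (\d+|-)'
)

def bytes_per_hour(lines):
    """Return a Counter mapping hour of day (0-23) to total bytes sent."""
    totals = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        hour, size = match.groups()
        # A size of "-" means no response body was sent.
        totals[int(hour)] += 0 if size == "-" else int(size)
    return totals

def quietest_hours(lines, count=3):
    """Return the `count` hours with the least traffic, quietest first."""
    totals = bytes_per_hour(lines)
    return sorted(range(24), key=lambda hour: (totals[hour], hour))[:count]
```

For example, `quietest_hours(open("access.log"))` would list candidate hours for a scan, where `access.log` stands in for wherever your webserver writes its log.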


Crawler configuration - Boost speed and minimize memory usage

Changing settings can make a huge difference if you are experiencing problems such as:
  • Website scans run very slowly.
  • The webserver is dropping connections, preventing the crawler from getting all pages.
  • You have large websites with 10,000 or 100,000+ pages you wish to scan.

Lessen overall resource usage on crawler computer:
  • Disable Scan website > Data collection > Tracking and storage of extended website data.
  • Disable Scan website > Data collection > Storage of external links.
  • Disable Scan website > Data collection > Storage of response headers.
  • Disable Scan website > Data collection > Logging of progress.
  • Disable Scan website > Crawler options > Crawling pages with error response codes.
  • Disable Scan website > Crawler options > Cookies support.
  • Disable Scan website > Webmaster filters > Obey "robots.txt" file "crawl delay" directive.
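
To see why the "crawl delay" setting can matter for scan time, the sketch below reads the delay a robots.txt file declares and estimates the minimum time a scan would take if one request were sent per delay interval. It uses Python's standard `urllib.robotparser`; the helper names are hypothetical, and the one-request-per-interval model is a simplifying assumption rather than a description of how the crawler schedules requests.

```python
from urllib.robotparser import RobotFileParser

def crawl_delay_seconds(robots_txt, user_agent="*"):
    """Return the Crawl-delay declared for user_agent, or None if absent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)

def minimum_scan_hours(page_count, delay_seconds):
    """Lower bound on scan time when one page is requested per delay interval."""
    return page_count * delay_seconds / 3600
```

Under that assumption, a delay of 10 seconds across 100,000 pages works out to roughly 278 hours.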

Lessen peak resource usage on crawler computer:
  • In the top menu Options, disable showing result data after website scan. Create sitemap will open instead.

Lessen overall resource usage on crawler and webserver computer:
  • In Scan website > Crawler filters
    • Configure which directories and pages the website crawler can ignore.

  • Using Scan website > Output filters and website scan pause / resume:
    • See help about Website scan output filters.
    • Uncheck Scan website > Crawler options > Apply "webmaster" and "output" filters after website scan:
      • Makes the website crawler keep all URLs. This ensures no URL is tested more than once.
      • URLs matched by output filters during website scan are tagged and ignored when creating sitemaps.
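
The idea of keeping every URL but tagging the filtered ones can be illustrated with a toy crawler. This is only a sketch of the general technique, not the product's implementation: the function names, the path-prefix filter, and the injected `get_links` callback are all assumptions made for the example.

```python
from urllib.parse import urlsplit

def matches_filter(url, excluded_prefixes):
    """True if the URL path starts with any excluded prefix (assumed filter rule)."""
    path = urlsplit(url).path
    return any(path.startswith(prefix) for prefix in excluded_prefixes)

def scan(start_urls, get_links, excluded_prefixes):
    """Crawl from start_urls and return {url: excluded_from_sitemap}.

    Every URL is remembered, including filtered ones, so each URL is
    requested through get_links at most once.
    """
    seen = {}
    queue = list(start_urls)
    while queue:
        url = queue.pop()
        if url in seen:
            continue  # already tested, do not request it again
        seen[url] = matches_filter(url, excluded_prefixes)
        queue.extend(get_links(url))
    return seen

def sitemap_urls(seen):
    """Return only the URLs that were not tagged by a filter."""
    return sorted(url for url, excluded in seen.items() if not excluded)
```

Because tagged URLs stay in `seen`, pausing and resuming a scan built this way would not re-test them; they are simply left out when the sitemap list is produced.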

Lessen peak resource usage on crawler and webserver computer:
  • In Scan website > Crawler engine
    • Adjust the number of max simultaneous connections:
      • Increase it if the webserver backend and bandwidth can handle more requests.
      • Decrease it if the webserver backend (e.g. database queries) is a bottleneck.
    • Adjust timeout options:
      • Milliseconds to wait before read times out.
      • Milliseconds to wait before connect times out.
      • Connection attempts before giving up.
      • How to change these settings:
        • Increase the values if you want to make sure you grab all links and pages.
        • Decrease the values if you want the website scan to be as fast as possible.
    • Enable/disable usage of persistent connections.
    • Enable/disable using GET requests (instead of HEAD followed by GET).
    • Enable/disable GZip compression of all data transferred between crawler and server.
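
How the connection limit, timeout and retry settings above interact can be shown with a small generic sketch. It is not the crawler's own code: `crawl`, `fetch_with_retries` and `http_fetch` are hypothetical names, and `urllib` applies a single timeout to blocking socket operations rather than separate connect and read values.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

def http_fetch(url, timeout_seconds=10.0):
    """Download one URL; raises an OSError subclass on timeout or failure."""
    with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
        return response.read()

def fetch_with_retries(fetch, url, attempts):
    """Call fetch(url) up to `attempts` times; return None if every try fails."""
    for _ in range(attempts):
        try:
            return fetch(url)
        except OSError:
            continue  # timeouts and dropped connections are OSError subclasses
    return None

def crawl(urls, fetch=http_fetch, max_connections=4, attempts=3):
    """Fetch a list of urls with at most `max_connections` requests in flight."""
    with ThreadPoolExecutor(max_workers=max_connections) as pool:
        bodies = pool.map(lambda url: fetch_with_retries(fetch, url, attempts), urls)
        return dict(zip(urls, bodies))
```

A higher `max_connections` loads the webserver more heavily at peak, while larger timeouts and more attempts trade scan speed for a better chance of retrieving every page.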


Project Files and Project Data Storage

  • If you have trouble loading a website project with extended website data:
    • Project data is stored in a subdirectory alongside the project file.
      • Project file: c:\example\projects\demo.ini.
      • Project data directory: c:\example\projects\demo\.
    • Delete project files that contain extended website data:
      • hotarea-normal-ex.xml.
      • hotarea-external-ex.xml.
    • Try loading the project file again.
  • Improve speed and memory usage when saving/loading projects:
    • In top menu Options for save/load project:
      • Use optimized XML data storage.
      • Exclude extended website data.
      • Limit data to sitemap URLs.
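
For the troubleshooting steps on extended website data, the sketch below derives the data directory and the two extended-data file paths from a project file path. It assumes, based only on the demo.ini example above, that the data directory is named after the project file without its extension; the function names are hypothetical, and the code only computes paths without deleting anything.

```python
from pathlib import PureWindowsPath

# File names listed in this article as containing extended website data.
EXTENDED_DATA_FILES = ("hotarea-normal-ex.xml", "hotarea-external-ex.xml")

def project_data_dir(project_file):
    """Return the assumed data directory: the project path minus its extension."""
    return PureWindowsPath(project_file).with_suffix("")

def extended_data_paths(project_file):
    """Return the paths of the extended-data files for a project file."""
    data_dir = project_data_dir(project_file)
    return [data_dir / name for name in EXTENDED_DATA_FILES]
```

Calling `extended_data_paths(r"c:\example\projects\demo.ini")` yields the two files inside `c:\example\projects\demo`, which can then be checked by hand before removing them.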


Your computer configuration - Memory and bandwidth

  • The larger the website the more important memory becomes.
    • Hardware memory is much faster than virtual memory.
    • During website scan, lots of data gets stored in memory.
    • If you have a gigantic website, consider increasing the memory address space. This enables the 32-bit Windows versions of products such as A1 Sitemap Generator to address more than 2 gigabytes of memory.
  • If you have bandwidth problems, you can try the following:
    • Perform website scans during the night when your ISP has few online users.
    • Make sure no download applications are running.



 © Copyright 1997-2010 Microsys