Abstract: Improve website crawler speed and memory usage during site scan. Scan websites with many pages. Create Text, HTML, RSS and XML sitemaps.
Crawler Speed and Webserver Performance
Webserver speed is important if you use database-generated content.
Forums, portals and CMS-based websites often trigger many SQL queries per HTTP request.
Tips on how to improve your webserver performance when scanning:
If database queries are the primary bottleneck, consider upgrading the webserver.
Make sure the database is not capped (e.g. through its license) at a maximum number of simultaneous users/requests.
If your pages read from or write to files or other shared resources, this may stall other connections that need access to the same resources.
Check your webserver logs and scan your website at times with low bandwidth usage.
Uncheck Scan website > Crawler options > Apply "webmaster" and "output" filters after website scan:
Makes the website crawler keep all URLs. This ensures no URL is tested more than once.
URLs matched by output filters during website scan are tagged and ignored when creating sitemaps.
See below concerning GET versus HEAD requests and how this can affect resume efficiency.
Bonus tip: If you have a large website it can often be a good idea to make a few test scans. That way you can configure URL exclude filters before starting a major website crawl.
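The tip above about checking webserver logs for quiet periods can be sketched in a few lines of Python. This is a minimal illustration, not part of the product: it assumes the access log uses Common Log Format timestamps such as [10/Oct/2000:13:55:36 -0700], and the function names are hypothetical.

```python
import re
from collections import Counter

# Matches the hour inside a Common Log Format timestamp such as
# [10/Oct/2000:13:55:36 -0700]. The log format is an assumption.
_TIMESTAMP_HOUR = re.compile(
    r"\[\d{2}/[A-Za-z]{3}/\d{4}:(\d{2}):\d{2}:\d{2} [+-]\d{4}\]"
)


def requests_per_hour(log_lines):
    """Count logged requests for each hour of the day (0-23)."""
    counts = Counter({hour: 0 for hour in range(24)})
    for line in log_lines:
        match = _TIMESTAMP_HOUR.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def quietest_hours(log_lines, how_many=3):
    """Return the hours with the fewest logged requests, quietest first."""
    counts = requests_per_hour(log_lines)
    return sorted(counts, key=lambda hour: (counts[hour], hour))[:how_many]
```

Feeding it the lines of an access log (for example via open(path) on a log file of your own) gives a rough idea of when a large scan would compete least with regular visitors.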
Reduce peak resource usage on the crawler and webserver computers:
In Scan website > Crawler engine
Adjust the maximum number of simultaneous connections:
Increase it if the webserver backend and bandwidth can handle more requests.
Decrease it if the webserver backend (e.g. database queries) is a bottleneck.
Adjust timeout options:
Milliseconds to wait before a read times out.
Milliseconds to wait before a connect times out.
Connection attempts before giving up.
How to change the settings:
Increase the values if you want to make sure you grab all links and pages.
Decrease the values if you want the website scan to be as fast as possible.
Enable/disable persistent connections, which affects how the client and server communicate.
Enable/disable using GET requests (instead of HEAD followed by GET). If you plan on using the resume functionality, you may want to enable HEAD requests, as they allow the website crawler to quickly resolve the URLs it finds. This in turn makes the count of fully analyzed pages grow faster.
Enable/disable GZip compression of all data transferred between crawler and server.
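To illustrate how the connection limit and the "connection attempts before giving up" setting interact, here is a minimal Python sketch. It is not the crawler's actual implementation: the function names are hypothetical, and fetch is any caller-supplied function that downloads one URL and raises an OSError (which includes timeouts and connection errors in Python 3) on failure.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_with_retries(fetch, url, attempts=3):
    """Call fetch(url) up to `attempts` times; return None if every try fails."""
    for _ in range(attempts):
        try:
            return fetch(url)
        except OSError:
            # Timeouts and connection errors are OSError subclasses.
            continue
    return None


def crawl(urls, fetch, max_connections=4, attempts=3):
    """Fetch all URLs with at most `max_connections` requests in flight."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_connections) as pool:
        results = pool.map(
            lambda url: fetch_with_retries(fetch, url, attempts), urls
        )
        return dict(zip(urls, results))
```

A real fetch could wrap urllib.request.urlopen(url, timeout=...). Lowering max_connections reduces the load on a database-bound server, while lowering attempts and the timeout makes the scan finish sooner at the risk of missing slow pages.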
Project Files and Project Data Storage
If you have trouble loading a website project with extended website data:
Project data is stored in a subdirectory next to the project file, named after it.
Project file: c:\example\projects\demo.ini
Project data directory: c:\example\projects\demo\
Delete the project files that contain extended website data:
hotarea-normal-ex.xml
hotarea-external-ex.xml
Try loading the project file again.
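The cleanup steps above can also be scripted. The following Python sketch assumes the layout shown in the example (a data directory with the same name as the project file, minus its extension); the function names are hypothetical, and it is wise to back up the project directory before deleting anything.

```python
from pathlib import Path

# File names taken from the list above.
EXTENDED_DATA_FILES = ("hotarea-normal-ex.xml", "hotarea-external-ex.xml")


def project_data_dir(project_file):
    """Return the data directory that sits next to the project file."""
    return Path(project_file).with_suffix("")


def remove_extended_data(project_file):
    """Delete the extended-data files if present; return the names removed."""
    data_dir = project_data_dir(project_file)
    removed = []
    for name in EXTENDED_DATA_FILES:
        target = data_dir / name
        if target.is_file():
            target.unlink()
            removed.append(name)
    return removed
```

On Windows, calling remove_extended_data(r"c:\example\projects\demo.ini") would look for the two files inside c:\example\projects\demo and leave all other project data untouched.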
Improve speed and memory usage when saving/loading projects:
In the top menu Options, for saving/loading projects:
Use optimized XML data storage.
Exclude extended website data.
Limit data to sitemap URLs.
Your Computer Configuration
The larger the website, the more important memory becomes.
Physical memory (RAM) is much faster than virtual memory.
During website scan, lots of data gets stored in memory.
If you have a gigantic website, consider increasing memory address space as explained in its own section below.
If you have bandwidth problems, you can try the following:
Perform website scans during the night when your ISP has few online users.
Make sure no download applications are running.
Increase Memory Address Space for 32-bit Applications
Increasing the memory address space enables the 32-bit Windows versions of products such as our sitemap generator to address more than 2 gigabytes of memory.