Crawl and Create XML Sitemaps of Large Websites

Scan very large websites with many pages. Improve speed and memory usage during crawls. Create text, HTML, RSS and XML sitemaps.

Common Website Reasons for Crawler Memory Usage

If A1 Sitemap Generator freezes during a scan because it runs out of memory (or needs better configuration), you may first want to check whether the cause, and the solution, lies with the website itself.

List of things to check:
  • Check if your website generates an infinite number of unique URLs. If it does, the crawler will never stop, as new unique page URLs keep being found. A good way to discover and solve these kinds of problems:
    • Start a website scan.
    • Stop the website scan after e.g. half an hour.
    • Inspect whether everything appears correct, i.e. whether most of the URLs found seem valid.

    Example #1
    A website returns 200 instead of 404 for broken page URLs. Example of an infinite pattern:
    The original 1/broken.html links to 1/1/broken.html, which links to 1/1/1/broken.html, and so on.

    Example #2
    The website CMS generates a huge number of 100% duplicate URLs for each actually existing URL. To read more about duplicate URLs, see this help page. Remember that you can analyze and investigate internal website linking in case something looks wrong.

  • Check if your project configuration and website content will cause the crawler to download files hundreds of megabytes in size.

    Example #1: Your website contains many huge files (hundreds of megabytes each) that the crawler must download. (While memory is freed after each download completes, such files can still cause problems on computers with little memory.)
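As an illustration of the first check, the nesting pattern from Example #1 can be spotted with a small heuristic that flags URLs whose path repeats the same segment many times in a row. This helper is a hypothetical sketch, not part of A1 Sitemap Generator:

```python
# Heuristic crawler-trap detector: flags URLs such as /1/1/1/broken.html
# where the same path segment repeats consecutively. The threshold of 3
# repeats is an arbitrary assumption; tune it for your site structure.
from urllib.parse import urlparse

def looks_like_crawler_trap(url: str, max_repeats: int = 3) -> bool:
    """Return True if the URL path repeats a segment max_repeats times in a row."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    run = 1
    for prev, cur in zip(segments, segments[1:]):
        run = run + 1 if cur == prev else 1
        if run >= max_repeats:
            return True
    return False
```

Running such a check over the URL list of a stopped scan quickly reveals whether the crawler is descending into an endless pattern.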

Note: If you need extended help with analyzing your website for problems, we also offer sitemap and SEO services.


Crawler Speed and Webserver Performance

Webserver and database speed is important if you use database-generated content. Forums, portals and CMS-based websites often trigger many SQL queries per HTTP request. Based on our experience, here are some examples of webserver performance bottlenecks:


Database


Check if database performance is the primary bottleneck when hit by many simultaneous connections and queries. In addition, make sure the database is not capped (e.g. by its license) at a maximum number of simultaneous users/requests.


Resources and Files


If page requests read/write files or other shared resources, this can stall other connections that need access to the same resources.


Resources and Sessions


On the IIS webserver, reads and writes of session information can get queued by a locking mechanism. This can cause problems when handling a high load of HTTP requests from the same crawler session across many pages simultaneously.



Scan When Traffic Is Low


Check your webserver logs and scan your website at times with low bandwidth usage.
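To find a quiet window, you can tally requests per hour from your access logs. The sketch below assumes an Apache-style timestamp such as [10/Oct/2024:13:55:36 +0000]; adjust the pattern to your server's log format:

```python
# Count HTTP requests per hour of day from access-log lines, so a scan
# can be scheduled into the quietest hours. The timestamp layout
# "[dd/Mon/yyyy:hh:" is an assumption about your log format.
import re
from collections import Counter

def requests_per_hour(log_lines):
    """Return a Counter mapping hour of day (0-23) to request count."""
    hour_re = re.compile(r"\[\d{2}/\w{3}/\d{4}:(\d{2}):")
    hours = Counter()
    for line in log_lines:
        match = hour_re.search(line)
        if match:
            hours[int(match.group(1))] += 1
    return hours
```

The hour with the smallest count, e.g. `min(counts, key=counts.get)`, is a reasonable candidate for scheduling scans.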


Choosing The Correct Executable

The A1 Sitemap Generator setup file installs multiple executables into the program installation directory, each optimized for different systems. While the setup installation will pick the one that matches your computer system best, you may want to use one of the others in case of too high memory usage.

The list of installed executables:
  • Sitemap_64b_UC.exe / Sitemap_64b_W2K.exe:
    • Full Unicode support, 64-bit executable.
    • Minimum requirement: Windows XP / 64 bit.
    • Largest memory usage, but can also access much more memory if available.
  • Sitemap_32b_UC.exe / Sitemap_32b_W2K.exe:
    • Full Unicode support, 32-bit executable.
    • Minimum requirement: depending on version, either Windows XP / 32 bit or Windows 2000 / 32 bit.
    • Somewhat lower memory usage.
  • Sitemap_32b_CP.exe / Sitemap_32b_W9xNT4.exe:
    • Partial Unicode support, 32-bit executable.
    • Minimum requirement: Windows 98 / 32 bit.
    • Lowest memory usage.


Crawler Configuration

Changing settings can make a huge difference if you are experiencing problems such as:
  • Website scanning goes very slowly.
  • The webserver drops connections, preventing the crawler from getting all pages.
  • You have large websites with 10,000 or 100,000+ pages you wish to scan.

Lessen overall and/or peak resource usage on crawler computer:
  • Disable Scan website | Data collection | Create log file of website scans
  • Disable Scan website | Data collection | Verify external URLs exist
  • Disable Scan website | Data collection | Store found external URLs
  • Disable Scan website | Data collection | Store redirects, links from and to all pages etc.
  • Disable Scan website | Data collection | Store additional details (e.g. which line in URL content a link is placed)
  • Disable Scan website | Data collection | Inspect URLs to detect language where necessary
  • Disable Scan website | Data collection | Store titles for all pages
  • Disable Scan website | Data collection | Store "meta" description for all pages
  • Disable Scan website | Data collection | Store "meta" keywords for all pages
  • Disable Scan website | Data collection | Store and use "fallback" tags for title and description if necessary
  • Disable Scan website | Data collection | Store anchor text for all links
  • Disable Scan website | Data collection | Store "alt" attribute of all image references
  • Disable Scan website | Crawler options | Crawl error pages
  • Disable Scan website | Crawler options | Allow cookies
  • Disable Scan website | Webmaster filters | Obey "robots.txt" file "crawl delay" directive
  • Consider trying Scan website | Crawler engine | HTTP using WinInet engine and settings (Internet Explorer)
  • In top menu disable Tools | After website scans: Calculate summary data (extended)
  • In top menu disable Tools | After website scans: Open and show data
    • The Create sitemap section will open instead.

Lessen overall resource usage on crawler and webserver computer:
  • In Scan website | Analysis filters and Scan website | Output filters;
    • Configure which directories and pages the website crawler can ignore:
      • Many forums generate pages with similar or duplicate content.
      • Excluding URLs can save a lot of HTTP requests and bandwidth.
      • See help for analysis filters and output filters.
      • The following happens when a URL is excluded:
        • Only by analysis filters: the crawler defaults to HEAD instead of GET for the HTTP request.
        • Only by output filters: the excluded URL is removed after the website crawl finishes.
        • By both analysis and output filters: no HTTP request is made to the excluded URL at all.
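The three filter cases can be summarized in a small sketch; the function and its names are illustrative only, not the tool's actual API:

```python
def request_action(excluded_by_analysis: bool, excluded_by_output: bool) -> str:
    """Map the two exclusion flags to the crawler behavior described above."""
    if excluded_by_analysis and excluded_by_output:
        return "no HTTP request at all"
    if excluded_by_analysis:
        return "HEAD request instead of GET"
    if excluded_by_output:
        return "normal request; URL removed after the crawl finishes"
    return "normal GET request"
```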

Lessen peak resource usage on crawler and webserver computer:
  • In Scan website | Crawler engine
    • Max simultaneous connections (data transfer):
      • Increase if the webserver backend and your own bandwidth can handle more requests.
      • Decrease if the webserver backend (e.g. database queries) is a bottleneck.
    • Max worker threads (transfer, analysis etc.):
      • Increase if your computer's CPU can handle more work.
      • Decrease if you are doing other work on your computer.
    • Adjust timeout options:
      • Milliseconds to wait before a read times out.
      • Milliseconds to wait before a connect times out.
      • Connection attempts before giving up.
      • How to adjust these settings:
        • Increase them if you want to make sure you grab all links and pages.
        • Decrease them if you want the website scan to be as fast as possible.
    • Enable/disable usage of persistent connections, which affects how client and server communicate.
    • Enable/disable GZip compression of all data transferred between crawler and server.
    • In the option Default path type and handler, experiment with using either Indy or WinInet for HTTP.
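The timeout and retry settings interact as sketched below: each fetch attempt waits up to the configured timeout, and the crawler retries a limited number of times before giving up on a URL. Names and defaults here are illustrative, not A1 Sitemap Generator's internals:

```python
# Sketch of timeout-plus-retry fetching: higher timeouts and more
# attempts catch slow pages; lower values finish the scan faster.
import urllib.request
import urllib.error

def fetch_with_retries(url: str, timeout: float = 30.0, attempts: int = 3):
    """Return response bytes, or None once all attempts have failed."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError):
            continue  # connect/read failed or timed out; try again
    return None  # give up on this URL after the retry limit
```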

When using pause / resume website scan functionality:
  • In Scan website | Output filters:
    • See help about output filters.
    • Uncheck Scan website | Output filters | After website scan stops: Remove URLs excluded:
      • Makes the website crawler keep all URLs. This ensures no URL is tested more than once.
      • URLs excluded by anything in output filters are tagged and still ignored when creating sitemaps.
  • In Scan website | Webmaster filters:
    • See help about webmaster filters.
    • Uncheck Scan website | Webmaster filters | After website scan stops: Remove URLs with noindex/disallow:
      • Makes the website crawler keep all URLs. This ensures no URL is tested more than once.
      • URLs excluded by anything in webmaster filters are tagged and still ignored when creating sitemaps.
  • See below concerning GET versus HEAD requests and how this can affect resume efficiency.

Lessen resource usage on webserver computer:
  • Use a content delivery network (CDN for short) to serve content without putting any load on your own server.

Things that can affect resource usage on both server and desktop:
  • In Scan website | Crawler engine
    • Enable/disable using GET requests (instead of HEAD followed by GET). If you plan on using the resume functionality, you may want to enable HEAD requests, as they allow the website crawler to quickly resolve found URLs. This in turn speeds up the count of pages fully analyzed. Due to the internals of the crawler engine, using HEAD can also reduce memory usage.
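The difference can be illustrated with a plain HTTP sketch: a HEAD request returns only the status and headers, so a URL can be verified without transferring its body. This is a generic illustration, not the crawler's internal code:

```python
# Resolve a URL's status and content type via HEAD: no response body
# is transferred, which saves bandwidth compared to a full GET.
import urllib.request

def head(url: str, timeout: float = 10.0):
    """Return (status_code, content_type) from a HEAD request."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.headers.get("Content-Type")
```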

Bonus tip: If you have a large website, it is often a good idea to make a few test scans first. That way you can configure URL exclude filters before starting a major website crawl.


Project Files and Project Data Storage

  • If you have trouble loading a website project with extended website data:
    • Project data is stored in a subdirectory to the project file.
      • Project file: c:\example\projects\demo.ini.
      • Project data directory: c:\example\projects\demo\.
    • Delete project files that contain extended website data:
      • hotarea-normal-ex.xml.
      • hotarea-external-ex.xml.
    • Try loading the project file again.
  • Improve speed and memory usage when saving/loading projects:
    • In top menu Options for save/load project:
      • Use optimized XML data storage.
      • Exclude extended website data.
      • Limit data to sitemap URLs.


Your Computer Configuration

  • The larger the website, the more important memory becomes.
    • Physical memory is much faster than virtual memory.
    • During website scan, lots of data gets stored in memory.
    • If you have a gigantic website, consider increasing memory address space as explained in its own section below.
  • If you have bandwidth problems, you can try the following:
    • Perform website scans during the night when your ISP has few online users.
    • Make sure no download applications are running.


Increase Memory Address Space for 32bit Applications

Increasing the memory address space enables 32-bit Windows versions of products such as our sitemap generator to address more than 2 gigabytes of memory.

  • http://www.microsoft.com/whdc/system/platform/server/PAE/PAEmem.mspx
    - Memory Support and Windows Operating Systems
  • http://blogs.msdn.com/oldnewthing/archive/2004/08/05/208908.aspx
    - The oft-misunderstood /3GB switch
  • http://blogs.msdn.com/oldnewthing/archive/2004/08/10/211890.aspx
    - Myth: Without /3GB a single program can not allocate more than 2GB of virtual memory
  • http://blogs.msdn.com/slavao/archive/2006/03/12/550096.aspx
    - Be Aware: 4GB of VAS under WOW, does it really worth it?
  • http://support.microsoft.com/kb/291988/
    - A description of the 4 GB RAM Tuning feature and the Physical Address Extension switch
  • http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dngenlib/html/msdn_ntvmm.asp
    - The Virtual-Memory Manager in Windows NT (Note: Only for the very technical)
© Copyright 1997-2024 Microsys