Website Scraper Crawl Does Not Find All URLs

What to do if a website crawl by the website scraper program finds only a few links.

 To see all the options available, you will have to switch off easy mode 

 With options that use a dropdown list, any [+] or [-] button next to it adds or removes items in the list itself 

When Website Scraper Program Finds Too Few or Odd Page URLs

First read how A1 Website Scraper helps find website linking problems. Then go through this checklist:

  • Using a firewall program? If the website scan returns only a few URLs and they all have response code -4 : CommError, you need to configure the firewall.

  • Are you mixing www and non-www usage in website links and redirects? Check the externals tab to find out.

  • Does your website use website cloaking, i.e. change content depending on the user agent string used by the crawler? Then change the user agent string the crawler uses to identify itself: General options | Internet crawler | User agent ID.

  • Does your website, or do pages in it, redirect to or get content from another domain (e.g. through <frame> or <iframe>)? Check the externals tab to find out.

  • Is an entire section of pages hidden and not linked at all from the other parts of the website? In this case, having cross-linked all hidden pages is no help! To solve this, you can use multiple start search paths.

  • Does the website rely on JavaScript or uncommon types of HTML link tags for website navigation, e.g. <iframe>, <form> and <button>? Solution: Enable checking these for links in Scan website | Crawler options.

  • Does the website use // instead of / in links? And does the webserver not respond with an error or redirect in such cases? And does the problem cascade if the page URL is linked using relative paths? Solution: Configure Scan website | Crawler options to handle this situation.

  • Does the website have a dynamic page that generates unique links based on input from GET query string (?) data? This can sometimes cause an endless loop of unique URLs!

  • Various exclude filters in program configuration and website:
    Remember: If you exclude some page URLs from analysis, the crawler does not find or follow the links on those pages. URLs that are linked only from excluded pages will therefore never be found or analyzed.

    Besides URL filtering support, you can also configure when filtered URLs are removed in Scan website | Crawler options | Apply "webmaster" and "output filters" after website scan stops.

    Switching the above setting off makes it easier to locate possible reasons for missing URLs, e.g. if you are excluding important URLs. The Crawler state flags section for each URL gives important information.

    To find out why a page URL is missing in regular scans, do this:
    1. Switch off Scan website | Crawler options | Apply "webmaster" and "output filters" after website scan stops and run a full website scan.
    2. Check whether the normally missing page URL is still not found. If it is found, check its crawler state flags to see whether it is flagged as excluded and due to be removed.
    3. If the page is not found, repeat the above step with page URLs you believe link to the missing page URL.
    4. When you have located a page found by A1 Website Scraper that you believe links to the missing page URL, check its HTML source.

    [Image: crawler state flags]

  • Are you scanning a website subdirectory which contains no links to pages within that directory? Check the externals tab to find out.

  • Consider whether your website uses non-standard file extensions. If you know which ones, you can add them:

    [Image: sitemapper crawl file extensions]

    [Image: sitemapper list file extensions]

    Alternatively, clear all file extensions in the analysis and output filters, but keep the default MIME filters in both places. Then try the scan again.

  • Do you have directories with response code 0 : VirtualItem in scan results? Check the information about internal website linking.

  • Are there many URLs with errors in the website scan results? If the webserver is causing some URLs to give error response codes, e.g. because of server bandwidth throttling, you can try resuming the scan until all errors are gone. This will most likely lead to more links and pages being found.

    Another way to resolve URLs with error responses is to experiment with the options found in Scan website | Crawler engine | Advanced engine settings. Settings that often help: increasing timeout values, using GET only, and enabling or disabling GZip/deflate support.

    [Image: website scraper unstable servers]
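
To illustrate the checklist items on hidden sections and exclude filters, here is a minimal sketch of a breadth-first crawl over a hypothetical in-memory link graph. It is not how A1 Website Scraper is implemented; the LINKS graph, the crawl function, and its start_paths and excluded parameters are all invented for illustration. It shows that pages linked only from an excluded page are never found, and that an unlinked section is found only when one of its pages is given as an additional start path.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
LINKS = {
    "/": ["/products/", "/about/"],
    "/products/": ["/products/a", "/products/b"],
    "/products/a": [],
    "/products/b": [],
    "/about/": [],
    # Hidden section: cross-linked internally, but not linked from "/".
    "/hidden/": ["/hidden/x"],
    "/hidden/x": ["/hidden/"],
}


def crawl(start_paths, excluded=()):
    """Return the set of pages a simple breadth-first crawl discovers.

    Excluded pages are neither recorded nor followed, so pages that
    are linked only from them stay unknown to the crawl.
    """
    found = set()
    queue = deque(start_paths)
    while queue:
        page = queue.popleft()
        if page in found or page in excluded:
            continue
        found.add(page)
        queue.extend(LINKS.get(page, []))
    return found


default_scan = crawl(["/"])
filtered_scan = crawl(["/"], excluded={"/products/"})
multi_start_scan = crawl(["/", "/hidden/"])

print(sorted(default_scan))      # no /hidden/ pages
print(sorted(filtered_scan))     # /products/a and /products/b are missing too
print(sorted(multi_start_scan))  # hidden section is now included
```

In this toy model, excluding /products/ also hides /products/a and /products/b even though they are not excluded themselves, which is the situation the troubleshooting steps above help track down.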

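Some of the URL-level items in the checklist (mixed www and non-www hosts, // in links, and dynamic pages that generate many unique query-string links) can also be checked on a plain list of URLs outside the program. The following is a rough sketch using only Python's standard urllib.parse module; the sample URLs and the helper names hosts_used, has_double_slash and query_variants are hypothetical and not part of A1 Website Scraper.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical sample of URLs, e.g. copied from scan results.
urls = [
    "http://example.com/",
    "http://www.example.com/shop/",
    "http://example.com/shop//item.html",
    "http://example.com/calendar?month=1",
    "http://example.com/calendar?month=2",
    "http://example.com/calendar?month=3",
]


def hosts_used(urls):
    """Count how often each host name occurs (www vs non-www mixing)."""
    return Counter(urlsplit(u).netloc for u in urls)


def has_double_slash(url):
    """Return True if the path part of the URL contains '//'."""
    return "//" in urlsplit(url).path


def query_variants(urls):
    """Count, per host and path, the URLs that carry a query string."""
    counts = Counter()
    for u in urls:
        parts = urlsplit(u)
        if parts.query:
            counts[parts.netloc + parts.path] += 1
    return counts


print(hosts_used(urls))
print([u for u in urls if has_double_slash(u)])
print(query_variants(urls))
```

More than one host name in the first result suggests www and non-www mixing, and a path with a high count in the last result is a candidate for a page that keeps generating unique links.
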
This help page is maintained by

As one of the lead developers, his hands have touched most of the code in the software from Microsys.

If you email any questions, chances are that he will be the one answering them.
About A1 Website Scraper

Extract data from sites into CSV files. By scraping websites, you can grab data on websites and transform it into CSV files ready to be imported anywhere, e.g. SQL databases.
 © Copyright 1997-2018 Microsys
