Microsys
  

Crawler Handles Website Platforms and Problems

The crawler in A1 Website Scraper is, like most website crawlers, indifferent to the back-end platform used by the website. It does not matter which shopping cart or server-side language, such as PHP or ASP, the site uses.

Your Website Type and Server Do Not Matter


URL Rewriting and Similar


Your website can use virtual directories or URL rewriting without problems. Many people use Apache mod_rewrite to create virtual directories and URLs. Similar solutions exist for nearly all web servers.

From a crawler's perspective, websites can safely use virtual directories and URLs. Browsers, search engine bots etc. all view your website from the outside. They have no knowledge of how your URL structure is implemented, and they cannot tell whether pages or directories are physical or virtual.


Server Side Language and HTML Web Pages


In a modern website, there is often little or no correlation between the URL "file names" and the underlying data, including how it is generated, stored and retrieved. It does not matter if a website uses ColdFusion, ASP.NET, JSP, PHP or similar as its server-side programming language. Website crawlers only see the client-side code (HTML/CSS/JavaScript) generated by the code and databases on the server.
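
To illustrate that a crawler works only with the HTML it receives, here is a minimal Python sketch of link extraction. The names `LinkCollector` and `extract_links` and the sample markup are made up for this illustration; this is not how A1 Website Scraper is implemented internally.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper for illustration)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only the generated HTML matters here, not the server-side language.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in the given HTML source."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# A rewritten "virtual" URL and a script URL look like ordinary links.
sample = '<a href="/products/42">Rewritten</a> <a href="page.php?id=42">Script</a>'
links = extract_links(sample)
```

Both links are collected the same way, whether the URL is virtual or maps to a physical script.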

Note: In settings, the crawler in our site analysis tool can be set to accept/ignore URLs with certain file extensions and MIME content types. If you have trouble, read about finding all pages and links.


Dynamically Created Content on Server


Sites that dynamically generate page content using server-side scripts and databases are crawled without problems by site crawlers and robots.

Note: Some search engine robots may slow down when crawling URLs that contain ?. However, that is mainly because search engines are wary of spending crawl resources on lots of URLs with auto-generated content. To mitigate this, you can use mod_rewrite or similar on your website.

Note: Our website scraper and the MiggiBot crawler engine do not care about how URLs look.


Mobile Websites


Many websites nowadays use responsive and adaptive layouts that adjust themselves in the browser using client-side technologies, e.g. CSS and JavaScript.

However, some websites have special website URLs for:
  • Feature phones that only support WAP and similar old technologies.
  • Smartphones with browsers that are very similar to desktop browsers and render content the same way.
  • Desktop computers, laptops and tablets where the screen area and viewport are larger.

Generally, such mobile-optimized websites decide that they need to output content optimized for mobile devices by either:
  • Assuming they should always output content optimized for a given set of mobile devices, e.g. smartphones.
  • Performing server-side checks on the user agent passed by the crawler or browser. If a mobile device is identified, the website either redirects to a new URL or simply outputs content optimized for mobile devices.
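
The server-side check described above can be sketched roughly as follows in Python. This is a simplified illustration only: the token list, the function names and the `m.example.com` host are assumptions made for the example, and real websites often use far more elaborate device detection.

```python
# Hypothetical substrings a website might look for in the user agent string.
MOBILE_TOKENS = ("Mobile", "Android", "iPhone", "Windows Phone")


def is_mobile_user_agent(user_agent):
    """Return True if the user agent string contains a known mobile token."""
    return any(token in user_agent for token in MOBILE_TOKENS)


def choose_response(user_agent, path):
    """Return (status, target): redirect mobile devices, serve others directly."""
    if is_mobile_user_agent(user_agent):
        # Assumed mobile host, used purely for illustration.
        return 302, "http://m.example.com" + path
    return 200, path


mobile = choose_response(
    "Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) "
    "AppleWebKit/535.19 (KHTML, like Gecko) "
    "Chrome/18.0.1025.133 Mobile Safari/535.19",
    "/somepage",
)
desktop = choose_response("Mozilla/4.0 (compatible; MSIE 7.0; Win32)", "/somepage")
```

Because the decision is made from the user agent string alone, a crawler that sends a mobile user agent receives the same redirect or content as a mobile browser.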

If you want the A1 Website Scraper crawler to see the mobile content and URLs your website outputs to mobile devices, simply change the setting General options and tools | Internet crawler | User agent ID to one used by popular mobile devices, e.g. this:

Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19

You can do the same with most desktop browsers by installing a user agent switcher plugin. That allows you to inspect the code returned by your website to mobile browsers.
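
If you prefer a script over a browser plugin, the same inspection can be sketched with Python's standard library. The function names `build_request` and `fetch_html` are hypothetical, and `http://example.com/somepage` is a placeholder URL.

```python
import urllib.request

# The mobile user agent string mentioned in this help page.
MOBILE_UA = (
    "Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) "
    "AppleWebKit/535.19 (KHTML, like Gecko) "
    "Chrome/18.0.1025.133 Mobile Safari/535.19"
)


def build_request(url, user_agent=MOBILE_UA):
    """Create a request that identifies itself with the given user agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, user_agent=MOBILE_UA, timeout=10):
    """Download the page source the server returns for this user agent."""
    with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Building the request does not touch the network; fetch_html() does.
request = build_request("http://example.com/somepage")
```

Calling `fetch_html` with a desktop user agent and then with `MOBILE_UA` lets you compare the two outputs by hand.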

Note: If your mobile-optimized website uses a mix of client-side and server-side technologies, such as AJAX, to detect the user agent and alter content based on it, this will not work with many website crawlers including, as of September 2014 at least, A1 Website Scraper. However, it will work in most browsers since they execute JavaScript code, which can query the browser for the user agent ID.


AJAX Websites


If your website uses AJAX, a technology where JavaScript communicates with the server and alters the content in the browser without changing the URL, it is worth knowing that crawlability depends on the exact implementation.

Explanation of fragments in URLs:
  1. Page-relative-fragments: Relative links within a page:
    http://example.com/somepage#relative-page-link
  2. AJAX-fragments: Client-side JavaScript that queries server-side code and replaces content in the browser:
    http://example.com/somepage#lookup-replace-data
  3. AJAX-fragments-Google-initiative: Part of the Google initiative Making AJAX Applications Crawlable:
    http://example.com/somepage#!lookup-replace-data

    This solution has since been deprecated by Google themselves. For more information see:
    https://developers.google.com/webmasters/ajax-crawling/docs/getting-started
    https://developers.google.com/webmasters/ajax-crawling/docs/specification

    If you use this solution, you will see URLs containing #! and _escaped_fragment_ when crawled.
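
As a rough illustration of how the #! and _escaped_fragment_ forms relate under that deprecated scheme, the following Python sketch converts one into the other. The function names are made up for this example, and the escaping rule reflects our reading of the Google specification, so treat it as an assumption rather than a reference implementation.

```python
from urllib.parse import urlsplit, urlunsplit


def _escape_fragment_value(value):
    """Percent-encode the characters the scheme reserves (assumed rule set)."""
    escaped = []
    for byte in value.encode("utf-8"):
        # Control characters, space, non-ASCII bytes and # % & + are escaped.
        if byte <= 0x20 or byte >= 0x7F or byte in b"#%&+":
            escaped.append(f"%{byte:02X}")
        else:
            escaped.append(chr(byte))
    return "".join(escaped)


def to_escaped_fragment(url):
    """Map a #! URL to its _escaped_fragment_ form; leave other URLs alone."""
    parts = urlsplit(url)
    if not parts.fragment.startswith("!"):
        return url  # page-relative or plain AJAX fragment
    parameter = "_escaped_fragment_=" + _escape_fragment_value(parts.fragment[1:])
    query = parts.query + "&" + parameter if parts.query else parameter
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


crawl_url = to_escaped_fragment("http://example.com/somepage#!lookup-replace-data")
```

Only the third fragment type is rewritten; the first two are returned unchanged because a server never receives the part after #.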

Tip: To help successfully crawl AJAX websites:
  • Select an AJAX enabled crawler option in Scan website | Crawler engine.
  • Enable option Scan website | Crawler options | Try search inside Javascript.
  • Enable option Scan website | Crawler options | Try search inside JSON.
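
To give an idea of what searching inside JSON can mean, here is a small Python sketch that walks a JSON document and collects strings that look like URLs. It is only an assumption about the general technique; the function name `find_urls_in_json` and the matching rule are hypothetical and do not describe the actual A1 Website Scraper implementation.

```python
import json


def find_urls_in_json(text):
    """Return strings in a JSON document that look like absolute or root-relative URLs."""
    found = []

    def walk(node):
        if isinstance(node, dict):
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)
        elif isinstance(node, str) and node.startswith(("http://", "https://", "/")):
            found.append(node)

    walk(json.loads(text))
    return found


payload = '{"items": [{"link": "http://example.com/a"}, {"count": 1}], "next": "/page/2"}'
urls = find_urls_in_json(payload)
```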


Verify Website HTML Output to Crawlers and Browsers

Normally, websites do not cloak content based on user agent string or IP address. However, by setting the user agent ID you can check the HTML source that search engines and browsers see when retrieving pages from your website.

Note: This can also be used to test if a website responds correctly to a crawler/browser that identifies itself as being mobile.
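
A scripted version of this check could look like the Python sketch below. It assumes you supply your own `fetch(url, user_agent)` function that returns the page source; the function names and the fake fetcher are hypothetical. Note that dynamic pages can differ between two requests for unrelated reasons (timestamps, session tokens), so a difference is a hint to investigate, not proof of cloaking.

```python
import hashlib


def fingerprint(html):
    """Return a short, comparable digest of a page's source code."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def fingerprints_by_user_agent(fetch, url, user_agents):
    """Fetch the URL once per user agent and fingerprint each response."""
    return {label: fingerprint(fetch(url, agent)) for label, agent in user_agents.items()}


def content_varies(fingerprints):
    """True if at least two user agents received different page source."""
    return len(set(fingerprints.values())) > 1


# Stand-in for a real downloader so the sketch runs without network access.
def fake_fetch(url, user_agent):
    return "<p>mobile</p>" if "Mobile" in user_agent else "<p>desktop</p>"


agents = {
    "desktop": "Mozilla/4.0 (compatible; MSIE 7.0; Win32)",
    "crawler": "Googlebot/2.1 (+http://www.google.com/bot.html)",
    "mobile": "Mozilla/5.0 (Linux; Android 4.0.4) Mobile Safari/535.19",
}
result = fingerprints_by_user_agent(fake_fetch, "http://example.com/", agents)
```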


How to Successfully and Completely Scan Your Website

If you are experiencing website crawling problems, you should read about finding all links and pages in website scans.

You should also make sure that there is not any firewall or internet security software blocking A1 Website Scraper.


Problematic Websites and Specific Website Platforms


Website Bandwidth Filtering and/or Throttling


A few website platforms and modules take measures against crawlers they do not recognize in order to reserve bandwidth and server resources for real visitors and search engines. Here is a list of known solutions for those website platforms:
  • If you are trying to crawl a forum, check our guide to optimal crawling of forums and blogs with website scraper.


Website Erratically Sends The Wrong Page Content


We have seen a few cases where the website, server, CDN, CMS or cache system suffered from a bug and sent the wrong output page content when being crawled.

To prove and diagnose such a problem, download and configure A1 Website Download like this:
  • Set Scan website | Download options | Convert URL paths in downloaded content to no conversion.
  • Enable Scan website | Data collection | Store redirects and links from and to all pages.
  • Enable all options in Scan website | Webmaster filters.

You can now compare the downloaded page source code with what is reported in A1 Website Download, and see whether the webserver/website sent correct or incorrect content to the A1 crawler engine.

To solve such an issue without access to the website and webserver code, try using some of the configurations suggested further down.


General Solutions for Crawling Problematic Websites

If you encounter a website that throttles crawler requests, blocks certain user agents or is very slow, you will often get response codes such as:
  • 403 : Forbidden
  • 503 : Service Temporarily Unavailable
  • -5 : TimeoutConnectError
  • -6 : TimeoutReadError

To solve these try the following:
  • Set Scan website | Crawler engine | Max simultaneous connections to one.
  • Set Crawler engine | Advanced engine settings | Default to GET requests to checked/enabled.
  • Then, if necessary, have the webcrawler identify itself as a search engine or as a user surfing.
    • Identify as "user surfing website":
      • Set General options and tools | Internet crawler | User agent ID to Mozilla/4.0 (compatible; MSIE 7.0; Win32).
      • Set Scan website | Webmaster filters | Download "robots.txt" to unchecked/disabled.
      • In Scan website | Crawler engine increase the amount of time between active connections.
      • Optional: Set Scan website | Crawler engine to HTTP using WinInet engine and settings (Internet Explorer).
    • Identify as "search engine crawler":
      • Set General options and tools | Internet crawler | User agent ID to Googlebot/2.1 (+http://www.google.com/bot.html) or another search engine crawler ID.
      • Set Scan website | Webmaster filters | Download "robots.txt" to checked/enabled.
      • Set Scan website | Webmaster filters | Obey "robots.txt" file "disallow" directive to checked/enabled.
      • Set Scan website | Webmaster filters | Obey "robots.txt" file "crawl-delay" directive to checked/enabled.
      • Set Scan website | Webmaster filters | Obey "meta" tag "robots" noindex to checked/enabled.
      • Set Scan website | Webmaster filters | Obey "meta" tag "robots" nofollow to checked/enabled.
      • Set Scan website | Webmaster filters | Obey "a" tag "rel" nofollow to checked/enabled.
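
To show what obeying "disallow" and "crawl-delay" directives involves, here is a short Python sketch using the standard library robots.txt parser. The robots.txt content and the crawler name `ExampleCrawler` are invented for the example; it only illustrates the rules, not how A1 Website Scraper processes them.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 3
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A crawler that obeys "disallow" skips URLs the rules exclude.
allowed_public = parser.can_fetch("ExampleCrawler", "http://example.com/public/page.html")
allowed_private = parser.can_fetch("ExampleCrawler", "http://example.com/private/page.html")

# A crawler that obeys "crawl-delay" waits this many seconds between requests.
delay_seconds = parser.crawl_delay("ExampleCrawler")
```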

Note: If you continue to have problems, you can combine the above with:
  • Set Scan website | Crawler engine | Crawl-delay in milliseconds between connections to at least 3000.

Note: If your IP address has been blocked, you can try using General options and tools | Internet crawler | HTTP proxy settings. Proxy support depends on which HTTP engine has been selected in Scan website | Crawler engine.

Note: If the problem is timeout errors, you can also try repeating crawls with the resume scan functionality.
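
The general idea behind these settings (one connection at a time, a pause between requests, and retrying URLs that failed) can be sketched in Python as follows. The function `polite_fetch`, its parameters and the choice of which status codes to retry are assumptions for this illustration; you pass in your own `fetch(url)` function that returns a `(status, body)` pair.

```python
import time

# Response codes from the list above that are worth retrying after a pause.
RETRY_STATUSES = {403, 503}


def polite_fetch(fetch, urls, delay_seconds=3.0, max_attempts=3, sleep=time.sleep):
    """Fetch URLs sequentially, pausing after every request and retrying failures."""
    results = {}
    for url in urls:
        status, body = None, None
        for _ in range(max_attempts):
            try:
                status, body = fetch(url)
            except TimeoutError:
                # Comparable to a connect or read timeout; try again later.
                status, body = None, None
            sleep(delay_seconds)  # wait before the next request of any kind
            if status is not None and status not in RETRY_STATUSES:
                break
        results[url] = (status, body)
    return results
```

Because requests never overlap, this corresponds to a single simultaneous connection; the `sleep` parameter exists only so the pause can be replaced in tests.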

You can also download our default project file for problematic websites, as the same solutions can often be applied to a wide range of websites.
This help page is maintained by
As one of the lead developers, his hands have touched most of the code in the software from Microsys. If you email any questions, chances are that he will be the one answering.
 © Copyright 1997-2024 Microsys
