A1 Website Scraper Can Handle Most Website Platforms and Problems

The crawler in A1 Website Scraper is like most website crawlers: indifferent to the backend platform used by the website. It does not matter which shopping cart or server-side language, such as PHP or ASP, the site uses.

Note: To see all the available options, you will have to switch off easy mode.

Note: For options that use a dropdown list, the [+] and [-] buttons next to the list add and remove items in the list itself.

Your Website Type and Server Do Not Matter


URL Rewriting and Similar


Your website can use virtual directories and URL rewriting without problems. Many people use Apache mod_rewrite to create virtual directories and URLs, and similar solutions exist for nearly all webservers.

From a crawler perspective, websites can safely use virtual directories and URLs. Browsers, search engine bots and similar tools all view your website from the outside: they have no knowledge of how your URL structure is implemented and cannot tell whether pages or directories are physical or virtual.
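
As an illustration, a minimal Apache mod_rewrite rule might map a clean virtual URL to a physical script. The rule below is a hypothetical sketch, not taken from any particular site:

    # .htaccess: map the virtual URL /products/123 to the
    # physical script product.php?id=123
    RewriteEngine On
    RewriteRule ^products/([0-9]+)/?$ product.php?id=$1 [L]

A crawler requesting /products/123 simply receives the HTML that product.php generates; nothing in the response reveals that the directory is virtual.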


Server Side Language


Whether a website uses ColdFusion, ASP.NET, JSP, PHP or a similar server-side language has no consequence. Website crawlers only see the client-side code (HTML/CSS/Javascript) generated by the server-side code, whatever language that code is written in.

Note: In settings, the crawler in our site analysis tool can be set to accept/ignore URLs with certain file extensions. If you run into problems, read about finding all pages and links.
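
You can confirm this from the outside: whichever server-side language produces a page, an HTTP client only ever receives the generated markup. A minimal sketch in Python using the third-party requests library (the URL is a hypothetical example):

    import requests

    # The .php extension only hints at the server-side language;
    # the response body is plain HTML either way.
    response = requests.get("https://www.example.com/page.php")
    print(response.headers.get("Content-Type"))  # e.g. "text/html; charset=UTF-8"
    print(response.text[:200])                   # generated HTML, never PHP source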


Dynamically Created Content on the Server


Sites that dynamically generate page content using server-side scripts and databases are crawled without problems by site crawlers and robots.

Note: Some search engine robots may slow down when crawling URLs containing ?. That is mainly because search engines are wary of spending crawling resources on large numbers of URLs with auto-generated content. To mitigate this, you can use mod_rewrite or similar on your website, as shown in the example above.

Note: Our website scraper and the MiggiBot crawler engine do not care how URLs look.


Mobile Websites


Many websites nowadays use responsive and adaptive layouts that adjust themselves in the browser using client-side technologies, e.g. CSS and Javascript.

However, some websites have dedicated URLs for:
  • Feature phones that only support WAP and similar old technologies.
  • Smartphones with browsers that are very similar to desktop browsers and render content the same way.
  • Desktop computers, laptops and tablets where the screen area and view port is larger.

Generally, such mobile-optimized websites know they need to output content optimized for mobile devices by either (see the sketch after this list):
  • Assuming they should always output content optimized for a given set of mobile devices, e.g. smartphones.
  • Performing server-side checks on the user agent passed to them by the crawler or browser. If a mobile device is identified, the site will either redirect to a new URL or simply output content optimized for mobile devices.
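
As a rough illustration of the second approach, a server-side handler might inspect the User-Agent header like this. This is a minimal Python/WSGI sketch with a deliberately simplified device check; the m.example.com redirect target is a hypothetical example:

    def application(environ, start_response):
        # Simplified device check: real detection code matches many
        # more user agent patterns than these two.
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if "Mobile" in user_agent or "Android" in user_agent:
            # Either redirect mobile visitors to a dedicated URL...
            start_response("302 Found", [("Location", "https://m.example.com/")])
            return [b""]
        # ...or fall through and output content for desktop devices.
        start_response("200 OK", [("Content-Type", "text/html")])
        return [b"<html><body>Desktop content</body></html>"]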

If you want the A1 Website Scraper crawler to see the mobile content and URLs your website outputs to mobile devices, simply change the setting General options | Internet crawler | User agent ID to one used by popular mobile devices, e.g. this:

Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19

You can do the same with most desktop browsers by installing a user agent switcher plugin. That allows you to inspect the code returned by your website to mobile browsers.
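
The same check can be scripted. Below is a small sketch in Python using the third-party requests library, sending the mobile user agent string quoted above; the URL is a hypothetical example:

    import requests

    MOBILE_UA = ("Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) "
                 "AppleWebKit/535.19 (KHTML, like Gecko) "
                 "Chrome/18.0.1025.133 Mobile Safari/535.19")

    # Request the page the way a mobile browser would, then inspect
    # whether the server redirects or serves mobile-optimized HTML.
    response = requests.get("https://www.example.com/",
                            headers={"User-Agent": MOBILE_UA},
                            allow_redirects=False)
    print(response.status_code)                   # 200, or 301/302 to a mobile URL
    print(response.headers.get("Location", "-"))  # redirect target, if any
    print(response.text[:200])                    # start of the returned HTML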

Note: If your mobile-optimized website uses a mix of client-side and server-side technologies such as AJAX to detect the user agent and alter content accordingly, this will not work with many website crawlers, including, as of September 2014 at least, A1 Website Scraper. It will still work in most browsers, since they execute the Javascript code that queries the browser for the user agent ID.


Verify Website HTML Output to Crawlers and Browsers

Normally, websites never cloak content based on the user agent string or IP address. However, by setting the user agent ID you can check the HTML source that search engines and browsers see when retrieving pages from your website.


Note: This can also be used to test if a website responds correctly to a crawler/browser that identifies itself as a mobile device.
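
If you prefer to verify this outside the program, here is a short sketch along the same lines, in Python with the third-party requests library (the URL is a placeholder; the two user agent IDs are the ones mentioned elsewhere on this page):

    import requests

    URL = "https://www.example.com/"
    AGENTS = {
        "crawler": "Googlebot/2.1 (+http://www.google.com/bot.html)",
        "browser": "Mozilla/4.0 (compatible; MSIE 7.0; Win32)",
    }

    # Fetch the page once per user agent; differing bodies suggest the
    # site serves different content to crawlers and to browsers.
    bodies = {name: requests.get(URL, headers={"User-Agent": ua}).text
              for name, ua in AGENTS.items()}
    if bodies["crawler"] == bodies["browser"]:
        print("Same HTML served to both user agents")
    else:
        print("Content differs: possible cloaking or user agent detection")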


How to Successfully and Completely Scan Your Website

If you are experiencing website crawling problems, you should read the answer about finding all links and pages in website scans.

You should also make sure that no firewall or internet security software is blocking A1 Website Scraper.


Problematic Websites and Specific Website Platforms


Website bandwidth filtering and/or throttling


A few website platforms take measures against crawlers they do not recognize, to reserve bandwidth and server resources for real visitors and search engines. Here is a list of known solutions for those website platforms:
  • If you are trying to crawl a forum, check our guide to optimal crawling of forums and blogs with a website scraper.


Website erratically sends the wrong page content


We have seen a few cases where the website, server, CDN, CMS or cache system suffered from a bug and sent the wrong page content when being crawled.

To prove and diagnose such a problem, download and configure A1 Website Download like this:
  • Set Scan website | Download options | Convert URL paths in downloaded content to no conversion.
  • Enable Scan website | Data collection | Store redirects and links from and to all pages.
  • Enable all options in Scan website | Webmaster filters.

You can now compare the downloaded page source code with what is reported in A1 Website Download, and see whether the webserver/website sent correct or incorrect content to the A1 crawler engine.

To solve such an issue without access to the website and webserver code, try using some of the configurations suggested below.
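
To reproduce such behavior outside the program, you can fetch the same URL repeatedly and compare the responses. The following is a minimal sketch in Python with the third-party requests library; the URL, request count and pause are arbitrary examples:

    import hashlib
    import time

    import requests

    URL = "https://www.example.com/some-page/"

    # Fetch the same URL several times. More than one distinct hash
    # means the server, CDN or cache returned different content for
    # identical requests.
    hashes = set()
    for _ in range(10):
        body = requests.get(URL).content
        hashes.add(hashlib.sha256(body).hexdigest())
        time.sleep(1)  # small pause between requests
    print(len(hashes), "distinct response(s) across 10 requests")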


General Solutions for Crawling Problematic Websites

  • Set Scan website | Crawler engine | Max simultaneous connections to one.
  • Set Crawler engine | Advanced engine settings | Default to GET requests to checked/enabled.
  • Then, if necessary, have the webcrawler identify itself as a search engine or as a user surfing.
    • Identify as "user surfing website":
      • Set General options | Internet crawler | User agent ID to Mozilla/4.0 (compatible; MSIE 7.0; Win32).
      • Set Scan website | Webmaster filters | Download "robots.txt" to unchecked/disabled.
      • In Scan website | Crawler engine increase the amount of time between active connections.
      • Optional: Set Scan website | Crawler engine to HTTP using WinInet engine and settings (Internet Explorer)
    • Identify as "search engine crawler":
      • Set General options | Internet crawler | User agent ID to Googlebot/2.1 (+http://www.google.com/bot.html) or another search engine crawler ID.
      • Set Scan website | Webmaster filters | Download "robots.txt" to checked/enabled.
      • Set Scan website | Webmaster filters | Obey "robots.txt" file "disallow" directive to checked/enabled.
      • Set Scan website | Webmaster filters | Obey "robots.txt" file "crawl-delay" directive to checked/enabled.
      • Set Scan website | Webmaster filters | Obey "meta" tag "robots" noindex to checked/enabled.
      • Set Scan website | Webmaster filters | Obey "meta" tag "robots" nofollow to checked/enabled.
      • Set Scan website | Webmaster filters | Obey "a" tag "rel" nofollow to checked/enabled.

If you continue to have problems, you can combine the above with the following (a sketch of the resulting crawler behavior follows the list):
  • Set Scan website | Crawler engine | Crawl-delay in milliseconds between connections to at least 3000.
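
The combined effect of the "search engine crawler" settings above roughly resembles the following Python sketch, which uses the standard library urllib.robotparser together with the third-party requests library; the domain and URL list are placeholders:

    import time
    import urllib.robotparser

    import requests

    USER_AGENT = "Googlebot/2.1 (+http://www.google.com/bot.html)"
    CRAWL_DELAY_SECONDS = 3  # mirrors "at least 3000" milliseconds

    # Download and obey the "disallow" directives in robots.txt.
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url("https://www.example.com/robots.txt")
    robots.read()

    for url in ["https://www.example.com/", "https://www.example.com/page/"]:
        if not robots.can_fetch(USER_AGENT, url):
            continue  # skip disallowed URLs
        response = requests.get(url, headers={"User-Agent": USER_AGENT})
        print(url, response.status_code)
        time.sleep(CRAWL_DELAY_SECONDS)  # one connection at a time, delayed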

You can also download our default project file for problematic websites, as the same solutions often apply to a wide range of websites.