Microsys
        

Problematic Website Platforms and Website Download Crawler

What Website Crawlers Cannot Tell

  • Using virtual directories with URL rewriting? Site crawlers do not care if directories in URLs are physical on the disk or virtual.

  • Whether a website uses ColdFusion, ASP.NET, JSP, PHP etc. as its server-side language has no consequence. Website crawlers only see the HTML generated by the server-side language.

    Note: In settings, the crawler in our site analysis tool can be set to accept/ignore URLs with certain file extensions. If you have trouble, read about finding all pages and links.

  • Sites dynamically generated by scripts and databases are crawled without problems by site crawlers and robots.

    Note: Some search engine robots may slow down when crawling URLs that contain ?. However, that is mainly because search engines are wary of spending crawl resources on lots of URLs with auto-generated content. To mitigate this, you can use mod_rewrite or similar on your website.

    Note: Our website download and the MiggiBot crawler engine do not care about how URLs look.
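To illustrate the points above, here is a minimal sketch in Python of how a crawler might look at URLs: only the path's file extension and the returned HTML matter, not whether the directories exist on disk or which server-side language produced the page. The function names and the example.com URLs are hypothetical and are not part of any Microsys tool.

```python
import posixpath
from urllib.parse import urlsplit


def url_extension(url):
    """Return the lowercase file extension of the URL path, or '' if there is none."""
    path = urlsplit(url).path  # the query string (?id=5) is not part of the path
    return posixpath.splitext(path)[1].lower()


def is_accepted(url, ignored_extensions):
    """Mimic an accept/ignore filter based on file extension (illustrative only)."""
    return url_extension(url) not in ignored_extensions
```

With this sketch, a rewritten URL such as /shop/item/5/ simply has no extension, while /shop/item.php?id=5 has the extension .php. Both are ordinary URLs to the crawler unless a filter explicitly ignores them.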



Verify Website HTML Output to Crawlers and Browsers

Normally, websites do not cloak content based on user agent string and IP address. However, by setting the user agent ID you can check the HTML source that search engines and browsers see when retrieving pages from your website.
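Outside the program, the same check can be approximated with a short Python script. This is only a sketch using the standard library: the function names are hypothetical, the two user agent strings are the ones listed in the settings further down this page, and it can only reveal differences based on the user agent string, not cloaking based on IP address.

```python
import hashlib
import urllib.request

# User agent strings quoted from the settings list on this page.
USER_AGENTS = {
    "browser": "Mozilla/4.0 (compatible; MSIE 7.0; Win32)",
    "search engine": "Googlebot/2.1 (+http://www.google.com/bot.html)",
}


def build_request(url, user_agent):
    """Create a GET request that identifies itself with the given user agent."""
    return urllib.request.Request(
        url, headers={"User-Agent": user_agent}, method="GET"
    )


def fetch_html(url, user_agent, timeout=30):
    """Download one page and return its body as text."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def fingerprint(html):
    """Short hash of the HTML so two responses can be compared at a glance."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()[:16]


def compare_outputs(url):
    """Return {label: fingerprint} for the page as seen by each user agent."""
    return {
        label: fingerprint(fetch_html(url, user_agent))
        for label, user_agent in USER_AGENTS.items()
    }
```

Calling compare_outputs with one of your own URLs returns one fingerprint per user agent. Different fingerprints only suggest that the server may respond differently; pages with dynamic content such as timestamps or session IDs can differ between any two requests, so the HTML itself should be inspected before drawing conclusions.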



How to Successfully and Completely Scan Your Website

If you are experiencing website crawling problems, you should read our answer about finding all links and pages in a website scan.


Problematic Websites and Specific Website Platforms

A few website platforms take measures against crawlers they do not recognize, in order to reserve bandwidth and server resources for real visitors and search engines. Here is a list of known solutions for those website platforms:
If you are trying to crawl a forum, check our guide to optimal crawling of forums and blogs with website download.


General Solutions to Problematic Websites

  • Set Scan website | Crawler engine | Max simultaneous connections to one.
  • Set Crawler engine | Advanced engine settings | Default to GET requests to checked/enabled.
  • Then, if necessary, have the webcrawler identify itself as a search engine or as a user surfing.

    • Identify as "user surfing website":
      • Set General options | Internet crawler | User agent ID to Mozilla/4.0 (compatible; MSIE 7.0; Win32).
      • Set Scan website | Webmaster filters | Download "robots.txt" to unchecked/disabled.
    • Identify as "search engine crawler":
      • Set General options | Internet crawler | User agent ID to Googlebot/2.1 (+http://www.google.com/bot.html) or another search engine crawler ID.
      • Set Scan website | Webmaster filters | Download "robots.txt" to checked/enabled.
      • Set Scan website | Webmaster filters | Obey "robots.txt" file "disallow" directive to checked/enabled.
      • Set Scan website | Webmaster filters | Obey "robots.txt" file "crawl-delay" directive to checked/enabled.
      • Set Scan website | Webmaster filters | Obey "meta" tag "robots" noindex to checked/enabled.
      • Set Scan website | Webmaster filters | Obey "meta" tag "robots" nofollow to checked/enabled.
      • Set Scan website | Webmaster filters | Obey "a" tag "rel" nofollow to checked/enabled.
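To make the effect of the two identities more concrete, the following Python sketch models them as simple profiles and applies robots.txt rules only for the search engine profile. It is an illustration under stated assumptions, not how the crawler engine is implemented: the dictionary keys and function names are hypothetical, and robots.txt handling is delegated to Python's standard urllib.robotparser module.

```python
import urllib.robotparser

# User agent IDs quoted from the settings list above.
BROWSER_ID = "Mozilla/4.0 (compatible; MSIE 7.0; Win32)"
SEARCH_ENGINE_ID = "Googlebot/2.1 (+http://www.google.com/bot.html)"


def make_profile(identify_as):
    """Bundle the settings above; the keys are illustrative, not real option names."""
    as_search_engine = identify_as == "search engine"
    return {
        "user_agent": SEARCH_ENGINE_ID if as_search_engine else BROWSER_ID,
        "obey_robots": as_search_engine,  # download and obey robots.txt
        "max_connections": 1,             # one simultaneous connection
        "method": "GET",                  # default to GET requests
    }


def parse_robots(robots_txt):
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(profile, parser, url):
    """Apply "disallow" rules only when the profile obeys robots.txt."""
    if not profile["obey_robots"]:
        return True
    return parser.can_fetch(profile["user_agent"], url)


def delay_seconds(profile, parser):
    """Return the "crawl-delay" to wait between requests, or 0 if none applies."""
    if not profile["obey_robots"]:
        return 0
    return parser.crawl_delay(profile["user_agent"]) or 0
```

A crawl loop built on this sketch would request pages one at a time, skip URLs for which may_fetch returns False, and sleep for delay_seconds between requests. The "meta" tag and "rel" nofollow settings are not covered here because they require parsing the downloaded HTML.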

You can also download our default project file for problematic websites.

If your problem is still not solved, you can check the solutions used for e.g. NetSuite websites. You can often apply the same solutions to a wide range of websites.


 © Copyright 1997-2012 Microsys