Microsys

Same Website Crawl: Static vs Dynamic Database (e.g. PHP or ASP)

Website Crawlers Cannot Tell

  • Using virtual directories with URL rewriting? Site crawlers do not care if directories in URLs are physical on the disk or virtual.

  • It makes no difference whether a website uses ColdFusion, ASP.NET, JSP, PHP or another server-side language. Website crawlers only see the HTML generated by the server-side language.

    Note: In settings, the crawler in our site analysis tool can be set to accept or ignore URLs with certain file extensions. If you have trouble, read about finding all pages and links.

  • Sites dynamically generated by scripts and databases are crawled without problems by site crawlers and robots.

    Note: Some search engine robots may slow down when crawling URLs that contain ?. This is mainly because search engines are wary of spending crawl resources on large numbers of URLs with auto-generated content. To mitigate this, you can use mod_rewrite or a similar URL rewriting mechanism on your website.

    Note: Our sitemap generator and the MiggiBot crawler engine do not care how URLs look.
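
To make the difference between a query-string URL and a rewritten URL concrete, here is a minimal Python sketch that splits a URL into the parts a crawler works with. The `describe_url` helper and the example.com URLs are hypothetical and only serve as illustration; they are not part of any Microsys tool.

```python
from urllib.parse import urlsplit, parse_qsl


def describe_url(url):
    """Split a URL into the parts a crawler can observe from the outside."""
    parts = urlsplit(url)
    return {
        "path": parts.path,
        # A "?" followed by parameters is what some robots treat cautiously.
        "has_query": bool(parts.query),
        "params": dict(parse_qsl(parts.query)),
    }


# Hypothetical URLs: the same page served with and without URL rewriting.
dynamic = describe_url("http://example.com/product.php?id=42&lang=en")
rewritten = describe_url("http://example.com/product/42/en/")

print(dynamic)
print(rewritten)
```

Both URLs can be served by the same script and database. From the outside, the only visible difference is whether the URL carries a query string.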



Verify Website HTML Output to Crawlers and Browsers

Normally, websites do not cloak content based on user agent string or IP address. However, by setting the user agent ID, you can check the HTML source that search engines and browsers see when retrieving pages from your website.
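
Outside of the crawler settings, the same check can be sketched in a few lines of Python using only the standard library. The function names below (`build_request`, `fetch_html`, `find_cloaking`) are made up for this illustration. The two user agent strings are the ones quoted in this article's settings list.

```python
import urllib.request

# User agent strings taken from this article; any other IDs could be used.
USER_AGENTS = {
    "browser": "Mozilla/4.0 (compatible; MSIE 7.0; Win32)",
    "crawler": "Googlebot/2.1 (+http://www.google.com/bot.html)",
}


def build_request(url, user_agent):
    """Create a GET request that identifies itself with the given user agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, user_agent, timeout=10):
    """Download the raw response body as the given user agent would see it."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def find_cloaking(url, fetch=fetch_html):
    """Return the labels of user agents whose response differs from the browser's."""
    baseline = fetch(url, USER_AGENTS["browser"])
    return [
        label
        for label, user_agent in USER_AGENTS.items()
        if label != "browser" and fetch(url, user_agent) != baseline
    ]
```

Calling `find_cloaking("http://example.com/")` returns an empty list when every user agent receives identical bytes. Treat a difference only as a hint: dynamic pages can vary between two requests for unrelated reasons such as timestamps or session IDs, and cloaking based on IP address cannot be detected this way.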


How to Successfully and Completely Scan Your Website

If you are experiencing website crawling problems, read our answer about finding all links and pages in a website scan.


Problematic Websites and Specific Website Platforms

A few website platforms take measures against crawlers they do not recognize in order to reserve bandwidth and server resources for real visitors and search engines. Here is a list of known solutions for those website platforms:
If you are trying to crawl a forum, check our guide to Forum Sitemaps with Sitemap Generator.


General Solutions to Problematic Websites

  • Set Scan website | Crawler engine | Max simultaneous connections to one.
  • Set Crawler engine | Advanced engine settings | Default to GET requests to checked/enabled.
  • Then, if necessary, have the webcrawler identify itself as a search engine or as a user surfing.

    • Identify as "user surfing website":
      • Set Scan website | Crawler identification | User agent ID to Mozilla/4.0 (compatible; MSIE 7.0; Win32).
      • Set Scan website | Webmaster filters | Download "robots.txt" to unchecked/disabled.
    • Identify as "search engine crawler":
      • Set Scan website | Crawler identification | User agent ID to Googlebot/2.1 (+http://www.google.com/bot.html) or another search engine crawler ID.
      • Set Scan website | Webmaster filters | Download "robots.txt" to checked/enabled.
      • Set Scan website | Webmaster filters | Obey "robots.txt" file "disallow" directive to checked/enabled.
      • Set Scan website | Webmaster filters | Obey "robots.txt" file "crawl-delay" directive to checked/enabled.
      • Set Scan website | Webmaster filters | Obey "meta" tag "robots" noindex to checked/enabled.
      • Set Scan website | Webmaster filters | Obey "meta" tag "robots" nofollow to checked/enabled.
      • Set Scan website | Webmaster filters | Obey "a" tag "rel" nofollow to checked/enabled.
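
For readers who want to see what obeying the "disallow" and "crawl-delay" directives means in practice, here is a minimal Python sketch based on the standard library `urllib.robotparser` module. The robots.txt content, the example.com URLs and the `plan_fetch` helper are assumptions made for this illustration; the sketch does not describe how our crawler engine is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as a website might serve it.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

# The search engine crawler ID used in the settings above.
AGENT = "Googlebot/2.1 (+http://www.google.com/bot.html)"


def load_rules(robots_txt):
    """Parse robots.txt text into a rule set without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def plan_fetch(rules, user_agent, url):
    """Return (allowed, delay_in_seconds) for one URL under the given rules."""
    allowed = rules.can_fetch(user_agent, url)
    delay = rules.crawl_delay(user_agent) or 0
    return allowed, delay


rules = load_rules(ROBOTS_TXT)
for url in ("http://example.com/index.html",
            "http://example.com/private/report.html"):
    print(url, plan_fetch(rules, AGENT, url))
```

A polite crawler would skip every URL reported as not allowed and wait the returned number of seconds between requests. Combined with fetching one URL at a time, this mirrors the "Max simultaneous connections" setting of one.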

You can also download our default project file for problematic websites.

If your problem is still not solved, try the solutions used for e.g. NetSuite websites. You can often apply the same solutions to a wide range of websites.


 © Copyright 1997-2010 Microsys