URL Rewriting and Similar
Your website can use virtual directories or URL rewriting without problems.
Many people use Apache mod_rewrite to create virtual directories and URLs.
Similar solutions exist for nearly all web servers.
From a crawler perspective, websites can safely use virtual directories and URLs.
Browsers, search engine bots etc. all view your website from the outside.
They have no knowledge of how your URL structure is implemented.
They cannot tell if pages or directories are physical.
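To illustrate why a client cannot tell the difference, here is a minimal sketch of URL rewriting in Python. The rules and the file names `product.php` and `index.php` are hypothetical and only stand in for whatever a real mod_rewrite configuration would map to.

```python
import re

# Hypothetical rewrite rules: (public URL pattern, internal target template).
REWRITE_RULES = [
    (re.compile(r"^/products/(\d+)/?$"), r"/product.php?id=\1"),
    (re.compile(r"^/blog/(\d{4})/([a-z0-9-]+)/?$"), r"/index.php?year=\1&slug=\2"),
]


def rewrite(public_path: str) -> str:
    """Map a public (virtual) path to the internal resource that serves it."""
    for pattern, target in REWRITE_RULES:
        if pattern.match(public_path):
            # The patterns are anchored, so sub() replaces the whole path.
            return pattern.sub(target, public_path)
    return public_path  # no rule matched: the path is served as-is


if __name__ == "__main__":
    print(rewrite("/products/42"))
    print(rewrite("/blog/2014/url-rewriting/"))
```

The crawler or browser only ever requests the public path. The mapping to the internal target happens entirely on the server.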
Server Side Language and HTML Web Pages
In a modern website, there is often little or no correlation between the URL "file names"
and the underlying data including how it is generated, stored and retrieved.
It does not matter if a website uses ColdFusion, ASP.NET, JSP, PHP or similar as its server-side programming language.
In settings, the crawler in our site analysis tool can be set to accept or ignore URLs with certain file extensions and MIME content types.
If you have trouble, read about finding all pages and links.
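As a rough sketch of what filtering by file extension and MIME content type can look like, consider the Python below. The extension and MIME sets are hypothetical examples, not the defaults of the tool, and the function names are made up for illustration.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical filter settings, chosen only for this example.
IGNORED_EXTENSIONS = {".jpg", ".png", ".zip"}
ACCEPTED_MIME_TYPES = {"text/html", "application/xhtml+xml"}


def extension_allowed(url: str) -> bool:
    """Check the extension of the URL path; the query string is ignored."""
    path = urlsplit(url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension not in IGNORED_EXTENSIONS


def mime_allowed(content_type_header: str) -> bool:
    """Check the MIME type from a Content-Type header, ignoring parameters."""
    mime = content_type_header.split(";", 1)[0].strip().lower()
    return mime in ACCEPTED_MIME_TYPES
```

Note that a URL without any extension, such as a rewritten URL, passes the extension check. The MIME type returned by the server is then the more reliable signal.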
Dynamically Created Content on Server
Sites that dynamically generate page content using server-side scripts and databases are crawled without problems by site crawlers and robots.
Some search engine robots may slow down when crawling URLs that contain a ? character.
However, that is mainly because search engines are wary of spending crawl resources on lots of URLs with auto-generated content.
To mitigate this, you can use mod_rewrite or similar in your website.
Our website scraper and the MiggiBot crawler engine do not care about how URLs look.
Many websites nowadays use responsive layouts that adjust themselves in the browser using client-side technologies, e.g. CSS.
However, some websites have special website URLs for:
- Feature phones that only support WAP and similar old technologies.
- Smartphones with browsers that are very similar to desktop browsers and render content the same way.
- Desktop computers, laptops and tablets where the screen area and viewport are larger.
Generally, such mobile optimized websites know they need to output content optimized for mobile devices by either:
- Assuming they should always output content optimized for a given set of mobile devices, e.g. smartphones.
- Performing server-side checks on the user agent passed to them by the crawler or browser. Then, if a mobile device is identified, the website will either redirect to a new URL or simply output content optimized for mobile devices.
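The server-side check described in the last item could be sketched as follows in Python. The detection pattern is deliberately simplistic, and the host name `m.example.com` and the function names are assumptions made for this example only.

```python
import re

# Simplified, hypothetical pattern; real device detection is more involved.
MOBILE_PATTERN = re.compile(r"Mobile|Android|iPhone|Windows Phone", re.IGNORECASE)


def is_mobile(user_agent: str) -> bool:
    """Guess from the User-Agent header whether the client is a mobile device."""
    return bool(MOBILE_PATTERN.search(user_agent or ""))


def respond(path: str, user_agent: str, redirect: bool = True):
    """Return (status, headers, body) for a request, based on the user agent."""
    if is_mobile(user_agent):
        if redirect:
            # Variant 1: send mobile clients to a separate mobile URL.
            return 302, {"Location": "http://m.example.com" + path}, ""
        # Variant 2: same URL, but mobile optimized content.
        return 200, {"Vary": "User-Agent"}, "<html>mobile layout</html>"
    return 200, {"Vary": "User-Agent"}, "<html>desktop layout</html>"
```

A crawler that sends a desktop user agent will only ever see the last branch, which is why changing the user agent ID matters.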
If you want the A1 Website Scraper crawler to see the mobile content and URLs your website outputs to mobile devices, simply change the setting
General options and tools | Internet crawler | User agent ID
to one used by popular mobile devices, e.g. this:
Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19
You can do the same with most desktop browsers by installing a user agent switcher plugin.
That allows you to inspect the code returned by your website to mobile browsers.
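If you prefer to inspect the returned code from a script, a minimal sketch using only the Python standard library could look like this. The function names are made up for this example, and the user agent string is the one quoted above.

```python
from urllib.request import Request, urlopen

MOBILE_UA = (
    "Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) "
    "AppleWebKit/535.19 (KHTML, like Gecko) "
    "Chrome/18.0.1025.133 Mobile Safari/535.19"
)


def build_request(url: str, user_agent: str = MOBILE_UA) -> Request:
    """Create a request that identifies itself with the given user agent."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url: str, user_agent: str = MOBILE_UA, timeout: float = 10.0) -> str:
    """Download a page as text while sending the given user agent."""
    with urlopen(build_request(url, user_agent), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_html` once with a mobile and once with a desktop user agent gives two versions of the page source to compare.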
If your mobile optimized website uses a mix of client-side and server-side technologies such as AJAX to detect the user agent and alter content based on it, it will not work on many website crawlers, including, as of September 2014 at least, A1 Website Scraper.
It will, however, work in browsers, since they execute client-side code which can query the browser for the user agent ID.
If your website uses AJAX to update content in the browser without changing URL address, it is worth knowing that crawlability will depend on the exact implementation.
Explanation of fragments in URLs:
- Page-relative fragments: Relative links within a page.
- AJAX fragments: Client-side code queries server-side code and replaces content in the browser.
- AJAX-fragments-Google-initiative: Part of the Google initiative Making AJAX Applications Crawlable.
This solution has since been deprecated by Google themselves.
If you use this solution, you will see URLs containing _escaped_fragment_ when crawled.
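For reference, the mapping from a #! URL to its _escaped_fragment_ form can be sketched like this in Python. The function name and the example.com URLs are assumptions for illustration, and the percent-encoding here only approximates the scheme's escaping rules (it may escape more characters than strictly required).

```python
from urllib.parse import quote


def to_escaped_fragment(url: str) -> str:
    """Convert a '#!' (hashbang) URL to the form a crawler would request."""
    base, marker, fragment = url.partition("#!")
    if not marker:
        return url  # not a hashbang URL; left unchanged
    # Append to an existing query string if there is one.
    separator = "&" if "?" in base else "?"
    # Percent-encode characters such as space, #, %, & and +, but keep "=".
    return base + separator + "_escaped_fragment_=" + quote(fragment, safe="=")


if __name__ == "__main__":
    print(to_escaped_fragment("http://example.com/somepage#!key=value"))
```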
To help successfully crawl AJAX websites:
- Select an AJAX enabled crawler option in Scan website | Crawler engine.
- Enable option Scan website | Crawler options | Try search inside JSON.
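The general idea of searching inside JSON for links can be sketched as below. This is only an illustration in Python of the technique and makes no claim about how the option is implemented in the tool. The function name, the sample payload and the rule for what counts as a link are all assumptions.

```python
import json
from urllib.parse import urljoin


def find_links_in_json(text: str, base_url: str) -> list:
    """Walk a JSON document and collect string values that look like URLs."""
    found = []

    def walk(node):
        if isinstance(node, dict):
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)
        elif isinstance(node, str) and node.startswith(("http://", "https://", "/")):
            # Resolve root-relative paths against the page that loaded the JSON.
            found.append(urljoin(base_url, node))

    walk(json.loads(text))
    return found


if __name__ == "__main__":
    payload = '{"items": [{"title": "A", "href": "/page-a"}], "next": "/list?page=2"}'
    print(find_links_in_json(payload, "http://example.com/app"))
```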
Normally, websites do not cloak content based on user agent string and IP address.
However, by setting the user agent ID, you can check what search engines and browsers see when retrieving pages from your website.
This can also be used to test if a website responds correctly to a crawler/browser that identifies itself as being mobile.
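A simple way to run such a check from a script is to request the same URL under several user agent strings and compare the responses. The sketch below takes the download function as a parameter, so the names `compare_user_agents`, `responses_differ` and `fetch` are hypothetical. Pages with timestamps or session tokens differ on every request, so a mismatch is only a hint, not proof of cloaking.

```python
import hashlib
from typing import Callable, Dict


def compare_user_agents(
    url: str,
    user_agents: Dict[str, str],
    fetch: Callable[[str, str], bytes],
) -> Dict[str, str]:
    """Fetch url once per user agent and return a SHA-256 digest per label."""
    return {
        label: hashlib.sha256(fetch(url, user_agent)).hexdigest()
        for label, user_agent in user_agents.items()
    }


def responses_differ(digests: Dict[str, str]) -> bool:
    """True if at least two user agents received different content."""
    return len(set(digests.values())) > 1
```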
Website bandwidth filtering and/or throttling
Some website platforms and modules take measures against crawlers they do not recognize in order to reserve bandwidth and server usage for real visitors and search engines.
Here is a list of known solutions for those website platforms:
If you are trying to crawl a forum, check our guide to optimal crawling of forums and blogs with website scraper.
Website Erratically Sends The Wrong Page Content
We have seen a few cases where the website, server, CDN or cache system suffered from a bug and sent the wrong output page content when being crawled.
To prove and diagnose such a problem, download and configure A1 Website Download:
- Set Scan website | Download options | Convert URL paths in downloaded content to no conversion.
- Enable Scan website | Data collection | Store redirects and links from and to all pages.
- Enable all options in Scan website | Webmaster filters.
You can now compare the downloaded page source code with what is reported in A1 Website Download and see if the webserver/website sent correct or incorrect content to the A1 crawler engine.
To solve such an issue without access to the website and webserver code, try using some of the configurations suggested further below.
If you encounter a website that throttles crawler requests, blocks certain user agents, or is very slow, you will often get response codes such as:
- 403 : Forbidden
- 503 : Service Temporarily Unavailable
- -5 : TimeoutConnectError
- -6 : TimeoutReadError
To solve these, try the following:
Set Scan website | Crawler engine | Max simultaneous connections to one.
Set Crawler engine | Advanced engine settings | Default to GET requests to checked/enabled.
Then, if necessary, have the webcrawler identify itself as a search engine or as a user surfing.
- Identify as "user surfing website":
  - Set General options and tools | Internet crawler | User agent ID to Mozilla/4.0 (compatible; MSIE 7.0; Win32).
  - Set Scan website | Webmaster filters | Download "robots.txt" to unchecked/disabled.
  - In Scan website | Crawler engine increase the amount of time between active connections.
  - Optional: Set Scan website | Crawler engine to HTTP using WinInet engine and settings (Internet Explorer).
- Identify as "search engine crawler":
  - Set General options and tools | Internet crawler | User agent ID to Googlebot/2.1 (+http://www.google.com/bot.html) or another search engine crawler ID.
  - Set Scan website | Webmaster filters | Download "robots.txt" to checked/enabled.
  - Set Scan website | Webmaster filters | Obey "robots.txt" file "disallow" directive to checked/enabled.
  - Set Scan website | Webmaster filters | Obey "robots.txt" file "crawl-delay" directive to checked/enabled.
  - Set Scan website | Webmaster filters | Obey "meta" tag "robots" noindex to checked/enabled.
  - Set Scan website | Webmaster filters | Obey "meta" tag "robots" nofollow to checked/enabled.
  - Set Scan website | Webmaster filters | Obey "a" tag "rel" nofollow to checked/enabled.
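To see what obeying the robots.txt "disallow" and "crawl-delay" directives means in practice, here is a minimal sketch using the Python standard library module `urllib.robotparser`. The robots.txt content and the example.com URLs are made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 3
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    rules = load_rules(ROBOTS_TXT)
    print(rules.can_fetch("Googlebot", "http://example.com/private/report.html"))
    print(rules.can_fetch("Googlebot", "http://example.com/index.html"))
    print(rules.crawl_delay("Googlebot"))
```

A crawler that identifies itself as a search engine is expected to skip the disallowed URLs and wait the stated number of seconds between requests.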
If you continue to have problems, you can combine the above with:
Set Scan website | Crawler engine | Crawl-delay in miliseconds between connections to at least 3000.
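The combined effect of one connection at a time, a fixed delay between connections and a retry on throttling responses can be sketched as follows in Python. The function name, the retry count and the back-off rule are assumptions for illustration and do not describe the crawler engine itself. The download and sleep functions are passed in so the logic can be tried without network access.

```python
import time
from typing import Callable, Dict, Iterable, Tuple

# Response codes that often indicate throttling or blocking.
RETRY_STATUSES = {403, 503}


def crawl_sequentially(
    urls: Iterable[str],
    fetch: Callable[[str], Tuple[int, bytes]],
    delay_seconds: float = 3.0,
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, int]:
    """Fetch URLs one at a time with a pause between connections."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # fixed delay between connections
        for attempt in range(1, max_attempts + 1):
            status, _body = fetch(url)
            if status not in RETRY_STATUSES or attempt == max_attempts:
                break
            sleep(delay_seconds * attempt)  # wait longer after each failure
        results[url] = status
    return results
```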
If your IP address has been blocked, you can try using General options and tools | Internet crawler | HTTP proxy settings.
Proxy support depends on which HTTP engine has been selected in Scan website | Crawler engine.
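For comparison, routing requests through an HTTP proxy from a script can be sketched with the Python standard library as below. The function name is made up, and any proxy address you pass in is your own. This does not describe how the tool applies its proxy settings.

```python
from urllib.request import OpenerDirector, ProxyHandler, build_opener


def build_proxy_opener(proxy_url: str) -> OpenerDirector:
    """Create an opener that sends HTTP and HTTPS requests through a proxy."""
    handler = ProxyHandler({"http": proxy_url, "https": proxy_url})
    return build_opener(handler)
```

The returned opener is used like `opener.open(url, timeout=10)`, so the target website sees the IP address of the proxy instead of yours.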
If the problem is timeout errors, you can also try doing repeat crawls with resume scan.
You can also download our default project file for problematic websites, as you can often apply the same solutions to a wide range of websites.