Multiple Start Search Paths and Root Aliases in Crawler

Our website scraper tool supports mixing multiple domains and additional start scan paths.

Websites with Domain Mirrors or Multiple Root Path Urls

While usually not recommended because of duplicate content issues, some websites mix domains, links and www and non-www usage in URLs. In such cases, after configuring the site root to be scanned, which is usually the primary domain, make a list af root aliases.

domain root path aliases

Note: You need to use the [+] button to add a root path alias into the dropdown list.

In Scan website | Crawler options you can configure the website scraper tool to automatically add common root path aliases:

automatic root path aliases

The Website Root Path and How It Affects Crawling

If you use http://example.com/blogs/ as root, all paths outside (excluding root path aliases) such as e.g. http://example.com/forum/ will neither be included in output nor for analysis.

A better alternative might be to keep the website root as http://example.com/ followed by using analysis filters, output filters and additional start search paths (see below) to control your website crawl and resulting output.

Note: In this case you may need to uncheck option Scan website | Crawler options | Fix internal URLs if website root URL redirects to a different address.

Scan Websites From Multiple Start Search Paths

Websites with site areas that have no incoming links from within the rest of the website can sometimes cause a problem.

Remember that crosslinking hidden pages will not help if none of them are linked from anywhere else in the website.

This problem can easily be overcome in our website scraper software. It is possible to start a website scan from multiple paths in addition to website directory root.

start search paths

In newer versions, there are also buttons for quickly adding additional start search from addresses by:

importing list of URLs from search engines.
importing list of URLs from a file.
importing list of URLs from a website page URL.
addding common urls such as typical xml sitemap paths.

Note: You need to use the [+] button to add additional start scan paths into the dropdown list.

Note: It is often better to make sure your website is cross linked, so crawlers can find all pages on their own.

A1 Website Scraper | help | previous | next

Extract data from sites into CSV files. By scraping websites, you can grab data on websites and transform it into CSV files ready to be imported anywhere, e.g. SQL databases

This help page is maintained by Thomas Schulz

As one of the lead developers, his hands have touched most of the code in the software from Microsys. If you email any questions, chances are that he will be the one answering.