How the Crawler Handles Hard and Soft 404 Not Found URLs
A1 Website Search Engine has an option to crawl error pages for links because the software has built-in protections against crawling endless chains of error pages.
Generally speaking, crawling URLs that return an error such as 404 - Not Found is a bad idea. To understand why, consider how a naive website crawler handles relative links found on broken pages:
- Crawler detects that the URL http://www.example.com/directory/ returns 404 - Not Found.
- Crawler finds that http://www.example.com/directory/ links to directory/something.
- Crawler concatenates http://www.example.com/directory/ and directory/something into http://www.example.com/directory/directory/something.
- Crawler detects that the URL http://www.example.com/directory/directory/ returns 404 - Not Found.
- Crawler finds that http://www.example.com/directory/directory/ links to directory/something.
- Crawler concatenates http://www.example.com/directory/directory/ and directory/something into http://www.example.com/directory/directory/directory/something.
This is a classic spider trap: the crawl would continue forever, generating ever-deeper URLs that do not exist.
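The runaway resolution described above can be reproduced in a few lines of Python using the standard library's relative-URL resolution. This is an illustrative simulation of a naive crawler, not A1 Website Search Engine's actual code:

```python
from urllib.parse import urljoin

# Simulate a naive crawler that keeps resolving the same
# path-relative link ("directory/something") against each
# 404 page it lands on.  Each resolution inherits the current
# URL's directory, so the path grows one level per step.
url = "http://www.example.com/directory/"
for _ in range(3):
    url = urljoin(url, "directory/something")
    print(url)
```

Each iteration prints a URL one "directory/" segment deeper than the last, which is exactly the endless loop the crawler protections guard against.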
This is why most crawlers, by default, do not continue crawling pages that return 404 - Not Found.
Some websites, however, include important links on the pages they return for errors such as 404 - Not Found. You can force A1 Website Search Engine to scan error pages for links by checking the option: scan website | crawler options | crawl error pages
Please note that links relative to the current path will still be ignored when analyzing error pages, to avoid getting caught in an endless crawling loop.
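The distinction matters because only path-relative links inherit the (possibly non-existent) directory of the error URL; root-relative and absolute links resolve the same way no matter where they are found. A small Python sketch of such a classification, assuming a hypothetical helper function rather than the product's actual logic:

```python
from urllib.parse import urljoin, urlparse

def is_path_relative(href: str) -> bool:
    """True for links like 'directory/something' that resolve
    relative to the current URL's directory -- the kind that is
    ignored on error pages to avoid the spider trap."""
    parsed = urlparse(href)
    return not parsed.scheme and not parsed.netloc and not href.startswith("/")

error_url = "http://www.example.com/directory/directory/"
for href in ["directory/something",           # path-relative: ignored
             "/something",                    # root-relative: safe
             "http://www.example.com/page"]:  # absolute: safe
    kept = not is_path_relative(href)
    print(f"{href!r}: {'kept' if kept else 'ignored'} -> {urljoin(error_url, href)}")
```

Root-relative and absolute links always point to the same target regardless of how deep the erroneous URL is, so following them cannot create the runaway paths shown earlier.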
If you need error pages scanned for links, use one of the following kinds of links instead: