Crawling hard and soft 404 "not found" URLs with A1 Website Analyzer

A1 Website Analyzer has an option to crawl error pages for links. This is safe to use because our software has built-in protection against endlessly crawling error pages.

 To see all the options available, you will have to switch off easy mode 

 With options that use a dropdown list, any [+] or [-] button next to it adds or removes items in the list itself 

Why crawling "404 - not found" page URLs is problematic

Generally speaking, crawling URLs that return an error such as 404 - Not Found is a bad idea. To understand why, consider how a naive website crawler handles relative links found on broken pages:
    • Crawler detects URL http://www.example.com/directory/ gives 404 - Not Found.
    • Crawler finds http://www.example.com/directory/ links to directory/something.
    • Crawler resolves directory/something against http://www.example.com/directory/ into http://www.example.com/directory/directory/something.
    • Crawler detects URL http://www.example.com/directory/directory/something gives 404 - Not Found.
    • Crawler finds http://www.example.com/directory/directory/something links to directory/something.
    • Crawler resolves directory/something against http://www.example.com/directory/directory/ into http://www.example.com/directory/directory/directory/something.
    • The result is a classic spider trap: every new error page yields another, longer URL, so the website crawl continues forever.

This is why most crawlers by default do not continue crawling pages that return 404 - Not Found.
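
The loop described above can be reproduced in a few lines. The following is a minimal sketch, not how A1 Website Analyzer works internally: the function name naive_crawl and its parameters are hypothetical, and the code assumes every generated URL returns a 404 page containing the same relative link. URL resolution is done with urljoin from the Python standard library.

```python
from urllib.parse import urljoin


def naive_crawl(start_url, relative_link, max_steps=3):
    """Follow the same relative link from each error page, like a naive crawler.

    Assumption: every URL produced here answers with a 404 page whose body
    contains `relative_link`. No network requests are made.
    """
    visited = [start_url]
    url = start_url
    for _ in range(max_steps):
        # Resolve the relative link against the current (broken) URL.
        url = urljoin(url, relative_link)
        visited.append(url)
    return visited


# Without the max_steps cap, this list would keep growing forever.
for url in naive_crawl("http://www.example.com/directory/", "directory/something"):
    print(url)
```

Each step yields a URL that has never been seen before, so a "do not visit the same URL twice" check alone does not stop the crawl.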


A1 Website Analyzer can crawl 404 pages

Some websites include important links in pages returned for errors such as 404 - Not Found. You can force A1 Website Analyzer to scan error pages for links by checking the option: scan website | crawler options | crawl error pages.

Please note that links relative to the current path are ignored when analysing error pages, to avoid getting caught in an endless crawling loop.

If you need links on error pages to be followed, use root-relative or absolute links instead, i.e. one of the following kinds:
  • /directory/something
  • http://www.example.com/directory/something
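
To illustrate the distinction, here is a small sketch of how a crawler could separate the link kinds. It is an assumption about one possible approach, not the actual A1 Website Analyzer implementation, and the helper names is_path_relative and links_to_follow are hypothetical.

```python
from urllib.parse import urlsplit


def is_path_relative(href):
    """Return True if href would be resolved against the current page's path."""
    parts = urlsplit(href)
    if parts.scheme or parts.netloc:
        # Absolute (http://host/...) or protocol-relative (//host/...) link.
        return False
    # Root-relative links start with "/"; everything else depends on the
    # path of the page the link was found on.
    return not parts.path.startswith("/")


def links_to_follow(hrefs, page_is_error):
    """On error pages, keep only links that do not depend on the current path."""
    if not page_is_error:
        return list(hrefs)
    return [href for href in hrefs if not is_path_relative(href)]


found = [
    "directory/something",
    "/directory/something",
    "http://www.example.com/directory/something",
]
print(links_to_follow(found, page_is_error=True))
```

In this sketch only the first link is dropped on an error page, because it is the only one whose target changes with the URL of the page it appears on.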


Soft 404 errors and how to avoid them

If your website correctly returns HTTP response 404 - Not Found for a non-existing URL, it is called a hard 404 error. Conversely, a soft 404 error is when your website instead incorrectly responds with a success code, e.g. HTTP response 200 - OK.

Soft 404 errors are problematic for crawlers because they create spider traps similar to the one described above: the crawler sees a successful response, treats the page as a normal page and keeps following its links.

Note: Even if your page visibly states "not found" in the page text for URLs that do not exist, you need to make sure your website actually returns HTTP response code 404 - Not Found and not e.g. 200 - OK.
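
One way to check this for your own website is to request a URL that almost certainly does not exist and inspect the response code. The sketch below uses only the Python standard library. The function names and the probe path prefix "/no-such-page-" are hypothetical, and since urlopen follows redirects, a redirect to a page answering 200 is reported as a soft 404 as well.

```python
import urllib.error
import urllib.request
import uuid
from urllib.parse import urljoin


def fetch_status(url):
    """Return the final HTTP status code for url."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises HTTPError for 4xx and 5xx responses.
        return error.code


def classify_missing_url(base_url, get_status=fetch_status):
    """Probe a random, presumably non-existing URL and classify the answer."""
    probe = urljoin(base_url, "/no-such-page-" + uuid.uuid4().hex)
    status = get_status(probe)
    if status == 404:
        return "hard 404"
    if 200 <= status < 300:
        return "soft 404"
    return "other (%d)" % status
```

Calling classify_missing_url("http://www.example.com/") performs a real request. The get_status parameter exists so that the classification can be tried without network access, for example with get_status=lambda url: 200.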
This help page is maintained by

As one of the lead developers, his hands have touched most of the code in the software from Microsys.

If you email any questions, chances are that he will be the one answering them.
 © Copyright 1997-2017 Microsys