Solve Website Scanning and Crawling Problems
Author: Thomas Schulz
Created: 2007-06-22 (yyyy-mm-dd)
Updated: 2008-07-05
Related: A1 Sitemap Generator - A1 Website Analyzer - A1 Website Download - A1 Keyword Research
Abstract: Scan and crawl websites when creating sitemaps, doing website analysis etc.
Solve broken links and broken redirects
Common reasons why fewer pages than expected are found
- The website has links or redirects that use incorrect paths for internal URLs:
- Website has inconsistent usage of www, e.g. http://example.com and http://www.example.com links.
- Website has inconsistent usage of port numbers, e.g. http://www.example.com:80 and http://www.example.com links.
- Multiple domains are used interchangeably, e.g. http://www.example1.com and http://www.example2.com.
- Different from what website uses in its internal links, e.g. www vs non-www.
- Website and/or pages redirect to another domain or address.
- Getting content from another domain, e.g. through <frame> and <iframe>.
- Tip: Check the external links found during the scan to verify whether any of the above is the case.
- Content only links to a limited number of random internal pages. All other links are to external websites.
- Content has no links into a whole area of pages. In this case, having cross-linked all hidden pages is no help.
- Website relies on JavaScript or uncommon types of HTML link tags for website navigation, e.g. <iframe>, <form> and <button>.
Note: Although tools such as A1 Sitemap Generator can be configured to handle the above, fixing these issues will improve search engine indexing.
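
The www and port number inconsistencies listed above can be spotted in a list of found links with a few lines of Python. This is a minimal sketch, not part of any A1 program: the function names and the sample URLs are assumptions made for illustration. It groups links by host after stripping a leading "www." and a default port, then reports hosts that are written in more than one way.

```python
from urllib.parse import urlsplit

# Default ports that are equivalent to writing no port at all.
DEFAULT_PORTS = {"http": 80, "https": 443}


def host_variant(url):
    """Return (scheme, host, port) exactly as written in the URL."""
    parts = urlsplit(url)
    return (parts.scheme, parts.hostname, parts.port)


def canonical_host(url):
    """Strip a leading 'www.' and a default port so variants compare equal."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[4:]
    port = parts.port
    if port == DEFAULT_PORTS.get(parts.scheme):
        port = None
    return (host, port)


def find_inconsistent_hosts(urls):
    """Map each canonical host to its written variants, if more than one."""
    groups = {}
    for url in urls:
        groups.setdefault(canonical_host(url), set()).add(host_variant(url))
    return {
        key: sorted(variants, key=str)
        for key, variants in groups.items()
        if len(variants) > 1
    }


links = [
    "http://example.com/a",
    "http://www.example.com/b",
    "http://www.example.com:80/c",
    "http://other.example.org/",
]
for host, variants in find_inconsistent_hosts(links).items():
    print(host, "is written in", len(variants), "different ways")
```

Any host reported here is a candidate for a redirect to one preferred form.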
Common reasons why more pages than expected are found
- A link uses // instead of / and the webserver does not respond with an error or a redirect. The problem cascades if the linked document uses relative paths.
- A dynamic page generates unique links based on input from GET (?) query data. This can sometimes cause an endless loop of unique URLs.
- If you have multiple pages with duplicate content: Make sure to redirect all of them to the primary URL.
Note: Although tools such as A1 Sitemap Generator can be configured to handle the above, fixing these issues will improve search engine indexing.
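
To check whether duplicated slashes are inflating a URL list, the paths can be collapsed and compared. The sketch below is only an illustration under stated assumptions: `collapse_slashes` and `duplicate_slash_groups` are hypothetical helper names, and the code assumes the server returns the same page for // and /.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def collapse_slashes(url):
    """Replace runs of '/' in the path with a single '/'."""
    parts = urlsplit(url)
    path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


def duplicate_slash_groups(urls):
    """Group URLs that differ only by repeated slashes in the path."""
    groups = {}
    for url in urls:
        groups.setdefault(collapse_slashes(url), []).append(url)
    return {key: found for key, found in groups.items() if len(found) > 1}


found = [
    "http://www.example.com/dir/page.html",
    "http://www.example.com//dir/page.html",
    "http://www.example.com/dir//page.html",
    "http://www.example.com/other.html",
]
for primary, variants in duplicate_slash_groups(found).items():
    print(primary, "was found under", len(variants), "URLs")
```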
Website scan result has URLs with "unknown" response codes
Sometimes URLs have unknown or non-standard response codes because:
- No incoming links to the URL found during website scan:
  - Common state.
  - Response code: 0. Description: VirtualItem.
  - Cause:
    - Crawler encounters /example-directory/example-file.html.
    - Crawler encounters no (!) references to /example-directory/.
    - Given the above, /example-directory/ is never checked for a response code.
  - Solution: Check option Website scan > Crawler options > Always scan directories that contain linked URLs.
- No request has been done by crawler:
  - Common state.
  - Response code: -1. Description: NoRequest.
  - Cause:
    - Happens if e.g. scan was paused and has not yet finished crawling.
- Server responded with an unrecognized response code:
  - Error state.
  - Response code: -2. Description: UnknownResult.
- Request timed out:
  - Error state.
  - Generic timeout response code: -3. Description: TimeoutGeneric.
  - Connect timeout response code: -5. Description: TimeoutConnect.
  - Read timeout response code: -6. Description: TimeoutRead.
  - Solutions:
    - Lower the number of simultaneous connections. Increase timeout values. Increase connection retries.
    - Resume scan. Crawler will attempt to connect again.
- Unknown HTTP request/response communication error:
  - Error state.
  - Response code: -4. Description: CommError.
  - Causes:
    - Webserver did not obey HTTP protocol.
    - The server/domain of the URL did not exist.
Remember that when creating sitemaps,
it is possible to choose which URL response codes are allowed.
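
When post-processing exported scan results, the special codes listed above can be kept in a small lookup table. The codes and descriptions below are copied from the list; everything else (the function names and the shape of `scan_results`) is an assumption for illustration and not an A1 Sitemap Generator API.

```python
# Non-HTTP response codes described in this article: (description, state).
CRAWLER_STATES = {
    0: ("VirtualItem", "common"),
    -1: ("NoRequest", "common"),
    -2: ("UnknownResult", "error"),
    -3: ("TimeoutGeneric", "error"),
    -4: ("CommError", "error"),
    -5: ("TimeoutConnect", "error"),
    -6: ("TimeoutRead", "error"),
}

TIMEOUT_CODES = {-3, -5, -6}


def describe(code):
    """Return a readable label for a crawler or HTTP response code."""
    if code in CRAWLER_STATES:
        name, state = CRAWLER_STATES[code]
        return f"{name} ({state} state)"
    return f"HTTP {code}"


def urls_to_retry(scan_results):
    """Pick URLs that timed out and are worth resuming the scan for."""
    return [url for url, code in scan_results.items() if code in TIMEOUT_CODES]


# Hypothetical scan results: URL -> response code.
scan_results = {
    "http://www.example.com/": 200,
    "http://www.example.com/slow.html": -6,
    "http://www.example.com/example-directory/": 0,
}
for url, code in scan_results.items():
    print(url, "->", describe(code))
```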

Scan websites with links and pages mirrored across multiple domains
Scan websites that mirror, link and distribute their content across multiple domains:
First configure the primary website directory root, typically the main domain name, and then create a list of website root aliases.
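
The idea behind root aliases can be sketched as follows: every URL that starts with the primary root or one of its aliases is treated as internal and mapped to the primary root. This is a rough illustration only. `to_primary` is a hypothetical function, the domains are placeholders, and each root is assumed to end with a slash so that prefix matching cannot match a longer, unrelated host name.

```python
def to_primary(url, primary_root, aliases):
    """Rewrite a URL under any known root to the primary root.

    Returns None when the URL belongs to none of the roots (external).
    """
    for root in [primary_root, *aliases]:
        if url.startswith(root):
            return primary_root + url[len(root):]
    return None


primary_root = "http://www.example1.com/"
aliases = ["http://www.example2.com/", "http://example1.com/"]

for url in [
    "http://www.example2.com/products/list.html",
    "http://example1.com/about.html",
    "http://elsewhere.example.org/",
]:
    print(url, "->", to_primary(url, primary_root, aliases))
```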

Scan websites with non-linked hidden sections
Crawl websites that have multiple areas with no incoming links from within the website:
The solution is to initiate the website scan from multiple start paths beyond just the primary website directory root.
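
Why extra start paths help can be shown with a toy crawl over an in-memory link graph. This is a simplified sketch rather than how any particular crawler is implemented: the `link_graph` dictionary stands in for real pages and their links, and all paths in it are made up.

```python
from collections import deque


def crawl(link_graph, start_paths):
    """Breadth-first crawl from several start paths; return all pages found."""
    seen = set()
    queue = deque()
    for path in start_paths:
        if path not in seen:
            seen.add(path)
            queue.append(path)
    while queue:
        page = queue.popleft()
        for target in link_graph.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# The /hidden/ area is cross-linked internally but has no incoming links.
link_graph = {
    "/": ["/a.html"],
    "/a.html": ["/"],
    "/hidden/": ["/hidden/x.html"],
    "/hidden/x.html": ["/hidden/"],
}

print(len(crawl(link_graph, ["/"])))              # root only
print(len(crawl(link_graph, ["/", "/hidden/"])))  # root plus hidden area
```

Starting from the root alone never reaches /hidden/, no matter how well its pages link to each other.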

Scan websites with complicated navigation and link elements
Get and find all links that use JavaScript or rare HTML tags for site navigation:
Set crawler options to scan for file references in JavaScript, CSS, <frame>, <form> etc.
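
To get a feel for what scanning rare HTML tags means, the sketch below collects URL references from <a>, <frame>, <iframe> and <form> using Python's standard html.parser module. The tag-to-attribute table is an assumption chosen for this example and is not the crawler's actual rule set. JavaScript and CSS references are not handled here.

```python
from html.parser import HTMLParser

# Which attribute carries the URL for each tag (illustrative selection).
LINK_ATTRS = {"a": "href", "frame": "src", "iframe": "src", "form": "action"}


class LinkCollector(HTMLParser):
    """Collect (tag, url) pairs from the tags listed in LINK_ATTRS."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = LINK_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.links.append((tag, value))


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


page = (
    '<a href="/a.html">A</a>'
    '<iframe src="/f.html"></iframe>'
    '<form action="/search"><button>Go</button></form>'
)
for tag, url in extract_links(page):
    print(tag, url)
```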

Solve unstable and slow website scan
Perhaps the server is overloaded. You can try configuring your crawler engine so that it puts less load on the webserver and waits longer for response content.
We have written an article about how to optimize website scan speed and lessen usage of CPU and memory resources.
Log and analyze website crawling issues
If you still experience strange problems spidering your website, try enabling Scan website > Data collection > Logging of progress.
After the website scan, you can find a log file in the program data directory under logs > misc.

The log file can be useful in solving problems related to crawler filters, robots.txt, no-follow links etc.
You can find out through which page the crawler first found a specific website section.
2007-07-28 10:56:14
CodeArea: InitLink:Begin
ReferencedFromLink: http://www.example.com/website/
LinkToCheck: http://www.example.com/website/scan.html
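
A log like the one above can be searched by script to find the page through which a URL was first discovered. The sketch below assumes that every entry follows the layout of the single sample shown here, with a ReferencedFromLink line followed by a LinkToCheck line. The real log may contain other entry types, so treat this as a starting point only. The function name is hypothetical.

```python
def first_referrers(log_text):
    """Map each checked link to the first page that referenced it."""
    first = {}
    referrer = None
    for line in log_text.splitlines():
        key, sep, value = line.strip().partition(": ")
        if not sep:
            continue  # timestamp lines and blank lines have no "key: value"
        if key == "ReferencedFromLink":
            referrer = value
        elif key == "LinkToCheck" and referrer is not None:
            # setdefault keeps the earliest referrer seen for this link
            first.setdefault(value, referrer)
            referrer = None
    return first


sample_log = """2007-07-28 10:56:14
CodeArea: InitLink:Begin
ReferencedFromLink: http://www.example.com/website/
LinkToCheck: http://www.example.com/website/scan.html
"""

for link, source in first_referrers(sample_log).items():
    print(link, "was first found on", source)
```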