Crawl and Download Websites with Login Password Protected Pages
Author: Thomas Schulz
Created: 2008-06-08 (yyyy-mm-dd)
Updated: 2009-12-01
Related:
A1 Sitemap Generator -
A1 Website Analyzer -
A1 Website Download -
A1 Keyword Research
Abstract: Scan and crawl websites that require a login username and password.
Session cookie login demo project
We have created a demo project that tests crawler login support for websites that use
session cookies.
Session cookies are the most commonly used method for website login systems. They are what PHP uses by default when you call
session_start.
You can
test online or
download the zipped demo website with login support. For immediate testing, download the
zipped demo project file as well.
The username and password required to log in successfully are highlighted on the
login page.
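To make the session cookie mechanism concrete, here is a minimal Python sketch of what a crawler has to do with the cookie it receives at login: read the Set-Cookie header from the login response and send the same name and value back in a Cookie header on every later request. The header value below is made up for illustration (PHPSESSID is PHP's default session cookie name), and the function name is hypothetical. It is not how the A1 tools are implemented.

```python
from http.cookies import SimpleCookie


def cookie_header_from_response(set_cookie_value: str) -> str:
    """Turn a Set-Cookie header value into the Cookie header to send back."""
    jar = SimpleCookie()
    jar.load(set_cookie_value)
    # Only the name=value pairs are returned to the server;
    # attributes such as "path" stay on the client side.
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())


# Hypothetical header, as a PHP login page might send it.
login_response_header = "PHPSESSID=abc123; path=/"
print(cookie_header_from_response(login_response_header))
```

If the crawler drops this cookie between requests, the website treats each request as coming from a visitor who is not logged in.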
- First test manually that login support works:

Notice how all pages after login state that the user is logged in.
- We configure the website crawl root directory:

This is done in
Scan website > Paths.
- We check the source of the login page:

- You can view the page source in e.g. Firefox.
- If the URL in the <form> tag's action attribute is empty,
it means the action destination URL is the same as the login page URL.
- The name attributes of the <input> tags vary from website to website.
- We configure the login options:

This is done in
Scan website > Crawler identification.
- When manually testing login, you can use a Firefox plugin called Live HTTP Headers to see the headers transferred during the login process.
- If you still have trouble getting login to work, you can also use the general protocol analysis tool Wireshark.
- We need to filter out all URLs that would log the crawler out of the website during the crawl:

This is done in
Scan website > Crawler filters.
- Start the website scan. An easy way to test and verify that login works is to use A1 Website Download. Just view the downloaded pages: they should all state that the user is logged in.
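The form inspection and logout filtering steps can also be sketched in Python. The example below reads the login form's action attribute and the names of its <input> tags, resolves an empty action to the login page URL, and drops URLs that look like logout links. All names here are assumptions made for illustration: the example.com URLs, the "user" and "pass" field names, and the "logout" marker. Check the source of your own login page for the real values.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LoginFormParser(HTMLParser):
    """Collect the first <form>'s action and the names of its <input> tags."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.input_names = []
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action") or ""
        elif tag == "input" and self._in_form and attrs.get("name"):
            self.input_names.append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def login_target(page_url, html):
    """Return the URL the form posts to and the input names it submits."""
    parser = LoginFormParser()
    parser.feed(html)
    # An empty action resolves to the login page URL itself.
    return urljoin(page_url, parser.action or ""), parser.input_names


def is_logout_url(url, markers=("logout",)):
    """Assumed heuristic: a URL containing a marker would end the session."""
    return any(marker in url.lower() for marker in markers)


# Hypothetical login page and crawl queue.
page_url = "http://example.com/login.php"
html = (
    '<form method="post" action="">'
    '<input type="text" name="user">'
    '<input type="password" name="pass">'
    '<input type="submit" value="Login">'
    "</form>"
)
post_url, names = login_target(page_url, html)
print(post_url, names)

queue = ["http://example.com/members.php", "http://example.com/logout.php"]
print([url for url in queue if not is_logout_url(url)])
```

The submit button is skipped because it has no name attribute, so only the fields that carry the username and password are reported.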
XML Sitemaps and password protected pages
Restricting access to parts of your website can often have benefits. However, one downside is that it complicates
tasks such as
creating sitemaps for your registered users with access to all parts of your website. This guide has shown how to solve this using our
sitemap generator.
However, if you are trying to
create XML sitemaps
to attract Google and other search engines, you will still need to give these search engines access to your password protected pages.
Study these resources for possible solutions: