Login Across HTTPS Instead of HTTP
Note:
In its default configuration, A1 Sitemap Generator only supports session cookie login over HTTP, not HTTPS.
For better support of session cookies over HTTPS in our software, see General options | Tool paths.
Session Cookie Login Guide Using Demo Project
We have created a demo project that tests crawler login support for websites that use session cookies.
Session cookies are the most commonly used method for website login systems, and they are what PHP defaults to when using session_start. Most of these website logins use the POST method to transfer login and user data.
(Another popular login mechanism is "Basic realm authentication" which is also supported by our website crawler.)
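As a rough illustration of what a session cookie login over POST involves, here is a minimal sketch using only the Python standard library. The URL and the field names (username, password) are assumptions for illustration; the real ones come from the login form of the website being crawled. It is not how our crawler is implemented.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_login_request(login_url, fields):
    """Build a POST request that carries the login form fields."""
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(
        login_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def make_session_opener():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


# Hypothetical URL and field names; read the real ones from the login form.
request = build_login_request(
    "http://example.com/login.php",
    {"username": "demo", "password": "demo"},
)
opener, jar = make_session_opener()
# Calling opener.open(request) would send the POST. A session cookie set in the
# reply (PHP names it PHPSESSID by default) is kept in `jar` and sent again on
# every later opener.open() call, which is what keeps the crawl logged in.
print(request.get_method(), request.full_url, request.data)
```

No network request is made above; the sketch only shows how the request and the cookie handling fit together.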
You can test online or download the zipped demo website with login support.
For immediate testing, download the zipped demo project file as well.
The username and password required to log in successfully are highlighted on the login page.
- First, test manually that login works. Notice how every page after login states that the user is logged in.
- We configure the website crawl root directory. This is done in Scan website > Paths.
- We check the source of the login page:
  - You can view the source in e.g. Firefox.
  - Search for the <form> and <input> tags related to website login.
  - If the URL in the <form> tag action attribute is empty, the action destination URL is the same as the login page URL.
  - The name attributes in the <input> tags vary from website to website.
- We configure the login options. This is done in Scan website > Crawler identification.
- We need to filter out all URLs that will cause a website logout during the crawl. This is done in Scan website > Crawler filters.
- Start the website scan. An easy way to test and verify that login works is to use A1 Website Download. Just view the downloaded pages: they should all state that the user is logged in.
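The form inspection and logout filtering steps above can be sketched in a few lines of Python. This is only an illustration under assumptions: the sample HTML, the field names and the logout markers are made up, and the helper names are hypothetical. It shows how an empty action attribute resolves to the login page URL and how the <input> names can be collected.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LoginFormParser(HTMLParser):
    """Collect the action/method of the first <form> and its <input> names."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = "get"
        self.inputs = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action") or ""
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self._in_form and attrs.get("name"):
            self.inputs[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True  # only the first form is inspected


def inspect_login_form(page_url, html):
    """Return (destination URL, method, input fields) for the first form."""
    parser = LoginFormParser()
    parser.feed(html)
    # An empty action means the form posts back to the login page itself.
    target = urljoin(page_url, parser.action) if parser.action else page_url
    return target, parser.method, parser.inputs


def is_logout_url(url, markers=("logout", "logoff", "signout")):
    """Heuristic filter: URLs containing a marker would end the session."""
    return any(marker in url.lower() for marker in markers)


# Hypothetical login page source; real field names vary from website to website.
sample = """<form action="" method="post">
<input type="text" name="user">
<input type="password" name="pass">
<input type="submit" name="login" value="Login">
</form>"""
print(inspect_login_form("http://example.com/demo/login.php", sample))
print(is_logout_url("http://example.com/demo/logout.php"))
```

The collected names and destination URL are the values you would enter in the login options, and the marker check mirrors the kind of rule added to the crawler filters.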
Alternative Method: View and Copy Live HTTP Headers
When manually testing login, you can use a Firefox plugin called Live HTTP Headers to see the headers transferred during the login process.
Get the Firefox Live HTTP Headers plugin:
- Clear all HTTP headers already collected.
- Try to log in to the website in Firefox.
- Focus on the HTTP header data from the first entry / page.
- Notice the website address Firefox connects to.
- Notice the content (POST data query string) it sends.
- Use this data to configure headers to send.
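To turn the captured POST data into individual name/value pairs, the query string can be decoded as in the minimal Python sketch below. The captured string is a made-up example, and post_data_to_fields is a hypothetical helper name; the real data is whatever the header capture shows for your website.

```python
from urllib.parse import parse_qsl, urlencode


def post_data_to_fields(raw_post_data):
    """Split a captured POST body into decoded (name, value) pairs."""
    # keep_blank_values=True so that empty fields are not silently dropped.
    return parse_qsl(raw_post_data, keep_blank_values=True)


# Hypothetical captured POST data (query string format).
captured = "user=demo&pass=s%40cret&remember=&login=Login"
fields = post_data_to_fields(captured)
for name, value in fields:
    print(f"{name} = {value!r}")

# Re-encoding the pairs gives a body that decodes to the same fields.
print(urlencode(fields))
```

Decoding first makes it easier to spot percent-encoded characters in passwords and hidden fields that are sent with an empty value.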
If you still have trouble getting login to work, you can also use the general protocol analysis tool Wireshark.
Known Sitemap Generator Login Problems
Login concepts known to cause problems:
- Upon first login, a unique calculated value is passed in the login form. An example is JavaScript code that calculates a hash value from e.g. the exact time, IP address and browser user agent ID, and passes it in the login form. The server knows the algorithm with which the hash was generated and validates it server-side.
This makes it almost impossible to get sitemap generator login working correctly unless you have direct access to the website and know its internals very well.
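To see why replaying previously captured login data fails against such a scheme, consider the purely hypothetical sketch below. The hashing recipe (SHA-256 over time, IP address and user agent) is an assumption made up for illustration; real websites use their own algorithms, which is exactly the problem.

```python
import hashlib


def login_token(timestamp, ip_address, user_agent):
    """Hypothetical scheme: hash of values that change between requests."""
    material = f"{timestamp}|{ip_address}|{user_agent}"
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


# Token captured during a manual login (all values are made up).
captured = login_token(1700000000, "203.0.113.5", "Mozilla/5.0")
# Token the server would expect one minute later, from the same client.
expected_later = login_token(1700000060, "203.0.113.5", "Mozilla/5.0")

# The captured value no longer matches, so a replayed login is rejected.
print(captured == expected_later)
```

A crawler can only log in to such a website if it can reproduce the calculation itself, which requires knowing the algorithm.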
Known systems to cause problems:
- Some ASP.NET login forms
You can identify ASP.NET login forms by searching the HTML output for the string: name="__VIEWSTATE".
Pure speculation and work in progress:
Possibly "viewstate" becomes incorrect even when copying the entire POST data and headers transferred during manual login (and copied using e.g. Firefox Live HTTP Headers).
A possible explanation is that "viewstate" contains a hash-like verification value, much like the one described above under problematic login concepts.
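The identification check for ASP.NET login forms mentioned above can be automated with a short Python sketch. The function name and the sample HTML are assumptions for illustration; the check itself is just a search for the name="__VIEWSTATE" string, tolerant of quote style and spacing.

```python
import re

# Matches name="__VIEWSTATE" with either quote style, ignoring case.
VIEWSTATE_RE = re.compile(r"""name\s*=\s*["']__VIEWSTATE["']""", re.IGNORECASE)


def looks_like_aspnet_form(html):
    """Return True if the page source contains a __VIEWSTATE field."""
    return VIEWSTATE_RE.search(html) is not None


# Hypothetical page sources.
aspnet_page = '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="..." />'
plain_page = '<input type="text" name="user">'
print(looks_like_aspnet_form(aspnet_page), looks_like_aspnet_form(plain_page))
```

A positive result only tells you that the form may be one of the problematic cases; it does not by itself mean that login will fail.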
More Sitemap Generator and Website Login Resources
For some demo website login configurations of common website platforms, see examples of crawling forums.
XML Sitemaps and Password Protected Pages
Restricting access to parts of your website can often have benefits. However, one downside is that it complicates e.g. creating sitemaps for your registered users with access to all parts of your website. This guide has shown how to solve this using our sitemap generator.
However, if you are trying to create XML sitemaps to attract Google and other search engines, you will still need to give these search engines access to your password protected pages.
Study these resources for possible solutions: