Microsys
        

Crawl Login and Password Protected Pages with A1 Website Download

Session Cookie Login Demo Project

We have created a demo project that tests crawler login support for websites that use session cookies. Session cookies are the most commonly used method for website login systems, and they are what PHP uses by default when you call session_start. Most of these website logins use the POST method to transfer login and user data. (Another popular login mechanism is "Basic realm authentication", which is also supported by our website crawler.)
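
To make the session cookie mechanism concrete, here is a minimal Python sketch of the round trip: the server hands out a session ID in a Set-Cookie header, and the client sends it back in a Cookie header on every later request. PHPSESSID is PHP's default session cookie name. The function name and the sample value are assumptions made for illustration, and this is not a description of how A1 Website Download works internally.

```python
from http.cookies import SimpleCookie


def cookie_header_from_response(set_cookie_values):
    """Turn the Set-Cookie header values of a response into the value of
    the Cookie header a client would send back on its next request."""
    jar = SimpleCookie()
    for value in set_cookie_values:
        # Parses "name=value; attribute=..." and keeps one entry per name.
        jar.load(value)
    # Only name=value pairs are sent back; attributes such as path are not.
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())


if __name__ == "__main__":
    # Hypothetical header value, as a PHP site might send after login.
    print(cookie_header_from_response(["PHPSESSID=abc123; path=/"]))
```

A crawler that does not return this cookie looks like a new, logged-out visitor on every request.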

Note: Unfortunately, A1 Website Download currently supports session cookie login only over HTTP, not HTTPS.

You can test the demo website with login support online, or download it as a zip file. For immediate testing, download the zipped demo project file as well.

The username and password required to log in successfully are highlighted on the login page.

  1. First, manually test that login works:
    [Screenshot: manually check login]
    Notice that every page after login states that the user is logged in.

  2. We configure the website crawl root directory:
    [Screenshot: path configuration]
    This is done in Scan website > Paths.

  3. We check the source of the login page:
    [Screenshot: html login form source]
    • You can use View source in e.g. Firefox.
    • If the action attribute of the <form> tag is empty, the action destination URL is the same as the login page URL.
    • The name attributes of the <input> tags vary from website to website.
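
The same inspection can be sketched in Python with the standard library HTML parser. This is an illustrative sketch only: the class and function names are made up, and the field names in the sample form (user, pass, login) are hypothetical, since they differ from website to website.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LoginFormParser(HTMLParser):
    """Collect the first <form>'s action/method and its named <input> tags."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.seen_form = False
        self.action = ""
        self.method = "get"
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self.seen_form:
            self.in_form = self.seen_form = True
            self.action = attrs.get("action") or ""
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self.in_form and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def describe_login_form(page_url, html):
    """Return the destination URL, method and input names of the login form."""
    parser = LoginFormParser()
    parser.feed(html)
    # An empty action means the form posts back to the login page itself.
    action = urljoin(page_url, parser.action) if parser.action else page_url
    return {"action": action, "method": parser.method, "fields": parser.fields}


# Hypothetical login form; real field names must be read from the page source.
SAMPLE = """
<form action="" method="post">
  <input type="text" name="user">
  <input type="password" name="pass">
  <input type="submit" name="login" value="Login">
</form>
"""

if __name__ == "__main__":
    print(describe_login_form("http://example.com/demo/login.php", SAMPLE))
```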


  4. We configure the login options:
    [Screenshot: login configuration]
    This is done in Scan website > Crawler identification.

    • When manually testing login, you can use a Firefox plugin called Live HTTP Headers to see the headers transferred during the login process. Here is how:
      • Get the Live HTTP Headers plugin for Firefox
        • Log in to the website in Firefox.
        • Focus on the HTTP header data from the first page.
        • Note the website address Firefox connects to.
        • Note the content (POST data query string) it sends.

      • Check the HTML source of the website login page
        • Search for <form> and <input> tags related to website login.

      • Use this data to configure headers to send.
    • If you still have trouble getting login to work, you can also use the general protocol analysis tool Wireshark.
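
To see what the address and POST data query string look like when assembled, the login request can be sketched in Python without sending anything. The URL, the field names and the function names here are hypothetical. The sketch only shows the kind of request a browser sends, which is what you would compare against the captured headers.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener


def build_login_request(action_url, fields):
    """Build (but do not send) the POST request a login form would submit."""
    # urlencode produces the "name=value&name=value" POST data query string.
    body = urlencode(fields).encode("ascii")
    return Request(
        action_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def make_session_opener():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = CookieJar()
    return build_opener(HTTPCookieProcessor(jar)), jar


if __name__ == "__main__":
    # Hypothetical destination URL and credentials.
    request = build_login_request(
        "http://example.com/demo/login.php",
        {"user": "demo", "pass": "secret"},
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("ascii"))
```

Passing the request to the opener's open method would send it and keep any session cookie from the response in the jar for later requests.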


  5. We need to filter out all URLs that will cause website logout during crawl:
    [Screenshot: ignore logout paths]
    This is done in Scan website > Crawler filters.
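
The idea behind this filter can be sketched in Python as a simple substring check on each URL's path and query. The marker words and function names are assumptions for illustration, and this is not the filter syntax used by A1 Website Download. Look at the demo site's logout link to see which text actually identifies it.

```python
from urllib.parse import urlsplit

# Hypothetical markers; use whatever identifies the logout link on your site.
LOGOUT_MARKERS = ("logout", "logoff", "signout")


def is_logout_url(url, markers=LOGOUT_MARKERS):
    """True if the URL's path or query contains one of the logout markers."""
    parts = urlsplit(url)
    # The host name is left out so a domain such as logout.example.com
    # does not exclude every page on the site.
    target = (parts.path + "?" + parts.query).lower()
    return any(marker in target for marker in markers)


def crawlable(urls, markers=LOGOUT_MARKERS):
    """Keep only URLs that will not end the session when requested."""
    return [url for url in urls if not is_logout_url(url, markers)]


if __name__ == "__main__":
    print(crawlable([
        "http://example.com/demo/index.php",
        "http://example.com/demo/logout.php",
        "http://example.com/demo/page.php?action=Logout",
    ]))
```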

  6. Start the website scan. An easy way to verify that login works is to view the pages downloaded by A1 Website Download: they should all state that the user is logged in.
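
For larger downloads, this check can be automated. The Python sketch below lists downloaded pages that do not contain a given marker text. The marker string, the file patterns and the function name are assumptions. Pick a marker that appears only on logged-in pages, because a phrase like "logged in" would also match "not logged in".

```python
from pathlib import Path


def pages_missing_marker(root, marker="logged in",
                         patterns=("*.html", "*.htm", "*.php")):
    """Return downloaded pages under root that do not contain the marker."""
    missing = []
    for pattern in patterns:
        for path in sorted(Path(root).rglob(pattern)):
            text = path.read_text(encoding="utf-8", errors="replace")
            # Case-insensitive search for the logged-in marker text.
            if marker.lower() not in text.lower():
                missing.append(path)
    return missing


if __name__ == "__main__":
    # Hypothetical download folder; point this at your own download path.
    for page in pages_missing_marker("downloaded-site"):
        print("login marker not found:", page)
```

An empty result suggests the crawler stayed logged in for every page it saved.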

© Copyright 1997-2010 Microsys