Website Scraper for Login Password Protected Pages

Scan and crawl website with website scraper, even if the website requires login username and password.

To see all the options available, you will have to switch off easy mode.

With options that use a dropdown list, any [+] or [-] button next to it adds or removes items in the list itself.

Login Support for HTTPS Websites

If your website uses HTTPS, you may need to configure A1 Website Scraper for this.

For more information, see this help page about HTTPS.


Always Configure First: URL Exclude Filters

Important: If you perform a user login, it is very important that you verify yourself that the crawler does not follow links that can delete or alter content.

You can do this in two ways:
  • Have a user account that cannot edit or delete any content or settings. (Safest.)
  • Limit the crawler so it does not follow any unwanted links. (Unsafe.)

Note: It is also important to prevent the crawler from following any logout link, since the crawler would otherwise log itself out.

You can control which URLs A1 Website Scraper fetches during the website crawl by excluding them in analysis filters and output filters.

Be sure to test that your filters are configured correctly and work as intended. Also note that we cannot take responsibility if something goes wrong, either with the configuration or in the software.

Note: There is a preset available in Scan website | Quick presets... that can help exclude some common patterns of unwanted URLs:

exclude common unwanted URLs when using login functionality
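
For illustration only, such filters typically target URLs containing substrings like the following (hypothetical examples; the actual preset contents may differ):

  logout
  logoff
  signout
  delete
  remove
  edit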


Website Login - Using Windows Internet Explorer to Login

This is the easiest login method to use since it requires the least configuration. However, it only works on Windows.

In Scan website | Crawler engine select HTTP using WinInet engine and settings (Internet Explorer).

Each time you want to initiate the website scan, do the following:
  1. In Scan website | Crawler login click the button Open embedded browser and login before crawl.
  2. Navigate to the login section of your website and log in like you normally would.
  3. You can now close the embedded browser window.

[Screenshot: login using the embedded browser]

This combination will ensure that A1 Website Scraper has access to all cookies transferred during the login. You can now start the website scan.
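
As background, the WinInet engine and Internet Explorer share the same cookie store, which is why the login survives closing the embedded browser window. Below is a minimal Python sketch of reading that store, assuming Windows and using ctypes; the URL is a hypothetical placeholder:

    import ctypes
    from ctypes import wintypes

    wininet = ctypes.windll.wininet  # Windows only
    url = "https://www.example.com/"  # hypothetical URL you logged in to

    # First call queries the required buffer size for the cookie data.
    size = wintypes.DWORD(0)
    wininet.InternetGetCookieW(url, None, None, ctypes.byref(size))

    if size.value:
        buf = ctypes.create_unicode_buffer(size.value)
        wininet.InternetGetCookieW(url, None, buf, ctypes.byref(size))
        print("WinInet cookies:", buf.value)
    else:
        print("No WinInet cookies stored for", url)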


Website Login: Protocol Based Login and Authentication Methods

In addition to the most common website login method of session cookies and POST forms (which most of the rest of this help page covers), there are some other popular login mechanisms called NTLM, SSPI, Digest and Basic Realm Authentication. While support for some of these login methods is still work in progress, they can sometimes be used for website login.

You can recognize websites that use these methods by login dialogs like this:

[Screenshot: website login dialog with basic realm authentication]

It is very easy to configure the crawler in our website scraper software for this login method:

[Screenshot: website login configuration with basic realm authentication]

If the above does not work, you can try setting Default path and type handler in Scan website | Crawler engine from HTTP using Indy engine for internet and localhost to HTTP using WinInet engine and settings (Internet Explorer), and then log in with Internet Explorer before starting the website crawl.
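
If you are curious what these mechanisms look like from an HTTP client's point of view, here is a minimal Python sketch assuming the requests library; the URL and credentials are hypothetical placeholders:

    import requests
    from requests.auth import HTTPBasicAuth, HTTPDigestAuth

    url = "https://www.example.com/protected/"

    # Basic Realm Authentication: credentials travel in an
    # "Authorization: Basic base64(user:password)" request header.
    response = requests.get(url, auth=HTTPBasicAuth("user", "password"))
    print(response.status_code)

    # Digest Authentication: the client answers a server challenge with
    # a hash instead of sending the password itself.
    response = requests.get(url, auth=HTTPDigestAuth("user", "password"))
    print(response.status_code)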

Note: The rest of this help page is about post form website login.


Website Login - Post Form / Session Cookies: FireFox Live HTTP Headers

Ensure that the following configuration is done:
  • Set option Default path and type handler in Scan website | Crawler engine to HTTP using Indy engine for internet and localhost.

    [Screenshot: choose HTTP communication handler]

At this point, it is the HTTP using Indy engine for internet and localhost option that has the best support for session cookies, something many login systems depend on.

When manually testing login you can use a FireFox plugin called Live HTTP Headers to see the headers transferred during the login process:

After installing the FireFox Live HTTP Headers plugin:
  • Clear all HTTP headers already collected.
  • Log in to the website in the FireFox browser.
  • Now focus on the logged HTTP header data from the first entry / page.
  • Notice the website address FireFox connects to.
  • Notice the content (POST data query string) it sends.
  • Use this data to configure the headers to send.

[Screenshot: login using firefox live http headers]

Having done that, you just copy-and-paste the appropriate values into the A1 Website Scraper login configuration:

[Screenshot: copy and paste login data]
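
For illustration, a captured login entry typically looks something like this (a hypothetical example; the website address and POST data query string will differ for your site):

    POST /login.php HTTP/1.1
    Host: www.example.com
    Content-Type: application/x-www-form-urlencoded
    Content-Length: 27

    username=demo&password=demo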

If you are looking for an alternative to FireFox Live HTTP Headers you can check out Fiddler (for Internet Explorer) and WireShark (general tool).


Website Login - Post Form / Session Cookies: Details and Demo Project

We have created a demo project that tests crawler login support for websites that use session cookies.

Session cookies are the most commonly used method in website login systems. Most of these website logins use the POST method for transferring login and user data. Cookie-based sessions are also what PHP defaults to when using session_start().
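
To make the mechanism concrete, here is a minimal Python sketch of a session cookie based POST form login, assuming the requests library; the URLs and form field names are hypothetical placeholders:

    import requests

    session = requests.Session()  # stores and resends cookies automatically

    # 1. POST the login form; a PHP site typically answers with a
    #    Set-Cookie header containing a PHPSESSID session cookie.
    session.post("https://www.example.com/login.php",
                 data={"username": "demo", "password": "demo"})
    print(session.cookies.get_dict())

    # 2. Later requests in the same session resend the cookie, which is
    #    how the server recognizes the logged-in visitor.
    page = session.get("https://www.example.com/page1.php")
    print("logged in" in page.text)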

You can test the demo website with login support online, or download it zipped. For immediate testing, download the zipped demo project file as well.

The username and password required to log in successfully are highlighted on the login page.

  1. First test manually that login support works:

    [Screenshot: manually check login]

    Notice how all pages after login state that the user is logged in.

  2. We configure the website crawl root directory:

    [Screenshot: path configuration]

    This is done in Scan website | Paths.

  3. We check the source of the login page:

    [Screenshot: html login form source]

    • You can use View source in e.g. FireFox.
    • Search for <form> and <input> tags related to the website login.
    • If the URL in the <form> tag action attribute is empty, it means the action destination URL is the same as the login page URL.
    • The name attributes in the <input> tags vary from website to website. (A small sketch of extracting these values follows below.)
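
    For illustration, here is a minimal Python sketch of pulling these values out of the page source, using only the standard library; the HTML is a hypothetical example of what View source might show:

      from html.parser import HTMLParser

      # Hypothetical page source as seen via "View source".
      html = """
      <form action="login.php" method="post">
        <input type="text" name="username">
        <input type="password" name="password">
        <input type="submit" value="Login">
      </form>
      """

      class FormParser(HTMLParser):
          def handle_starttag(self, tag, attrs):
              # Print the form action URL and each input field name.
              if tag in ("form", "input"):
                  print(tag, dict(attrs))

      FormParser().feed(html)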


  4. We configure the login options:

    [Screenshot: login configuration]

    This is done in Scan website | Crawler identification.

  5. We need to filter out all URLs that would cause a website logout during the crawl:

    [Screenshot: ignore logout paths]

    This is done in Scan website | Analysis filters and Scan website | Output filters.

  6. Start the website scan. An easy way to test and verify that login works is by using A1 Website Download: just view the downloaded pages; they should all state logged in. (A small sketch of automating this check follows below.)
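
To automate that last verification step, here is a minimal Python sketch that scans the downloaded pages for the logged in marker; the folder path is a hypothetical example:

    from pathlib import Path

    for path in Path("downloaded-site").rglob("*.html"):
        text = path.read_text(errors="ignore")
        status = "OK" if "logged in" in text else "NOT logged in"
        print(status, path)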



Website Login - Post Form / Session Cookies: Known Problems and Issues

Login concepts known to cause problems:
  1. Upon first login, a unique calculated value is passed in the login form. An example could be Javascript code that, based on e.g. the exact time, IP address, browser user agent ID etc., calculates a hash value that is passed in the login form. The server knows the algorithm with which the hash was generated and validates it server-side.

The above makes it almost impossible to get website scraper login working correctly unless you have direct access to the website and know its internals very well.

Known systems to cause problems:
  1. Some ASP.Net login forms

    You can identify ASP.Net login forms by searching the HTML output for the string: name="__VIEWSTATE". (A small sketch of such a check follows after this list.)

    Pure speculation and work in progress:
    Possibly the "viewstate" becomes incorrect even when copying the entire POST data and headers transferred during a manual login (captured using e.g. FireFox Live HTTP Headers). A possible explanation is that "viewstate" contains a "hash"-like verification value, much like #1 in the problematic login concepts explained above.
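
For illustration, the identification check mentioned above takes only a few lines; a minimal Python sketch with a hypothetical URL:

    import requests

    # Fetch the login page source and look for the ASP.Net marker.
    html = requests.get("https://www.example.com/login.aspx").text
    if 'name="__VIEWSTATE"' in html:
        print("This looks like an ASP.Net login form")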


Alternative for Crawling Login Based Websites

If you own the website, you can code it in a way that gives full access to crawlers with specific user agent strings.

You can configure this in General options | Internet crawler | User agent ID:

[Screenshot: configure user agent id]
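
For illustration, here is a minimal Python sketch of such a server-side check, assuming a Flask website; the user agent string MyCrawler-1.0 is a hypothetical value that you would also enter in the crawler configuration:

    from flask import Flask, request

    app = Flask(__name__)

    TRUSTED_AGENT = "MyCrawler-1.0"  # hypothetical agreed-upon string

    def is_trusted_crawler():
        # Grant access when the request carries the agreed user agent.
        return TRUSTED_AGENT in request.headers.get("User-Agent", "")

    def user_is_logged_in():
        return False  # placeholder for the site's real session check

    @app.route("/members")
    def members():
        if is_trusted_crawler() or user_is_logged_in():
            return "Members-only content"
        return "Please log in", 403

    if __name__ == "__main__":
        app.run()

Keep in mind that user agent strings can be forged, so only use this approach if the exposed content is not sensitive.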