How to Download Websites From Internet Archives

If you have lost your old website and its content, it may still be possible to download it from internet archive websites and services.

Do You Have Permission to Download Your Old Website?

If you have ever lost your own copyrighted content and websites, e.g. because you forgot to renew your webserver and hosting subscriptions, you know how big a problem it can be to rebuild it all.

Some internet archive websites have a policy where they:
  • Download, store and show, without the owners' permission, all websites they find interesting.
  • State you are not allowed to download your own old websites and content from them.

As we are not lawyers, we cannot give legal advice, and hence we do not know whether you are allowed, e.g. under fair use, to download your own copyrighted websites and content from such internet archives.

You may want to either:
  • Ask for explicit permission from the archive that has a copy of your website.
  • Contact a lawyer and seek legal advice before you proceed.


How to Configure A1 Website Download

  • Scan website > Paths
    • Set the website domain address and/or directory path to the root domain address where the internet archive keeps the pages of your old site, e.g. http://content.example.org.
    • In Beyond website root path, initiate scanning from paths, add the path to the root of your archived website, e.g. http://content.example.org/archives/timestamp-and-more/http://example.com/.

      Note: The file download path of this URL is also the best starting point to view and surf the downloaded content offline.
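      For orientation, here is a minimal Python sketch of how such an archived starting URL is composed, assuming a Wayback-style layout where the original URL is appended verbatim after an archive root and a snapshot timestamp. The archive root, timestamp format and site below are hypothetical placeholders, not values from any specific archive service:

        # Hypothetical values mirroring the example path above.
        ARCHIVE_ROOT = "http://content.example.org/archives"
        TIMESTAMP = "20110815123456"   # assumed snapshot timestamp format
        ORIGINAL_SITE = "http://example.com/"

        def archived_start_url(archive_root: str, timestamp: str, original: str) -> str:
            """Build the 'Beyond website root path' starting URL for the scan."""
            return f"{archive_root}/{timestamp}/{original}"

        print(archived_start_url(ARCHIVE_ROOT, TIMESTAMP, ORIGINAL_SITE))
        # -> http://content.example.org/archives/20110815123456/http://example.com/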

  • Scan website > Crawler options
    • Uncheck option Correct "//" when used instead of "/" in internal links.
    • Uncheck option Fix "internal" URLs if website root URL redirects to a different address.
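      Note: Archived URLs embed the original URL verbatim, so a legitimate "//" appears inside the path (the "http://" of your old site). This minimal Python sketch, using a hypothetical archive URL, shows why a naive "//" correction would corrupt such a URL:

        import re

        archived = "http://content.example.org/archives/20110815123456/http://example.com/page.html"

        # Naive '//' -> '/' collapsing applied to everything after the scheme:
        scheme, rest = archived.split("://", 1)
        broken = scheme + "://" + re.sub(r"/{2,}", "/", rest)

        print(broken)
        # -> http://content.example.org/archives/20110815123456/http:/example.com/page.html
        # The embedded "http://" of the original site lost a slash and no longer resolves.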

  • Scan website > Crawler Engine
    • Set Max simultaneous connections (data transfer) to 2. This minimizes the load you place on the server that keeps a copy of your website in its archive.
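      To illustrate the principle behind this setting (this is a sketch, not A1's internal implementation), here is a minimal Python example that caps downloads at two simultaneous connections using a semaphore; the URLs are hypothetical:

        import threading
        import urllib.request

        MAX_CONNECTIONS = 2
        gate = threading.Semaphore(MAX_CONNECTIONS)

        def fetch(url: str) -> None:
            with gate:  # at most MAX_CONNECTIONS downloads run at once
                with urllib.request.urlopen(url, timeout=30) as resp:
                    data = resp.read()
            print(f"{url}: {len(data)} bytes")

        urls = [f"http://content.example.org/archives/20110815123456/http://example.com/page{i}.html"
                for i in range(10)]
        threads = [threading.Thread(target=fetch, args=(u,)) for u in urls]
        for t in threads: t.start()
        for t in threads: t.join()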

  • Scan website > Analysis filters
    • In limit analysis of internal URLs to which match as "relative path" OR "text" OR "regex" in list, add a limit-to filter that restricts which page URLs get downloaded and analyzed. An example could be ::201(0|1)[-0-9A-Za-z_]+/https?://(www\.)?example\.com.

      Note: By adding such filters, you can limit the crawl and analysis to the exact parts you need. However, since some archive services redirect pages to other dates and URL versions (e.g. with and without the www. prefix), your filters should not be too specific.
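      Here is a minimal Python sketch of how the example regex behaves, assuming the leading "::" only marks the entry as a regex in the filter list; the URLs are hypothetical:

        import re

        pattern = re.compile(r"201(0|1)[-0-9A-Za-z_]+/https?://(www\.)?example\.com")

        urls = [
            "http://content.example.org/archives/20110815123456/http://www.example.com/about.html",  # 2011 snapshot
            "http://content.example.org/archives/20100101000000/https://example.com/",               # 2010 snapshot
            "http://content.example.org/archives/20090101000000/http://example.com/",                # 2009 snapshot
            "http://content.example.org/archives/20110815123456/http://other-site.com/",             # wrong site
        ]

        for url in urls:
            verdict = "kept" if pattern.search(url) else "skipped"
            print(f"{verdict}: {url}")
        # The 2010/2011 snapshots of example.com are kept; the 2009 snapshot
        # and the other-site.com URL are skipped.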

  • Scan website > Output filters
    • In limit output of internal URLs to which match as "relative path" OR "text" OR "regex" in list, add a limit-to filter that restricts which page URLs get downloaded and included in the output. An example could be ::201(0|1)[-0-9A-Za-z_]+/http://example\.com.

      Note: Using this requires extra care and is only relevant if you need to limit the download very precisely to the exact parts you need.
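      To illustrate the difference between the two filter stages, here is a minimal Python sketch under the assumption, drawn from the steps above, that analysis filters decide which URLs get crawled while output filters decide which URLs appear in the final download; the URLs are hypothetical:

        import re

        analysis_filter = re.compile(r"201(0|1)[-0-9A-Za-z_]+/https?://(www\.)?example\.com")
        output_filter = re.compile(r"201(0|1)[-0-9A-Za-z_]+/http://example\.com")

        discovered = [
            "http://content.example.org/archives/20110101000000/http://example.com/index.html",
            "http://content.example.org/archives/20110101000000/http://www.example.com/blog.html",
            "http://content.example.org/archives/20080101000000/http://example.com/old.html",
        ]

        crawled = [u for u in discovered if analysis_filter.search(u)]  # 2010/2011 snapshots only
        included = [u for u in crawled if output_filter.search(u)]      # and only the non-www URLs

        print(included)
        # -> ['http://content.example.org/archives/20110101000000/http://example.com/index.html']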


While still testing the configuration, you may want to uncheck:
  • Older versions:
    • Scan website > Crawler options > Apply "webmaster" and "output" filters after website scan stops
  • Newer versions:
    • Scan website > Output filters > After website scan stops: Remove URLs excluded
    • Scan website > Webmaster filters > After website scan stops: Remove URLs with noindex/disallow