How to Download Websites From Internet Archives
If you lost your old website and its content, it may still be possible to download it from some internet archive websites and services.
If you have ever lost your own copyrighted content and websites, e.g. because you forgot to renew your web server and hosting plans, you know how big a problem it can be to rebuild it all.
Some internet archive websites have policies under which they:
- Download, store, and display, without permission, any websites they find interesting.
- State that you are not allowed to download your own old websites and content from them.
As we are not lawyers, we cannot give legal advice, and we therefore do not know whether you are allowed, e.g. under fair use, to download your own copyrighted websites and content from such internet archives.
You may want to either:
- Ask for explicit permission from the archive that has a copy of your website.
- Contact a lawyer and seek legal advice before you proceed.
- Scan website > Paths
  - Set the website domain address and/or directory path to the same root domain address the internet archive pages of your old site are located on, e.g. http://content.example.org.
  - In Beyond website root path, initiate scanning from paths, add the path to the root of your archived website, e.g. http://content.example.org/archives/timestamp-and-more/http://example.com/. (See the sketch below for how such archive URLs are composed.)
    Note: The file download path of this URL is also the best starting point to view and browse the downloaded content offline.
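The sketch below (Python, using only the placeholder addresses from this guide, not real URLs) shows how such an archived page address is typically composed from the archive root, a snapshot timestamp path, and the original site URL.

```python
# A minimal sketch of how an archived page address is typically composed.
# The archive host, snapshot path and original site URL are the placeholder
# examples used in this guide, not real addresses.

archive_root = "http://content.example.org"    # value for the website domain address
snapshot_path = "archives/timestamp-and-more"  # varies per archived snapshot
original_site = "http://example.com/"          # the site you want to recover

# The path to start scanning from ("Beyond website root path, initiate scanning from paths"):
start_path = f"{archive_root}/{snapshot_path}/{original_site}"
print(start_path)
# -> http://content.example.org/archives/timestamp-and-more/http://example.com/
```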
- Scan website > Crawler options
  - Uncheck the option Correct "//" when used instead of "/" in internal links.
  - Uncheck the option Fix "internal" URLs if website root URL redirects to a different address.
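To see why the first option matters here: archived URLs embed the original site's full address, so "//" occurs legitimately inside the path. The Python sketch below, again using the placeholder addresses from this guide, shows how a naive "//" to "/" correction would corrupt such a link.

```python
# Archived URLs embed the original site's full address, so "//" occurs
# legitimately inside the path. A naive "//" -> "/" correction applied after
# the scheme would corrupt the embedded URL, which is why the option is
# unchecked. The URL is the placeholder example from this guide.

archived_url = "http://content.example.org/archives/timestamp-and-more/http://example.com/page"

scheme, rest = archived_url.split("://", 1)
naively_corrected = scheme + "://" + rest.replace("//", "/")

print(archived_url)        # .../http://example.com/page  (correct)
print(naively_corrected)   # .../http:/example.com/page   (broken)
```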
- Scan website > Crawler Engine
  - Set Max simultaneous connections (data transfer) to 2. This minimizes the load we place on the server that keeps a copy of your website in its archive.
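If you ever script your own downloads against an archive, the sketch below illustrates the same courtesy in plain Python: at most two requests run at once, each with a small delay. The URLs are placeholders from this guide, and this is only an illustration of the idea, not a description of the crawler's internals.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# Illustration of "max 2 simultaneous connections": at most two downloads run
# at once, each with a short pause, to keep the load on the archive server low.
# The URLs are the placeholder examples from this guide, not real addresses.
urls = [
    "http://content.example.org/archives/timestamp-and-more/http://example.com/",
    "http://content.example.org/archives/timestamp-and-more/http://example.com/about",
]

def fetch(url):
    time.sleep(1)  # small courtesy delay per request
    with urlopen(url, timeout=30) as response:
        return len(response.read())

with ThreadPoolExecutor(max_workers=2) as pool:  # 2 simultaneous connections
    for url, size in zip(urls, pool.map(fetch, urls)):
        print(size, "bytes from", url)
```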
- Scan website > Analysis filters
  - In limit analysis of internal URLs to which match as "relative path" OR "text" OR "regex" in list, add a limit-to filter that restricts which page URLs get downloaded and analyzed. An example could be ::201(0|1)[-0-9A-Za-z_]+/https?://(www\.)?example\.com (tested in the sketch below).
    Note: By adding such filters, you can limit crawling and analysis to the exact parts you need. However, since some archive services redirect pages to other dates and URL versions (e.g. with and without the www. part), your filters should not be too specific.
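If you want to verify such a filter before starting a long scan, the Python sketch below applies the example regex to a few hypothetical archived URLs. (The leading "::" appears to mark the entry as a regex in the tool and is assumed not to be part of the pattern itself.)

```python
import re

# Quick check of the example analysis-filter regex against a few hypothetical
# archived URL variants; all addresses are placeholders for illustration.
pattern = re.compile(r"201(0|1)[-0-9A-Za-z_]+/https?://(www\.)?example\.com")

candidates = [
    "http://content.example.org/archives/20110203000000/http://example.com/page",
    "http://content.example.org/archives/20100102000000/https://www.example.com/",
    "http://content.example.org/archives/20090101000000/http://example.com/",  # 2009: no match
    "http://content.example.org/archives/20110203000000/http://other.org/",    # other site: no match
]

for url in candidates:
    print("match" if pattern.search(url) else "skip ", url)
```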
- Scan website > Output filters
  - In limit output of internal URLs to which match as "relative path" OR "text" OR "regex" in list, add a limit-to filter that restricts which page URLs get downloaded and included in the output. An example could be ::201(0|1)[-0-9A-Za-z_]+/http://example\.com.
    Note: Using this requires extra care and is only relevant if you need to limit the download very narrowly to the exact parts you need (see the comparison below).
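To make the "extra care" concrete, the sketch below compares the broader analysis filter with the stricter output filter from the examples above: a snapshot URL with an https or www. variant passes the analysis filter but not the output filter, so it would be analyzed yet excluded from the output. All URLs are placeholders.

```python
import re

# The analysis filter (broader) versus the output filter (stricter, no https
# and no "www.") from the examples above. URLs are illustrative placeholders.
analysis = re.compile(r"201(0|1)[-0-9A-Za-z_]+/https?://(www\.)?example\.com")
output   = re.compile(r"201(0|1)[-0-9A-Za-z_]+/http://example\.com")

url = "http://content.example.org/archives/20110203000000/https://www.example.com/"
print("analyzed:", bool(analysis.search(url)))  # True
print("in output:", bool(output.search(url)))   # False
```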
While still testing the configuration, you may want to uncheck:
- Older versions:
  - Scan website | Crawler options | Apply "webmaster" and "output" filters after website scan stops
- Newer versions:
  - Scan website | Output filters | After website scan stops: Remove URLs excluded
  - Scan website | Webmaster filters | After website scan stops: Remove URLs with noindex/disallow