• Using WGET to Mirror Websites

    A Step-by-Step Guide to Wielding GNU WGET Commands to Generate a Mirrored Copy of a Website


    GNU WGET can be used to download copies of web sites.  This has a number of uses: it allows you to explore a web site with local tools (like find and grep), to make historical copies of a web site for archival purposes, and to mirror web sites, particularly to web hosting platforms that work well with static content (such as web caching servers / accelerators or Google App Engine; see “Howto: Distributed Static Web Hosting with Google App Engine”).  This article offers a detailed, step-by-step guide to setting up GNU WGET to make mirrored copies of web sites quickly and easily.

    About WGET

    The following is from the WGET manual (link to WGET manual)

    GNU Wget is a free utility for non-interactive download of files from the Web.  It supports HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies.

    Another, slightly less technical description of WGET: GNU Wget is a free piece of software that downloads files like a web browser would.

    WGET is a great tool because it lets you automate the downloading of files and web pages from web sites over the Internet.  That is, you can write programs — scripts — that can download files for you.

    For example, suppose that there is a new data file that you need to download every day for a project you’re working on.  Fortunately for us, these files are named by the date.  You could use WGET to download these files for you automatically via cron (for UNIX / Linux people):

    /path/to/wget http://www.example.com/files/`date +%Y-%m-%d`.csv

    This would download the current date’s file (in the form YYYY-MM-DD.csv where YYYY is the year, MM is the month, and DD is the day).
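
    To run that download automatically, you could schedule it with cron. A crontab entry along these lines would fetch each day’s file (the 6am run time and the paths are just illustrative; note that ‘%’ must be escaped as ‘\%’ inside a crontab entry):

    # illustrative crontab entry: fetch the current date's CSV at 6:00 every morning
    0 6 * * * /usr/bin/wget -q -P "$HOME/data" "http://www.example.com/files/`date +\%Y-\%m-\%d`.csv"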

    WGET can also be used to mirror entire web sites, leaving you with a complete copy of the target web site.  This can be helpful for any number of reasons, from having a local copy to browse offline, to allowing you to use your computer’s search tools (like find and grep) to find specific content, to allowing you to mirror the target web site on your web hosting service (such as Google App Engine).

    Mirroring with WGET

    To mirror a web site is to download it in such a way that the local copy faithfully reproduces the original.

    WGET, by default, will download the one page that you’re asking for, and it will save the file exactly the way it found it, without any modification. This can be very useful for making a copy of a single page or an archive of the original, but it’s not very helpful when mirroring web sites, especially those with dynamic (i.e. non-static) content, for two reasons:

    1. All of the links in the mirror will point back to their original locations. So, while the first page may come through just fine, if you follow any of the links on that page, they’ll send you back to the original site.
    2. If any of those web pages had extensions other than .html (or .htm), your web browser or web server will likely not interpret them correctly. When a web browser requests a URL with a .php extension, the web server being queried would normally use PHP to interpret the page and render its contents in a form the web browser could understand, presumably with a MIME type appropriate for the content. The full story is more complicated, but it’s sufficient to say that a raw copy of such pages will cause problems.

    So, our solution is to set up WGET to translate our documents into a form that can be more easily accessed locally. Specifically, we’ll configure it to rewrite the links it finds to point to their local equivalents, and we’ll save all web content that doesn’t have a .html extension with a .html extension (rewriting the links accordingly).

    The result is a mirror that can be browsed properly with or without a web server!
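
    For example, once the mirror is on disk, you can open its index.html directly in a browser, or serve the directory with any static file server to double-check the result (the path below is the same illustrative /path/to/directory used in the command later in this article):

    # serve the mirrored copy locally for a quick sanity check
    cd /path/to/directory && python3 -m http.server 8080
    # then browse http://localhost:8080/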

    This also gives us an interesting side-benefit: the mirror consists entirely of static content, which can easily be used as a front-end to the actual web site. This has the potential to greatly speed up web page load times and significantly reduce server load!

    Moreover, because static web content isn’t processed by the web server (it’s just read from the filesystem and passed directly along to the browser), you can greatly increase the security of your web server by only providing access to the static copy of your web site. For example, consider the following scenario:

    Two web servers are set up: one that can be accessed from the Internet and contains only static web content (i.e. the mirror you’ve made with WGET), and another that runs your Content Management System (CMS) but can only be accessed from your local network. You can even have a Content Delivery Network (CDN) serve your static content (such as Google App Engine; see Howto: Distributed Static Web Hosting with Google App Engine for more information).

    With a setup like that, your web site could scale as large as you need while remaining very simple to administer.
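
    One possible way to wire this up, sketched below with assumed hostnames and paths (cms.internal, public.example.com, /srv/mirror/), is to run the mirror job from the local network and push the result to the public host:

    # 1. mirror the internal CMS site from a machine on the local network
    wget --mirror --adjust-extension --convert-links --no-host-directories \
         --directory-prefix=/srv/mirror/ http://cms.internal/
    # 2. push the static copy to the public-facing, static-only web server
    rsync -az --delete /srv/mirror/ www@public.example.com:/var/www/html/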

    Running WGET

    GNU WGET is designed to run in a UNIX / Linux environment.  It can also be run on a Microsoft Windows system that has the CYGWIN environment installed (see http://www.cygwin.com/).

    The Command

    The following WGET command will mirror a web site:

    /path/to/wget \
        --recursive \
        --level=10 \
        --adjust-extension \
        --convert-links \
        --backup-converted \
        --no-host-directories \
        --page-requisites \
        --timestamping \
        --force-html \
        --directory-prefix=/path/to/directory/ \
        http://domain.tld/
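
    For reference, the same command can be written with the equivalent short options documented in the appendix below:

    /path/to/wget -r -l 10 -E -k -K -nH -p -N -F -P /path/to/directory/ http://domain.tld/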

    Automating WGET

    You can set up WGET to run as an automated task via cron on your UNIX / Linux system. My preference is to create a short wrapper script that cron can call when required to perform the desired action. A sample wrapper for WGET has been provided for your convenience (link to WGET wrapper script: cp_website.sh).
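
    The linked script isn’t reproduced here, but a minimal wrapper along these lines would do the job (the argument handling and wget path are illustrative assumptions, not the actual cp_website.sh):

    #!/bin/sh
    # cp_website.sh (illustrative sketch) -- usage: cp_website.sh <url> <destination-directory>
    URL="$1"
    DEST="$2"
    /usr/bin/wget --recursive --level=10 --adjust-extension --convert-links \
        --backup-converted --no-host-directories --page-requisites \
        --timestamping --directory-prefix="$DEST" "$URL"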

    If I wanted to download the web site at www.example.com once per day at 4am, I would add the following to my crontab:

    0 4 * * * $HOME/bin/cp_website.sh "http://www.example.com/" "$HOME/archives/example.com/"

    This would download “www.example.com” and store it to “archives/example.com/” in my home directory every day at 4am. See the appendix below for notes on the crontab syntax.
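
    If you would rather refresh the mirror weekly instead of daily, only the schedule fields change; for example (the Sunday 2:30am schedule here is just illustrative):

    # run the mirror script every Sunday at 2:30am
    30 2 * * sun $HOME/bin/cp_website.sh "http://www.example.com/" "$HOME/archives/example.com/"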

    Appendix: Notes from the WGET Manual

    The following are notes from the WGET manual.  The switches, parameters, and options used in the command above may or may not be perfect for your particular application, so here are the WGET manual entries that were used in preparing that command.

    ‘-r’, ‘--recursive’

    Turn on recursive retrieving. The default maximum depth is 5.

    ‘-l depth’, ‘--level=depth’

    Specify recursion maximum depth level depth.

    ‘-E’, ‘--adjust-extension’

    If a file of type ‘application/xhtml+xml’ or ‘text/html’ is downloaded and the URL does not end with the regexp ‘\.[Hh][Tt][Mm][Ll]?’, this option will cause the suffix ‘.html’ to be appended to the local filename. This is useful, for instance, when you’re mirroring a remote site that uses ‘.asp’ pages, but you want the mirrored pages to be viewable on your stock Apache server. Another good use for this is when you’re downloading CGI-generated materials. A URL like

    http://site.com/article.cgi?25

    will be saved as:

    article.cgi?25.html

    Note that filenames changed in this way will be re-downloaded every time you re-mirror a site, because Wget can’t tell that the local X.html file corresponds to remote URL ‘X’ (since it doesn’t yet know that the URL produces output of type ‘text/html’ or ‘application/xhtml+xml’).

    As of version 1.12, Wget will also ensure that any downloaded files of type ‘text/css’ end in the suffix ‘.css’, and the option was renamed from ‘--html-extension’, to better reflect its new behavior. The old option name is still acceptable, but should now be considered deprecated.

    At some point in the future, this option may well be expanded to include suffixes for other types of content, including content types that are not parsed by Wget.

    ‘-k’, ‘--convert-links’

    After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.

    Each link will be changed in one of the two ways:

    • The links to files that have been downloaded by Wget will be changed to refer to the file they point to as a relative link.

      Example: if the downloaded file /foo/doc.html links to /bar/img.gif, also downloaded, then the link in doc.html will be modified to point to ‘../bar/img.gif’. This kind of transformation works reliably for arbitrary combinations of directories.

    • The links to files that have not been downloaded by Wget will be changed to include host name and absolute path of the location they point to.

      Example: if the downloaded file /foo/doc.html links to /bar/img.gif (or to ../bar/img.gif), then the link in doc.html will be modified to point to http://hostname/bar/img.gif.

    Because of this, local browsing works reliably: if a linked file was downloaded, the link will refer to its local name; if it was not downloaded, the link will refer to its full Internet address rather than presenting a broken link. The fact that the former links are converted to relative links ensures that you can move the downloaded hierarchy to another directory.

    Note that only at the end of the download can Wget know which links have been downloaded. Because of that, the work done by ‘-k’ will be performed at the end of all the downloads.

    ‘-K’, ‘--backup-converted’

    When converting a file, back up the original version with a ‘.orig’ suffix. Affects the behavior of ‘-N’.

    ‘-nH’, ‘--no-host-directories’

    Disable generation of host-prefixed directories. By default, invoking Wget with ‘-r http://fly.srk.fer.hr/’ will create a structure of directories beginning with fly.srk.fer.hr/. This option disables such behavior.

    ‘-p’, ‘--page-requisites’

    This option causes Wget to download all the files that are necessary to properly display a given html page. This includes such things as inlined images, sounds, and referenced stylesheets.
    Ordinarily, when downloading a single html page, any requisite documents that may be needed to display it properly are not downloaded. Using ‘-r’ together with ‘-l’ can help, but since Wget does not ordinarily distinguish between external and inlined documents, one is generally left with “leaf documents” that are missing their requisites.

    For instance, say document 1.html contains an <IMG> tag referencing 1.gif and an <A> tag pointing to external document 2.html. Say that 2.html is similar but that its image is 2.gif and it links to 3.html. Say this continues up to some arbitrarily high number.

    If one executes the command:

    wget -r -l 2 http://site/1.html

    then 1.html, 1.gif, 2.html, 2.gif, and 3.html will be downloaded. As you can see, 3.html is without its requisite 3.gif because Wget is simply counting the number of hops (up to 2) away from 1.html in order to determine where to stop the recursion. However, with this command:

    wget -r -l 2 -p http://site/1.html

    all the above files and 3.html’s requisite 3.gif will be downloaded. Similarly,

    wget -r -l 1 -p http://site/1.html

    will cause 1.html, 1.gif, 2.html, and 2.gif to be downloaded. One might think that:

    wget -r -l 0 -p http://site/1.html

    would download just 1.html and 1.gif, but unfortunately this is not the case, because ‘-l 0’ is equivalent to ‘-l inf’—that is, infinite recursion. To download a single html page (or a handful of them, all specified on the command-line or in a ‘-i’ URL input file) and its (or their) requisites, simply leave off ‘-r’ and ‘-l’:

    wget -p http://site/1.html

    Note that Wget will behave as if ‘-r’ had been specified, but only that single page and its requisites will be downloaded. Links from that page to external documents will not be followed. Actually, to download a single page and all its requisites (even if they exist on separate websites), and make sure the lot displays properly locally, this author likes to use a few options in addition to ‘-p’:

    wget -E -H -k -K -p http://site/document

    To finish off this topic, it’s worth knowing that Wget’s idea of an external document link is any URL specified in an <A> tag, an <AREA> tag, or a <LINK> tag other than <LINK REL="stylesheet">.

    ‘-N’, ‘--timestamping’

    Turn on time-stamping.

    ‘-F’, ‘--force-html’

    When input is read from a file, force it to be treated as an HTML file. This enables you to retrieve relative links from existing HTML files on your local disk, by adding <base href="url"> to HTML, or using the ‘--base’ command-line option.
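
    For example (the file name here is an illustrative assumption), a page saved to your local disk can have its relative links resolved against the original site:

    wget --force-html --base=http://www.example.com/ -i saved-page.html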

    ‘-P prefix’, ‘--directory-prefix=prefix’

    Set directory prefix to prefix. The directory prefix is the directory where all other files and subdirectories will be saved to, i.e. the top of the retrieval tree. The default is ‘.’ (the current directory).

    Appendix: Notes from the crontab Manual (section 5)

    The following notes are from the crontab manual, as they relate to scheduling tasks such as the execution of WGET scripts to mirror web sites.

    [...] lines in a user crontab have five fixed fields plus a command in the form:

    minute hour day-of-month month day-of-week command

    Fields are separated by blanks or tabs. The command may be one or more fields long. The allowed values for the fields are:

    Field           Allowed Values
    minute          * or 0-59
    hour            * or 0-23
    day-of-month    * or 1-31
    month           * or 1-12, or name (first three letters, case-insensitive)
    day-of-week     * or 0-7 (0 and 7 are both Sunday), or name (first three letters, case-insensitive)
    command         the command to run (the remainder of the line)

