Recursive File Downloading with Wget
Wget is a powerful, non-interactive network downloader. It’s commonly used for mirroring websites or downloading large files from the command line. A particularly useful feature is its ability to recursively download entire directory structures. This tutorial will cover how to use Wget to download directories and their contents, maintaining the original directory structure locally.
Basic Recursive Downloading
The core option for recursive downloading is -r or --recursive. This instructs Wget to follow links found on the initial page and download the linked resources.
wget -r http://example.com/directory/
This command downloads the page at http://example.com/directory/ and then follows any links it finds to other pages or files, downloading those as well. However, this simple command may not be enough to download a directory tree as intended: Wget can follow links up into parent directories, and it will fetch unnecessary files such as the index.html pages that servers generate for directory listings.
Controlling Recursion Depth and Parent Directories
To prevent Wget from ascending to parent directories, use the --no-parent option. This ensures it only downloads files and directories within the specified starting directory.
wget -r --no-parent http://example.com/directory/
You can also control the maximum recursion depth with the -l or --level option, which limits how many levels of subdirectories Wget will traverse. For example, to download only files and directories one level deep:
wget -r -l 1 --no-parent http://example.com/directory/
Excluding Files and Patterns
Web servers often generate index.html files for directory listings, and these are usually not what you want to keep. To exclude specific files or patterns, use the -R or --reject option, which accepts wildcards for pattern matching.
wget -r --no-parent -R "index.html*" http://example.com/directory/
This command excludes any files whose names start with index.html.
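Conversely, when you only want certain file types, the -A or --accept option keeps only files matching a comma-separated list of patterns. As an illustrative sketch (the URL and extensions here are placeholders), to download only PDF and ZIP files:
wget -r --no-parent -A "*.pdf,*.zip" http://example.com/directory/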
Mirroring Entire Websites or Directories
For a more complete mirroring experience, use the -m or --mirror option. It is shorthand for several other options: it enables recursion with infinite depth (-r -l inf), turns on timestamping (-N) so repeated runs only re-download files that have changed, and keeps FTP directory listings (--no-remove-listing). Note that rewriting links for offline browsing is a separate option, -k or --convert-links.
wget -m http://example.com/directory/
This is often the most convenient option for downloading an entire directory structure, preserving the original website’s structure.
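Under the hood, -m is equivalent to spelling those options out yourself:
wget -r -N -l inf --no-remove-listing http://example.com/directory/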
Handling robots.txt
Websites often use a robots.txt file to instruct web crawlers (including Wget) which parts of the site should not be accessed. By default, Wget respects this file. If you need to ignore it (for example, if you’re downloading a site for local testing), you can disable robots.txt processing with the -e robots=off option. Be mindful of website terms of service and respect robots.txt whenever possible.
wget -e robots=off -m http://example.com/directory/
Adjusting Directory Structure Locally
Sometimes you want to control how the downloaded files are organized locally. The --cut-dirs option removes a specified number of leading directory components from the remote path when building the local structure. For example:
wget -r --no-parent --cut-dirs=2 http://example.com/a/b/c/
If the remote directory structure is /a/b/c/data/files, then --cut-dirs=2 strips the /a and /b components, so the files end up under example.com/c/data/files locally. Note that Wget still creates a top-level directory named after the host by default; see the variant below for removing it as well.
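If you also want to drop that hostname directory, combine --cut-dirs with -nH (--no-host-directories). For instance, to place the data/files tree directly in the current directory, cut all three leading components:
wget -r --no-parent -nH --cut-dirs=3 http://example.com/a/b/c/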
User Agent Spoofing
Some websites might block requests from Wget based on its default user agent. You can spoof the user agent to appear as a regular web browser:
wget --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36" http://example.com/
This can help bypass such restrictions, but use it only where the site’s terms of service permit automated downloading.
By combining these options, you can precisely control how Wget downloads files and directories, ensuring that you get the exact content you need in the desired structure.
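As a final sketch that combines the options covered above (the URL is a placeholder), a typical directory mirror might look like:
wget -m --no-parent -R "index.html*" --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36" http://example.com/directory/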