Recursive File Downloading with Wget
Wget is a powerful, non-interactive network downloader. It’s commonly used for mirroring websites or downloading large files from the command line. A particularly useful feature is its ability to recursively download entire directory structures. This tutorial will cover how to use Wget to download directories and their contents, maintaining the original directory structure locally.
Basic Recursive Downloading
The core option for recursive downloading is -r or --recursive. This instructs Wget to follow links found on the initial page and download the linked resources.
wget -r http://example.com/directory/
This command downloads the page at http://example.com/directory/ and then follows any links it finds to other pages or files, downloading those as well. However, this simple command may not be enough to download a directory tree as intended: Wget can follow links up into parent directories, and it will fetch unnecessary files such as the index.html pages that servers generate for directory listings.
Controlling Recursion Depth and Parent Directories
To prevent Wget from ascending to parent directories, use the --no-parent option. This ensures it only downloads files and directories within the specified starting directory.
wget -r --no-parent http://example.com/directory/
You can also control the maximum recursion depth with the -l or --level option, which limits how many levels of subdirectories Wget will traverse. For example, to download only files and directories one level deep:
wget -r -l 1 --no-parent http://example.com/directory/
Excluding Files and Patterns
Web servers often generate index.html files for directory listings, and these are usually not what you want to keep. To exclude specific files or patterns, use the -R or --reject option, which accepts wildcards for pattern matching.
wget -r --no-parent -R "index.html*" http://example.com/directory/
This command excludes any files whose names start with index.html.
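Conversely, when you only want certain file types, the -A or --accept option keeps only files matching a comma-separated list of patterns. As an illustrative sketch (the URL and extensions here are placeholders), to download only PDF and ZIP files:
wget -r --no-parent -A "*.pdf,*.zip" http://example.com/directory/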
Mirroring Entire Websites or Directories
For a more complete mirroring experience, use the -m or --mirror option. It is shorthand for several other options: it enables recursion with infinite depth (-r -l inf), turns on timestamping (-N) so repeated runs only re-download files that have changed, and keeps FTP directory listings (--no-remove-listing). Note that rewriting links for offline browsing is a separate option, -k or --convert-links.
wget -m http://example.com/directory/
This is often the most convenient option for downloading an entire directory structure, preserving the original website’s structure.
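Under the hood, -m is equivalent to spelling those options out yourself:
wget -r -N -l inf --no-remove-listing http://example.com/directory/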
Handling robots.txt
Websites often use a robots.txt file to instruct web crawlers (including Wget) which parts of the site should not be accessed. By default, Wget respects this file. If you need to ignore it (for example, if you’re downloading a site for local testing), you can disable robots.txt processing with the -e robots=off option. Be mindful of website terms of service and respect robots.txt whenever possible.
wget -e robots=off -m http://example.com/directory/
Adjusting Directory Structure Locally
Sometimes you want to control how the downloaded files are organized locally. The --cut-dirs option removes a specified number of leading directory components from the remote path when building the local structure. For example:
wget -r --no-parent --cut-dirs=2 http://example.com/a/b/c/
If the remote directory structure is /a/b/c/data/files, then --cut-dirs=2 strips the /a and /b components, so the files end up under example.com/c/data/files locally. Note that Wget still creates a top-level directory named after the host by default; see the variant below for removing it as well.
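If you also want to drop that hostname directory, combine --cut-dirs with -nH (--no-host-directories). For instance, to place the data/files tree directly in the current directory, cut all three leading components:
wget -r --no-parent -nH --cut-dirs=3 http://example.com/a/b/c/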
User Agent Spoofing
Some websites might block requests from Wget based on its default user agent. You can spoof the user agent to appear as a regular web browser:
wget --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36" http://example.com/
This can help bypass such restrictions, but use it only where the site’s terms of service permit automated downloading.
By combining these options, you can precisely control how Wget downloads files and directories, ensuring that you get the exact content you need in the desired structure.
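As a final sketch that combines the options covered above (the URL is a placeholder), a typical directory mirror might look like:
wget -m --no-parent -R "index.html*" --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36" http://example.com/directory/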