
RWDL: Fast, Reliable Bulk File Downloader for Web Directories

Published: July 30, 2025
Categories: Web Scraping, CLI Tool, Python, Downloader

[Screenshot: Recursive Web Directory Downloader CLI]

Have you ever needed to download dozens or even thousands of files from a web directory only to find that doing it by hand takes ages? I have. That's why I built Recursive Web Directory Downloader (rwdl): a simple CLI tool that makes downloading in bulk from sites like archive.org, software mirrors, or any Apache-style directory listing fast and easy.

Why I Built RWDL

The inspiration for rwdl came from my own frustration. I often needed to mirror large collections of files from archive.org, pull down research datasets, and keep up-to-date ParrotOS ISOs. Existing tools were either too complex, too slow, or didn't handle recursion and filtering the way I wanted (or at least I hadn't found one that did). I wanted something that:

  • Just works with a single command
  • Lets me filter by file extension (e.g., only .iso or .exe)
  • Recursively grabs files while preserving directory structure
  • Skips files I already have (for resuming interrupted downloads)
  • Is cross-platform and easy to install

So I built rwdl for my needs, and it's open source for everyone! πŸŽ‰

What Does RWDL Do?

rwdl is a command-line Python script that:

  • Recursively traverses web directories (with configurable depth)
  • Downloads only the file types you want (by extension)
  • Skips navigation and non-file links
  • Mirrors the remote directory structure locally
  • Avoids re-downloading files you already have
  • Lets you control the delay between requests (so you don't bombard servers)
  • Works on Windows, Linux, and macOS (or any toaster that can run Python)

It's perfect for archiving, mirroring, or just grabbing a bunch of files from any site that lists them in a directory format.

How to Use RWDL

Installation

You'll need Python 3.8+ and git. Clone the repo and install dependencies:

Bash
git clone https://github.com/4ngel2769/rwdl.git
cd rwdl
pip install -r requirements.txt

Basic Usage

Download all .pdf and .epub files from a directory (depth 1):

Bash
python rwdl.py \
  --url https://archive.org/download/somecollection/ \
  --extension .pdf,.epub \
  --output ./ebooks \
  --depth 1

Full Options

Bash
python rwdl.py \
  --url https://example.com/files/ \    # Base URL to start downloading from
  --depth 3 \                           # Recursion depth (0=base only)
  --extension .torrent,.iso \           # File extensions to download
  --output ./downloads \                # Output directory
  --delay 0.5                           # Optional delay between requests

Arguments

Argument     Short  Required  Default      Description
--url        -u     Yes       -            Base URL to start downloading from
--extension  -e     Yes       -            Comma-separated file extensions to download
--depth      -d     No        1            Recursion depth (0=base only)
--output     -o     No        ./downloads  Output base directory
--delay             No        0.5          Delay between requests in seconds
--help       -h     No        -            Show help message
--version    -v     No        -            Show version and exit
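
For context, here's roughly how that argument set could be wired up with argparse. The flags and defaults follow the table above, but treat it as a sketch; the helper name and the version string are placeholders, not the actual rwdl.py implementation.

Python
import argparse

def build_parser():
    # Option names and defaults mirror the table above; the real rwdl.py may differ.
    parser = argparse.ArgumentParser(
        prog="rwdl",
        description="Recursively download files from web directory listings.",
    )
    parser.add_argument("-u", "--url", required=True,
                        help="Base URL to start downloading from")
    parser.add_argument("-e", "--extension", required=True,
                        help="Comma-separated file extensions, e.g. .pdf,.epub")
    parser.add_argument("-d", "--depth", type=int, default=1,
                        help="Recursion depth (0=base only)")
    parser.add_argument("-o", "--output", default="./downloads",
                        help="Output base directory")
    parser.add_argument("--delay", type=float, default=0.5,
                        help="Delay between requests in seconds")
    parser.add_argument("-v", "--version", action="version",
                        version="%(prog)s")  # placeholder version string
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    extensions = [ext.strip() for ext in args.extension.split(",")]
    print(args.url, extensions, args.depth, args.output, args.delay)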

How RWDL Works (Under the Hood)

RWDL uses a breadth-first search (BFS) algorithm to traverse web directories. It parses each directory page, finds all links, and queues up subdirectories and files for processing. It only downloads files that match your specified extensions, and it decodes URL-encoded filenames (so %20 becomes a space, etc.) for proper local saving.

Here's a simplified version of the core logic:

rwdl.py (Python)
from collections import deque
import os
import urllib.parse

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin


def is_valid_extension(filename, extensions):
    return any(filename.endswith(ext) for ext in extensions)


def parse_directory(url):
    resp = requests.get(url)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, 'html.parser')
    return [a['href'] for a in soup.select('a[href]')]


def main():
    base_url = "https://example.com/files/"
    extensions = [".pdf", ".epub"]
    output_dir = "./downloads"
    max_depth = 1

    os.makedirs(output_dir, exist_ok=True)
    queue = deque([(base_url, 0, output_dir)])
    visited = set()

    while queue:
        url, depth, local_base = queue.popleft()
        if url in visited:
            continue
        visited.add(url)

        for link in parse_directory(url):
            # Skip sort/query links and anything outside the base URL
            # (parent-directory links, links to other hosts).
            if '?' in link:
                continue
            abs_url = urljoin(url, link)
            if not abs_url.startswith(base_url):
                continue

            if abs_url.endswith('/'):
                # Subdirectory: mirror it locally and queue it if depth allows
                if depth >= max_depth:
                    continue
                dir_name = urllib.parse.unquote(os.path.basename(abs_url.rstrip('/')))
                new_local = os.path.join(local_base, dir_name)
                os.makedirs(new_local, exist_ok=True)
                queue.append((abs_url, depth + 1, new_local))
            else:
                # File: download it if the extension matches and we don't already have it
                filename = urllib.parse.unquote(os.path.basename(abs_url))
                if is_valid_extension(filename, extensions):
                    local_path = os.path.join(local_base, filename)
                    if not os.path.exists(local_path):
                        with requests.get(abs_url, stream=True) as r:
                            r.raise_for_status()
                            with open(local_path, 'wb') as f:
                                for chunk in r.iter_content(8192):
                                    f.write(chunk)


if __name__ == "__main__":
    main()

Key Features

  • URL decoding: Handles spaces and special characters in filenames (see the short example after this list)
  • Smart filtering: Skips navigation and non-file links
  • Resumable: Skips files you already have
  • Cross-platform: Works anywhere Python does
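
The URL-decoding part is all standard library; the filenames below are made up, but they show what happens to percent-encoded names from a listing:

Python
import urllib.parse

# Percent-encoded names from a listing become normal filenames on disk
print(urllib.parse.unquote("My%20Report%20%282024%29.pdf"))  # My Report (2024).pdf
print(urllib.parse.unquote("caf%C3%A9_menu.epub"))           # café_menu.epub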

Output Structure

RWDL mirrors the remote directory structure locally. For example:

downloads/
β”œβ”€β”€ folder1/
β”‚   β”œβ”€β”€ file1.pdf
β”‚   └── file2.pdf
β”œβ”€β”€ folder2/
β”‚   └── nested/
β”‚       └── file3.epub
└── base_file.pdf

Building RWDL: The Approach

I built RWDL with these principles:

  • Simplicity: One script, minimal dependencies (requests, beautifulsoup4)
  • Reliability: Handles network errors gracefully, skips broken links
  • Transparency: Prints progress and what it's doing
  • Extensibility: Easy to add new features or tweak for your own needs

The hardest part was handling all the weird edge cases in directory listings and making sure filenames were always saved correctly, even with spaces or Unicode characters.
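
On the reliability point, "handles network errors gracefully" mostly comes down to wrapping every request and moving on when something breaks. A minimal sketch of that idea (not the exact code in rwdl.py):

Python
import requests

def fetch_page(url, timeout=15):
    # Return the page HTML, or None on timeouts, broken links, or HTTP errors.
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()
        return resp.text
    except requests.RequestException as exc:
        print(f"[skip] {url}: {exc}")
        return None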

Conclusion

RWDL is my go-to tool for bulk downloading from web directories. Whether you're archiving research papers, grabbing ISOs, or mirroring open data, it saves you time and hassle. Give it a tryβ€”and feel free to contribute or suggest features!

Ready to download? Check out the GitHub repository and start mirroring your favorite web directories!
