Sunday, June 09, 2019

wget on Windows

Overview

This post documents the steps I used to download all image (JPG), PDF and regular HTML files from a website with a single command (wget) instead of a web browser.

Installation

Use Chocolatey (https://chocolatey.org/). Follow the installation instructions at https://chocolatey.org/install

Then open a command prompt with administrative rights to install wget:
choco install wget
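
To confirm that the install worked, a quick check along these lines can help. This is a minimal sketch that assumes a classic command prompt (cmd.exe), opened after the install finished so that the updated PATH is picked up. In Windows PowerShell 5.1, wget is a built-in alias for Invoke-WebRequest, so call wget.exe explicitly there.

```bat
rem Print the version banner of the installed wget.
rem The first line of output should start with "GNU Wget".
wget --version
```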

Usage

My target website (say abc.com) is protected by HTTP Basic authentication. I am only interested in downloading files with the extensions *.jpg, *.pdf and *.html. First, create a directory to hold the downloaded files, e.g. c:\abc. Then run the commands below:
cd c:\abc 
wget --user-agent="Googlebot/2.1 (+https://www.googlebot.com/bot.html)" --http-user=user123 --http-password=coder4life -A "*.jpg,*.html,*.pdf" -r -l 0 https://www.abc.com/folder123/
where

--user-agent = User-agent string that tells the target web server what kind of client/browser is connecting. If not specified, wget identifies itself as "Wget/<version>", which some web servers block

--http-user = Username for HTTP Basic authentication

--http-password = Password for HTTP Basic authentication (passed in plain text on the command line)

-A = Accept list: a comma-separated list of file name suffixes or patterns to download. Files that do not match are not kept

-r = Tells wget to recursively get files (search the website for all possible paths/files)

-l = Maximum recursion depth, i.e. how many levels of links wget follows from the starting URL. The default is 5, meaning that from https://www.abc.com/folder123/, wget follows links up to 5 levels deep and then stops looking. The command above uses the value 0, which means "infinite" (until all reachable paths are traversed)
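
The same download can also be written with wget's long option names, which makes the command easier to read later. The sketch below is a variant, not the command I originally ran. It makes two changes that are assumptions on my part: --ask-password prompts for the password instead of putting it on the command line (assuming the installed build supports interactive prompts), and --no-parent keeps wget from climbing above /folder123/. The user name, URL and directory are the same placeholder values as above.

```bat
rem Change to the download directory (/d also switches drives if needed).
cd /d c:\abc

rem Long-option variant of the command above; ^ continues the line in cmd.exe.
rem --ask-password and --no-parent are additions, not part of the original command.
wget ^
  --user-agent="Googlebot/2.1 (+https://www.googlebot.com/bot.html)" ^
  --http-user=user123 ^
  --ask-password ^
  --accept="*.jpg,*.html,*.pdf" ^
  --recursive ^
  --level=inf ^
  --no-parent ^
  https://www.abc.com/folder123/
```

Here --level=inf is equivalent to a level of 0. By default, wget saves the files under a directory named after the host (here www.abc.com) inside c:\abc.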