Overview
This post documents the steps I use to download all image (JPG) files, along with PDF and regular HTML files, from a website with a single command (wget) instead of a web browser.
Installation
Use Chocolatey (https://chocolatey.org/). Follow the installation instructions at https://chocolatey.org/install
Then open a command prompt with administrative rights and install wget:
choco install wget
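To confirm the install worked, a small check along these lines can help. This is only a sketch: it assumes Python is available on the machine, and the helper names find_wget and wget_version are made up for illustration.

```python
import shutil
import subprocess


def find_wget(name="wget"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)


def wget_version(path):
    """Return the first line printed by `wget --version` (empty if none)."""
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


path = find_wget()
if path is None:
    print("wget was not found on PATH")
else:
    print(path)
    print(wget_version(path))
```

Running `wget --version` directly in the command prompt gives the same information without Python.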
Usage
My target website (say abc.com) is protected by BASIC authentication, and I am only interested in files with the extensions *.jpg, *.pdf and *.html. First I create a directory to hold the downloaded files, e.g. c:\abc. Then I run the commands below:
cd c:\abc
wget --user-agent="Googlebot/2.1 (+https://www.googlebot.com/bot.html)" --http-user=user123 --http-password=coder4life -A "*.jpg,*.html,*.pdf" -r -l 0 https://www.abc.com/folder123/
where
--user-agent = User agent string that tells the target web server what kind of client/browser is connecting. If not specified, wget sends "Wget/<version>", which some web servers block
--http-user = BASIC username
--http-password = BASIC password (passed in plain text on the command line)
-A = Comma-separated accept list; only files whose names match these patterns are kept
-r = Tells wget to retrieve files recursively (follow the links on each page to find further paths/files)
-l = Maximum recursion depth, i.e. how many levels of links wget follows from the starting URL. The default is 5, so if each folder page links only to the next one down, wget starting at https://www.abc.com/folder123/ reaches /folder123/1/2/3/4/5 and stops there. The command above uses 0, which means "infinite" (until all reachable paths are traversed)
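If the same download has to be repeated for other folders or file types, the options above can be assembled in a script. The following is a minimal sketch, assuming Python is installed and wget is on PATH; build_wget_command and run_download are hypothetical helper names, and the URL and credentials are the placeholder values from this post.

```python
import subprocess


def build_wget_command(url, user, password, extensions, depth=0,
                       user_agent=None):
    """Build the wget argument list for a recursive, filtered download.

    extensions: e.g. ["jpg", "html", "pdf"], joined into the -A accept list.
    depth: value passed to -l (0 means no depth limit).
    """
    accept = ",".join(f"*.{ext}" for ext in extensions)
    cmd = ["wget"]
    if user_agent:
        cmd.append(f"--user-agent={user_agent}")
    cmd += [
        f"--http-user={user}",
        f"--http-password={password}",
        "-A", accept,
        "-r",
        "-l", str(depth),
        url,
    ]
    return cmd


def run_download(cmd, target_dir):
    """Run wget inside target_dir (the script equivalent of `cd` first)."""
    return subprocess.run(cmd, cwd=target_dir, check=False).returncode


# Placeholder values taken from the example command in this post.
example = build_wget_command(
    "https://www.abc.com/folder123/",
    "user123",
    "coder4life",
    ["jpg", "html", "pdf"],
    user_agent="Googlebot/2.1 (+https://www.googlebot.com/bot.html)",
)
print(example)
```

Calling run_download(example, r"c:\abc") would then start the download in that directory. Passing the arguments as a list avoids quoting problems with the user agent string, but the password is still visible to anything that can see the process command line.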