
Wget examples and scripts

Submitted on November 27, 2005 – 3:19 pm

Wget is a non-interactive, command-line download tool for Unix and Windows. Wget can download Web pages and files; it can submit form data and follow links; and it can mirror entire Web sites to make local copies. Wget is one of the most useful applications you will ever install on your computer, and it is free.

You can download the latest version of Wget from the developer's home page. Precompiled versions of Wget are available for Windows and for most flavors of Unix. Many Unix operating systems have Wget pre-installed, so type which wget to see if you already have it.

Wget supports a multitude of options and parameters, which can be confusing to people unfamiliar with it. You can view the available options by typing wget --help or, on a Unix box, by typing man wget.
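
For example, a quick way to check whether Wget is already installed and which version you have:

# Locate the Wget binary and print its version
which wget
wget --version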

Here are a few useful examples on how to use Wget:

1) Download the main page of Yahoo.com and save it as yahoo.htm

wget -O yahoo.htm http://www.yahoo.com

2) Use Wget with an HTTP firewall:

Set proxy in Korn or Bash shells

export http_proxy=proxy_server.domain:port_number

Set proxy in C-shell

setenv http_proxy proxy_server.domain:port_number

Run Wget through a proxy that does not require authentication

wget -Y on -O yahoo.htm "http://www.yahoo.com"

Run Wget through a proxy that requires authentication

wget -Y on --proxy-user=your_username --proxy-passwd=your_password -O yahoo.htm "http://www.yahoo.com"
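
Putting the proxy pieces together in Korn or Bash might look like the following sketch; the proxy host, port, and credentials are placeholders:

# Point Wget at the HTTP proxy (placeholder host and port)
export http_proxy="http://proxy.example.com:8080"

# Fetch the page through the proxy, supplying credentials if the proxy requires them
wget -Y on --proxy-user=your_username --proxy-passwd=your_password -O yahoo.htm "http://www.yahoo.com"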

3) Make a local mirror of the Wget home page that you can browse from your hard drive

Here are the options we will use:

-m to mirror the site
-k to make all links local
-D to stay within the specified domain
--follow-ftp to follow FTP links
-np not to ascend to the parent directory

The following two options help deal with Web sites that block automated download tools such as Wget:

-U to masquerade as a Mozilla browser
-e robots=off to ignore no-robots server directives

wget -U Mozilla -m -k -D gnu.org --follow-ftp -np "http://www.gnu.org/software/wget/" -e robots=off

4) Download all images from the Playboy site

Here are the options we will use:

-r for recursive download
-l 0 for unlimited levels
-t 1 for one download attempt per link
-nd not to create local directories
-A to download only files with specified extensions

wget -r -l 0 -U Mozilla -t 1 -nd -D playboy.com -A jpg,jpeg,gif,png "http://www.playboy.com" -e robots=off

5) Web image collector

The following Korn shell script reads from a list of URLs and downloads all images found anywhere on those sites. The images are processed, and all images smaller than a certain size are deleted. The remaining images are saved in a folder named after the URL. The url_list.txt file contains one URL per line.

This script was originally written to run under AT&T UWIN on Windows, but it will also work in any native Unix environment that has the Korn shell. A short usage example follows the script.

#!/bin/ksh
# WIC.ksh - Web Image Collector
#
# WIC reads from a list of URLs and spiders each site recursively
# downloading images that match specified criteria (type, size).

#-----------------------------------------------------------------
# ENVIRONMENT CONFIGURATION
#-----------------------------------------------------------------

WORKDIR="C:/Downloads"	# Working directory
cd "$WORKDIR"
OUTPUT="$WORKDIR/output"	# Final output directory
URLS="$WORKDIR/url_list.txt"	# List of URLs
WGET="/usr/bin/wget"		# Wget executable
SIZE="+7k"			# Minimum image size to keep

TMPDIR1="$WORKDIR/tmp1"		# Temporary directory 1
TMPDIR2="$WORKDIR/tmp2"		# Temporary directory 2
TMPDIR3="$WORKDIR/tmp3"		# Temporary directory 3

if [ ! -d "$WORKDIR" ]
then
	mkdir "$WORKDIR"
	if [ ! -d "$WORKDiR" ]
	then
		echo "Download directory not found. Exiting..."
		exit 1
	fi
fi

if [ ! -d "$OUTPUT" ]
then
	mkdir "$OUTPUT"
	if [ ! -d "$OUTPUT" ]
	then
		echo "Cannot create $OUTPUT directory. Exiting..."
		exit 1
	fi
fi

if [ ! -f "$URLS" ]
then
	echo "URL list not found in $WORKDIR. Exiting..."
	exit 1
fi

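# Create fresh, empty temporary directories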
for i in 1 2 3
do
	if [ -d "$WORKDIR/tmp$i" ]
	then
		rm -r "$WORKDIR/tmp$i"
		mkdir "$WORKDIR/tmp$i"
	else
		mkdir "$WORKDIR/tmp$i"
	fi
done

if [ ! -f "$WGET" ]
then
	echo "$WGET executable not found. Exiting..."
	exit 1
fi

#-----------------------------------------------------------------
# DOWNLOAD IMAGES
#-----------------------------------------------------------------

cat "$URLS" | while read URL
do
	echo "Processing $URL"

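	# Extract the domain name (the third '/'-separated field of the URL)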
	DOMAIN=$(echo "$URL" | awk -F'/' '{print $3}')

	if [ ! -d "$OUTPUT/$DOMAIN" ]
	then
		cd "$TMPDIR1"
		mkdir "$OUTPUT/$DOMAIN"
		$WGET --http-user=your_username --http-passwd=your_password -r -l 0 -U Mozilla -t 1 -nd -A jpg,jpeg,gif,png,pdf "$URL" -e robots=off
		find "$TMPDIR1" -type f -size "$SIZE" -exec mv {} "$OUTPUT/$DOMAIN" ;
		cd "$WORKDIR"
	else
		echo "	$URL already processed. Skipping..."
	fi

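	# Reset the temporary directories before processing the next URL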
	for i in 1 2 3
	do
		if [ -d "$WORKDIR/tmp$i" ]
		then
			rm -r "$WORKDIR/tmp$i"
			mkdir "$WORKDIR/tmp$i"
		else
			mkdir "$WORKDIR/tmp$i"
		fi
	done
done

#-----------------------------------------------------------------
# Remove empty download directories
#-----------------------------------------------------------------

cd "$OUTPUT"
find . -type d | fgrep "./" | while read DIR
do
	if [ `ls -A "$DIR" | wc -l | awk '{print $1}'` -eq 0 ]
	then
		rmdir "$DIR"
	fi
done

cd "$WORKDIR"

6) Wget options

GNU Wget 1.8.1+cvs, a non-interactive network retriever.
Usage: wget [OPTION]... [URL]...

Mandatory arguments to long options are mandatory for short options too.

Startup:
  -V,  --version           display the version of Wget and exit.
  -h,  --help              print this help.
  -b,  --background        go to background after startup.
  -e,  --execute=COMMAND   execute a `.wgetrc'-style command.

Logging and input file:
  -o,  --output-file=FILE     log messages to FILE.
  -a,  --append-output=FILE   append messages to FILE.
  -d,  --debug                print debug output.
  -q,  --quiet                quiet (no output).
  -v,  --verbose              be verbose (this is the default).
  -nv, --non-verbose          turn off verboseness, without being quiet.
  -i,  --input-file=FILE      download URLs found in FILE.
  -F,  --force-html           treat input file as HTML.
  -B,  --base=URL             prepends URL to relative links in -F -i file.
       --sslcertfile=FILE     optional client certificate.
       --sslcertkey=KEYFILE   optional keyfile for this certificate.
       --egd-file=FILE        file name of the EGD socket.

Download:
       --bind-address=ADDRESS   bind to ADDRESS (hostname or IP) on local host.
  -t,  --tries=NUMBER           set number of retries to NUMBER (0 unlimits).
  -O   --output-document=FILE   write documents to FILE.
  -nc, --no-clobber             don't clobber existing files or use .# suffixes.
  -c,  --continue               resume getting a partially-downloaded file.
       --progress=TYPE          select progress gauge type.
  -N,  --timestamping           don't re-retrieve files unless newer than local.
  -S,  --server-response        print server response.
       --spider                 don't download anything.
  -T,  --timeout=SECONDS        set the read timeout to SECONDS.
  -w,  --wait=SECONDS           wait SECONDS between retrievals.
       --waitretry=SECONDS      wait 1...SECONDS between retries of a retrieval.
       --random-wait            wait from 0...2*WAIT secs between retrievals.
  -Y,  --proxy=on/off           turn proxy on or off.
  -Q,  --quota=NUMBER           set retrieval quota to NUMBER.
       --limit-rate=RATE        limit download rate to RATE.

Directories:
  -nd  --no-directories            don't create directories.
  -x,  --force-directories         force creation of directories.
  -nH, --no-host-directories       don't create host directories.
  -P,  --directory-prefix=PREFIX   save files to PREFIX/...
       --cut-dirs=NUMBER           ignore NUMBER remote directory components.

HTTP options:
       --http-user=USER      set http user to USER.
       --http-passwd=PASS    set http password to PASS.
  -C,  --cache=on/off        (dis)allow server-cached data (normally allowed).
  -E,  --html-extension      save all text/html documents with .html extension.
       --ignore-length       ignore `Content-Length' header field.
       --header=STRING       insert STRING among the headers.
       --proxy-user=USER     set USER as proxy username.
       --proxy-passwd=PASS   set PASS as proxy password.
       --referer=URL         include `Referer: URL' header in HTTP request.
  -s,  --save-headers        save the HTTP headers to file.
  -U,  --user-agent=AGENT    identify as AGENT instead of Wget/VERSION.
       --no-http-keep-alive  disable HTTP keep-alive (persistent connections).
       --cookies=off         don't use cookies.
       --load-cookies=FILE   load cookies from FILE before session.
       --save-cookies=FILE   save cookies to FILE after session.

FTP options:
  -nr, --dont-remove-listing   don't remove `.listing' files.
  -g,  --glob=on/off           turn file name globbing on or off.
       --passive-ftp           use the "passive" transfer mode.
       --retr-symlinks         when recursing, get linked-to files (not dirs).

Recursive retrieval:
  -r,  --recursive          recursive web-suck -- use with care!
  -l,  --level=NUMBER       maximum recursion depth (inf or 0 for infinite).
       --delete-after       delete files locally after downloading them.
  -k,  --convert-links      convert non-relative links to relative.
  -K,  --backup-converted   before converting file X, back up as X.orig.
  -m,  --mirror             shortcut option equivalent to -r -N -l inf -nr.
  -p,  --page-requisites    get all images, etc. needed to display HTML page.

Recursive accept/reject:
  -A,  --accept=LIST                comma-separated list of accepted extensions.
  -R,  --reject=LIST                comma-separated list of rejected extensions.
  -D,  --domains=LIST               comma-separated list of accepted domains.
       --exclude-domains=LIST       comma-separated list of rejected domains.
       --follow-ftp                 follow FTP links from HTML documents.
       --follow-tags=LIST           comma-separated list of followed HTML tags.
  -G,  --ignore-tags=LIST           comma-separated list of ignored HTML tags.
  -H,  --span-hosts                 go to foreign hosts when recursive.
  -L,  --relative                   follow relative links only.
  -I,  --include-directories=LIST   list of allowed directories.
  -X,  --exclude-directories=LIST   list of excluded directories.
  -np, --no-parent                  don't ascend to the parent directory.
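
Many of these options combine well. For instance, the following sketch (the file names are placeholders) resumes partial downloads, limits bandwidth, reads URLs from a list, and logs the session to a file:

# Resume partial downloads, throttle to 50 KB/s, read URLs from a file, log to a file
wget -c --limit-rate=50k -i download_list.txt -o download.log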
