Wget and User-Agent Header

Submitted on June 29, 2009 – 12:41 pm

As you may already know, Wget is a popular command-line downloader and Web crawler, particularly in the Unix world. You can read more about Wget in one of my earlier posts on the subject. One issue with Wget is that some sites block it from accessing their content. This is usually done by adding a Wget rule to the robots.txt file on the Web server and by configuring the server to reject requests whose User-Agent header contains “wget”.
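
For context, the robots.txt part of such a block is usually nothing more than a rule matching Wget’s robot name, which Wget honors by default. The snippet below is a hypothetical example, not taken from any particular site:

User-agent: Wget
Disallow: /

The user-agent filtering happens in the Web server configuration itself. On Apache 2.x, for example, a sketch of the idea (placed inside the relevant <Directory> or <Location> block; the “block_wget” variable name is just an illustration) might look like this:

SetEnvIfNoCase User-Agent "wget" block_wget
Order allow,deny
Allow from all
Deny from env=block_wget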

There are a couple of things you can do to circumvent these roadblocks. The robots.txt issue is dealt with simply by adding the “-e robots=off” option to your Wget command line, as shown in the example below.

wget -m -k "http://www.gnu.org/software/wget/" -e robots=off

A custom user-agent string can be set with Wget using the “-U” option:

wget -m -k -U "Mozilla/5.0 (X11; U; Linux i686; pl-PL; rv:1.9.0.2) Gecko/20121223 Ubuntu/9.25 (jaunty) Firefox/3.8" "http://www.gnu.org/software/wget/" -e robots=off
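
If you want to confirm which User-Agent header is actually being sent, Wget’s debug output includes the raw HTTP request. Something along these lines should work as a quick sanity check (the grep is only there to pull the header line out of the debug output):

wget -d -O /dev/null -U "Mozilla/5.0 (X11; U; Linux i686; pl-PL; rv:1.9.0.2) Gecko/20121223 Ubuntu/9.25 (jaunty) Firefox/3.8" "http://www.gnu.org/software/wget/" 2>&1 | grep "User-Agent"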

Let’s say you want to use Wget to download a list of URLs, going through a different proxy server and using a different user-agent header for each URL. You will need a list of proxy servers that looks similar to this:

202.175.3.112:80
119.40.99.2:8080
193.37.152.236:3128
83.2.83.44:8080
151.11.83.170:80
119.70.40.101:8080
208.78.125.18:80

You will also need a list of URLs to be downloaded, one URL per line, and a list of real user-agent strings, looking something like this:

Mozilla/5.0 (X11; U; Linux i686; pl-PL; rv:1.9.0.2) Gecko/20121223 Ubuntu/9.25 (jaunty) Firefox/3.8
Mozilla/5.0 (X11; U; Linux i686; pl-PL; rv:1.9.0.2) Gecko/2008092313 Ubuntu/9.25 (jaunty) Firefox/3.8
Mozilla/5.0 (X11; U; Linux i686; it-IT; rv:1.9.0.2) Gecko/2008092313 Ubuntu/9.25 (jaunty) Firefox/3.8
Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2a1pre) Gecko/20090428 Firefox/3.6a1pre
Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2a1pre) Gecko/20090405 Firefox/3.6a1pre
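
To see how the pieces fit together before automating anything, a single manual download through one of the proxies above, with one of the user-agent strings above, might look like this (the proxy address and user-agent string are simply sample entries from the lists; substitute your own):

export http_proxy="http://202.175.3.112:80"
wget -m -k -e robots=off -U "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2a1pre) Gecko/20090428 Firefox/3.6a1pre" "http://www.gnu.org/software/wget/"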

The script below will pick a random proxy server from proxy_list.txt and a random user-agent string from user_agents.txt for each URL in url_list.txt, hopefully making the target Web server believe that you are hundreds of different users from all over the world.

#!/bin/bash

proxies_total=$(wc -l < proxy_list.txt)
user_agents_total=$(wc -l < user_agents.txt)

while read -r url
do
    # Select a random proxy server: prefix every proxy with a random
    # number, sort numerically, keep the first line and strip the prefix
    proxy_server_random=$(while read -r proxy_server
    do
        echo "$((RANDOM % proxies_total)):$proxy_server"
    done < proxy_list.txt | sort -n | sed 's/^[0-9]*://' | head -1)

    # Point Wget at the selected proxy via the http_proxy variable
    export http_proxy="http://$proxy_server_random"

    # Select a random user-agent from user_agents.txt the same way
    user_agent_random=$(while read -r user_agent
    do
        echo "$((RANDOM % user_agents_total)):$user_agent"
    done < user_agents.txt | sort -n | sed 's/^[0-9]*://' | head -1)

    # Download the URL
    echo "Downloading $url"
    echo "Proxy server: $proxy_server_random"
    echo "User agent: $user_agent_random"

    wget -q -e use_proxy=on -U "$user_agent_random" "$url"
done < url_list.txt
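
If your system has GNU coreutils, the two random-selection loops above can be replaced with shuf, which picks a random line directly; this is just an optional simplification of the same idea:

proxy_server_random=$(shuf -n 1 proxy_list.txt)
user_agent_random=$(shuf -n 1 user_agents.txt)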


Comments

  • Cpt Excelsior says:

    Is there a way to see a list of all the directories in the current directory I am in? Say I decide to do mkdir test. Is there a command that can show that directory in a list, so I can see the “test” directory along with all the others?

    Also, what is the difference between yum and wget?

  • Andre says:

    I set up a virtual server running Ubuntu Server. I tried apt-get, curl, wget, ftp and scp, and none of them will download any packages. How can I get packages installed? It doesn’t even have vi.

  • Alex says:

    My upload speed is very slow; downloading a file and then uploading it onto my website would take too much time and bandwidth. Is there any way I can download a file from one website directly onto my own website? My web host doesn’t allow me SSH/telnet access, so I can’t use wget. Any other ideas?
    I don’t think FXP will work either, unless you can somehow do HTTP to FTP…
    I can’t log onto my web server for a command prompt, such as telnet or ssh. How else would you suggest I get my web server to download a file from another website?

  • Harriet W says:

    When was wget (the Linux download tool) written?

  • kiltakblog says:

    The software should measure the web pages downloaded by the browsers and any other content (movies, etc.) that I may download, either using torrents or something like wget…
    Essentially, it should display how much data I have downloaded in total…

  • RuMKilleR says:

    I have a problem with wget: the links are relative and I want them to be absolute.
    I mean I am caching recursively, and I want a link that appears in the page as, let’s say, ../index.php to become:
    http://www.original_domain/index.php
