Wget and User-Agent Header
As you may already know, Wget is a popular command-line downloader and Web crawler, particularly in the Unix world. You can read more about Wget in one of my earlier posts on the subject. One issue with Wget is that some sites block it from accessing their content. This is usually done by listing Wget in robots.txt on the Web server and by configuring the server to reject requests whose User-Agent header contains “wget”.
There are a couple of things you can do to circumvent these roadblocks. The robots.txt issue is dealt with simply by adding the “-e robots=off” option to your Wget command line. Wget accepts options before or after the URL, so the option can go at the end, as shown in the example below.
wget -m -k "http://www.gnu.org/software/wget/" -e robots=off
A custom user-agent string can be set with Wget using the “-U” option, as shown in the example below.
wget -m -k -U "Mozilla/5.0 (X11; U; Linux i686; pl-PL; rv:1.9.0.2) Gecko/20121223 Ubuntu/9.25 (jaunty) Firefox/3.8" "http://www.gnu.org/software/wget/" -e robots=off
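If you use the same settings often, they can also live in a wgetrc file, since “-e” simply executes a wgetrc command. The sketch below is only an illustration: it writes a throwaway wgetrc to a temporary file (the variable name wgetrc_file is my own choice) with the wgetrc equivalents of “-e robots=off” and “-U”, and it does not download anything.

```shell
#!/bin/sh
# Write a private wgetrc to a temporary file.
# Assumption: mktemp is available (it is on most Unix-like systems).
wgetrc_file=$(mktemp)

cat > "$wgetrc_file" <<'EOF'
# Same effect as "-e robots=off" on the command line
robots = off
# Same effect as -U "..." on the command line
user_agent = Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2a1pre) Gecko/20090428 Firefox/3.6a1pre
EOF

echo "Wrote $wgetrc_file"
```

To use the file for a single run, point the WGETRC environment variable at it: WGETRC="$wgetrc_file" wget -m -k "http://www.gnu.org/software/wget/"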
Let’s say you wanted to use Wget to download a list of URLs going through different proxy servers and using different user-agent headers for each URL. You will need a list of proxy servers that looks similar to this:
202.175.3.112:80
119.40.99.2:8080
193.37.152.236:3128
83.2.83.44:8080
151.11.83.170:80
119.70.40.101:8080
208.78.125.18:80
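Proxy lists collected from the Web tend to contain stray lines, so it may be worth checking that every entry has the host:port shape shown above. Here is a minimal sketch, assuming one entry per line; the function name is_valid_proxy is hypothetical, and the check looks at the format only, not at whether the proxy actually works.

```shell
#!/bin/sh
# is_valid_proxy ENTRY: succeed if ENTRY looks like host:port.
# Shape check only; it does not test whether the proxy is reachable,
# and it does not verify that the port is within 1-65535.
is_valid_proxy() {
    printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9.-]+:[0-9]{1,5}$'
}

# Example: check a few sample entries, two of them deliberately malformed.
for entry in "202.175.3.112:80" "119.40.99.2:8080" "193.37.152.236" "bad entry:80"
do
    if is_valid_proxy "$entry"; then
        echo "ok:      $entry"
    else
        echo "skipped: $entry"
    fi
done
```

The same pattern can filter a whole file in one go: grep -E '^[A-Za-z0-9.-]+:[0-9]{1,5}$' proxy_list.txt > proxy_list.clean.txt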
You will need a list of URLs to be downloaded, one URL per line. And you will need a list of real user-agent strings (download one here), looking something like this:
Mozilla/5.0 (X11; U; Linux i686; pl-PL; rv:1.9.0.2) Gecko/20121223 Ubuntu/9.25 (jaunty) Firefox/3.8
Mozilla/5.0 (X11; U; Linux i686; pl-PL; rv:1.9.0.2) Gecko/2008092313 Ubuntu/9.25 (jaunty) Firefox/3.8
Mozilla/5.0 (X11; U; Linux i686; it-IT; rv:1.9.0.2) Gecko/2008092313 Ubuntu/9.25 (jaunty) Firefox/3.8
Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2a1pre) Gecko/20090428 Firefox/3.6a1pre
Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2a1pre) Gecko/20090405 Firefox/3.6a1pre
The script below will pick a random proxy server from proxy_list.txt for each URL in url_list.txt and will use a random user-agent string from user_agents.txt, hopefully making the target Web server believe that you are hundreds of different users from all over the world.
#!/bin/bash
# Count the entries in each list (one entry per line)
proxies_total=$(wc -l < proxy_list.txt)
user_agents_total=$(wc -l < user_agents.txt)

while IFS= read -r url
do
    # Select a random proxy server from proxy_list.txt
    proxy_server_random=$(sed -n "$(( RANDOM % proxies_total + 1 ))p" proxy_list.txt)

    # Set the shell HTTP proxy variable
    export http_proxy="$proxy_server_random"

    # Select a random user-agent from user_agents.txt
    user_agent_random=$(sed -n "$(( RANDOM % user_agents_total + 1 ))p" user_agents.txt)

    # Download the URL
    echo "Downloading $url"
    echo "Proxy server: $proxy_server_random"
    echo "User agent: $user_agent_random"
    wget -q --proxy=on -U "$user_agent_random" "$url"
done < url_list.txt
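Before pointing the script at a real site, you may want to see which proxy and user-agent it would pick. The sketch below is a dry run and not part of the script above: it swaps in shuf for the random selection (an assumption: shuf comes with GNU coreutils and may be missing elsewhere), works on small sample files it creates itself, and never calls Wget. The function name pick_random_line and the sample file names are hypothetical.

```shell
#!/bin/sh
# pick_random_line FILE: print one randomly chosen, non-empty line of FILE.
# Assumption: shuf (GNU coreutils) is installed.
pick_random_line() {
    grep -v '^[[:space:]]*$' "$1" | shuf -n 1
}

# Create small sample lists in a temporary directory for the dry run.
sample_dir=$(mktemp -d)
printf '%s\n' "202.175.3.112:80" "119.40.99.2:8080" "193.37.152.236:3128" \
    > "$sample_dir/proxy_list.txt"
printf '%s\n' \
    "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2a1pre) Gecko/20090428 Firefox/3.6a1pre" \
    "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2a1pre) Gecko/20090405 Firefox/3.6a1pre" \
    > "$sample_dir/user_agents.txt"

# Dry run: show what would be used, without downloading anything.
echo "Proxy server: $(pick_random_line "$sample_dir/proxy_list.txt")"
echo "User agent: $(pick_random_line "$sample_dir/user_agents.txt")"
```

Running it a few times should show the selection changing from run to run.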


