Wget and User-Agent Header

Submitted on June 29, 2009 – 12:41 pm

As you may already know, Wget is a popular (particularly in the Unix world) command-line downloader and Web crawler application. You can read more about Wget in one of my earlier posts on the subject. One issue with Wget is that some sites block it from accessing their content. This is usually done by disallowing Wget in the robots.txt file on the Web server and by configuring the server to reject requests whose user-agent header contains “wget” (by default Wget identifies itself as “Wget/” followed by its version number).

There are a couple of things you can do to circumvent these roadblocks. The robots.txt issue is dealt with simply by adding the “-e robots=off” option to your Wget command. The option can go before or after the URL; the example below places it at the end.

wget -m -k "http://www.gnu.org/software/wget/" -e robots=off

A custom user-agent string can be set with Wget using the “-U” (or “--user-agent”) option:

wget -m -k -U "Mozilla/5.0 (X11; U; Linux i686; pl-PL; rv:1.9.0.2) Gecko/20121223 Ubuntu/9.25 (jaunty) Firefox/3.8" "http://www.gnu.org/software/wget/" -e robots=off
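
Because “-e” runs its argument as if it were a line in .wgetrc, the same two settings can also be kept in a startup file instead of being repeated on every command line. The sketch below assumes that approach; write_wgetrc and the file name ~/.wgetrc-crawl are made-up names for illustration, while “robots” and “user_agent” are the wgetrc equivalents of the options used above.

```shell
#!/bin/bash

# write_wgetrc is a hypothetical helper, not part of Wget.
# It writes a two-line startup file: ignore robots.txt and
# send the given user-agent string.
write_wgetrc() {
	local rc_file=$1 user_agent=$2
	{
		echo "robots = off"
		echo "user_agent = $user_agent"
	} > "$rc_file"
}

# Example usage (commented out because it needs network access).
# Wget reads the file named by the WGETRC environment variable.
# write_wgetrc "$HOME/.wgetrc-crawl" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2a1pre) Gecko/20090428 Firefox/3.6a1pre"
# WGETRC="$HOME/.wgetrc-crawl" wget -m -k "http://www.gnu.org/software/wget/"
```

Keeping the settings in a separate file rather than in ~/.wgetrc itself means ordinary Wget runs are not affected.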

Let’s say you wanted to use Wget to download a list of URLs, going through a different proxy server and using a different user-agent header for each URL. You will need a list of proxy servers, one host:port entry per line, that looks similar to this:

202.175.3.112:80
119.40.99.2:8080
193.37.152.236:3128
83.2.83.44:8080
151.11.83.170:80
119.70.40.101:8080
208.78.125.18:80
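
Public proxy lists tend to pick up stray blank lines and malformed entries, so it may be worth filtering the file before using it. This is only a rough sketch: is_valid_proxy and filter_proxy_list are hypothetical helper names, and the check assumes every entry is a numeric IPv4 address followed by a port, as in the list above.

```shell
#!/bin/bash

# is_valid_proxy (hypothetical helper) succeeds only for entries shaped
# like n.n.n.n:port with a port between 1 and 65535. It does not check
# that each address octet is 255 or less.
is_valid_proxy() {
	local port
	# Shape check: four dot-separated numbers, a colon, then a port
	printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}:[0-9]{1,5}$' || return 1
	# Range check on the port (everything after the last colon)
	port=${1##*:}
	[ "$port" -ge 1 ] && [ "$port" -le 65535 ]
}

# filter_proxy_list (hypothetical helper) prints only the well-formed
# entries of the file given as its first argument.
filter_proxy_list() {
	local line
	while IFS= read -r line || [ -n "$line" ]
	do
		line=${line%$'\r'}	# tolerate Windows line endings
		if is_valid_proxy "$line"
		then
			echo "$line"
		fi
	done < "$1"
}

# Example usage:
# filter_proxy_list raw_proxy_list.txt > proxy_list.txt
```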

You will need a list of URLs to be downloaded, one URL per line. And you will need a list of real user-agent strings (download one here), looking something like this:

Mozilla/5.0 (X11; U; Linux i686; pl-PL; rv:1.9.0.2) Gecko/20121223 Ubuntu/9.25 (jaunty) Firefox/3.8
Mozilla/5.0 (X11; U; Linux i686; pl-PL; rv:1.9.0.2) Gecko/2008092313 Ubuntu/9.25 (jaunty) Firefox/3.8
Mozilla/5.0 (X11; U; Linux i686; it-IT; rv:1.9.0.2) Gecko/2008092313 Ubuntu/9.25 (jaunty) Firefox/3.8
Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2a1pre) Gecko/20090428 Firefox/3.6a1pre
Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2a1pre) Gecko/20090405 Firefox/3.6a1pre

The script below will pick a random proxy server from proxy_list.txt for each URL in url_list.txt and will use a random user-agent string from user_agents.txt, hopefully making the target Web server believe that you are hundreds of different users from all over the world.

#!/bin/bash

# Count the entries in each list (one entry per line)
proxies_total=$(wc -l < proxy_list.txt)
user_agents_total=$(wc -l < user_agents.txt)

while IFS= read -r url
do
	# Select a random proxy server from proxy_list.txt
	proxy_server_random=$(sed -n "$(( RANDOM % proxies_total + 1 ))p" proxy_list.txt)

	# Set the shell HTTP proxy variable (the list holds bare host:port entries)
	export http_proxy="http://$proxy_server_random"

	# Select a random user-agent from user_agents.txt
	user_agent_random=$(sed -n "$(( RANDOM % user_agents_total + 1 ))p" user_agents.txt)

	# Download the URL
	echo "Downloading $url"
	echo "Proxy server: $proxy_server_random"
	echo "User agent: $user_agent_random"

	wget -q --proxy=on -U "$user_agent_random" "$url"
done < url_list.txt
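
The random pick is done twice in the script, once for the proxy and once for the user agent, so it can be factored into a small function. The version below is a sketch under a couple of assumptions: pick_random_line is a made-up name, and it relies on Bash's $RANDOM, which tops out at 32767 and is therefore only suitable for lists shorter than that. On systems with GNU coreutils, "shuf -n 1 file" does the same job.

```shell
#!/bin/bash

# pick_random_line (hypothetical helper) prints one randomly chosen
# line of the file given as its first argument. It returns non-zero
# if the file is empty or cannot be read.
pick_random_line() {
	local file=$1 total
	# grep -c '' counts every line, including a last line with no
	# trailing newline; it exits non-zero when the count is 0
	total=$(grep -c '' "$file") || return 1
	# $RANDOM is 0..32767, so this yields a line number in 1..total
	sed -n "$(( RANDOM % total + 1 ))p" "$file"
}

# Example usage inside the download loop:
# proxy_server_random=$(pick_random_line proxy_list.txt)
# user_agent_random=$(pick_random_line user_agents.txt)
```

Reading the line count on every call also means the lists can be edited while a long download is running.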
