Wget Google image collector

Submitted on February 1, 2006 – 1:07 pm

Google Images is an extremely useful tool for webmasters, designers, editors, and just about anybody else who’s in a hurry to find just the right photo or clipart. However, this Google tool has a couple of annoying issues. First, the collection of images is updated only once in a while, so your search results will always contain a bunch of dead links. Second, many Web sites prevent direct linking of images, so you cannot open just the image – you have to open the entire Web page where the image is located.

The idea behind the following BASH shell script is to address these two small problems, as well as to save you some time downloading images. The script reads from a list of keywords formatted exactly as you would format your Google search query. For example, to find a bunch of photos of Red Square in Moscow, you would type something like: +”red square” +moscow. This will search for “red square” as a phrase plus the word “moscow”.

The search query is passed to Google as a URL along with the desired image size (icon, small, medium, large, xlarge, xxlarge). If you are looking only for the largest photos, you would specify the image size as “xxlarge”. However, if you are looking for images sized “large” and up, you would use the “pipe” character as a logical “OR”: large|xlarge|xxlarge. See the script below for an example.
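
As an illustration, a formatted query URL might look like the one below. The host and the “imgsz” parameter name are assumptions based on Google’s image search interface of the time:

http://images.google.com/images?q=%22red+square%22+moscow&imgsz=large|xlarge|xxlarge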

The next step for the script is to count the number of search hits and download all the Google pages listing the search results. A search result may say something like: “found 250 matches, displaying images 1-20”. Once again, the script formats an appropriate URL and sends it to Google to get search results 21-40, 41-60, and so on until it reaches 250.
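
Continuing the illustration above, the follow-up pages would be requested by appending a result offset (the “start” parameter name is, again, an assumption):

http://images.google.com/images?q=%22red+square%22+moscow&imgsz=large|xlarge|xxlarge&start=20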

All the search results are parsed to extract only the image URLs linking to the actual photos on the various sites. These URLs are dumped into a file and fed to Wget. The direct-linking issue is no longer a concern, because Wget requests each image URL directly and sends no HTTP Referer header. The downloaded images are written to a separate directory for each search query. A minimum file size filter is applied to the downloaded photos: all images smaller than, say, 30 KB are deleted. Finally, ImageMagick is used to enlarge any images that are smaller than a certain size.

You will need to create a file called “keywords.txt” in the same directory as the script. The file should contain one search query per line. Here’s an example:
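
For instance (the first line is the Red Square query from above; the second is just a hypothetical additional query):

+"red square" +moscow
+"eiffel tower" +paris

Note that the quotes in the file should be plain double quotes, not the typographic quotes a word processor might substitute.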

And here’s the script:
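
What follows is a minimal sketch along the lines described above. The Google URL parameters (q, imgsz, start), the hit-count and “imgurl=” parsing, and the 640-pixel enlargement threshold are assumptions about Google’s interface of the time and may need adjusting:

#!/bin/bash
# Google image collector – a sketch of the approach described above.
# Assumptions: URL parameters (q, imgsz, start), the hit-count phrasing,
# the "imgurl=" link format, and the enlargement threshold.

SIZE='large|xlarge|xxlarge'   # icon, small, medium, large, xlarge, xxlarge
MINSIZE=30k                   # downloads smaller than this are deleted
STEP=20                       # results per Google page

while read -r QUERY; do
    [ -z "$QUERY" ] && continue

    # One download directory per search query
    DIR=$(echo "$QUERY" | tr -cd 'A-Za-z0-9 ' | tr ' ' '_')
    mkdir -p "$DIR"

    # Crude URL-encoding: spaces and double quotes only
    Q=$(echo "$QUERY" | sed -e 's/ /+/g' -e 's/"/%22/g')
    BASE="http://images.google.com/images?q=${Q}&imgsz=${SIZE}"

    # Fetch the first results page and parse the total hit count
    wget -q -O "${DIR}/page_0.html" "$BASE"
    HITS=$(grep -o 'of about [0-9,]*' "${DIR}/page_0.html" | head -1 | tr -cd '0-9')
    HITS=${HITS:-$STEP}
    [ "$HITS" -gt 1000 ] && HITS=1000   # Google serves at most ~1000 results

    # Fetch the remaining results pages (hits 21-40, 41-60, ...)
    for (( START=STEP; START<HITS; START+=STEP )); do
        wget -q -O "${DIR}/page_${START}.html" "${BASE}&start=${START}"
    done

    # Extract the direct image URLs and feed them to Wget; Wget sends
    # no Referer header, so hotlink protection is not triggered
    grep -ho 'imgurl=[^&]*' "${DIR}"/page_*.html | sed 's/^imgurl=//' | \
        sort -u > "${DIR}/urls.txt"
    wget -q -P "$DIR" -i "${DIR}/urls.txt"

    # Remove the work files, then everything under the minimum size
    rm -f "${DIR}"/page_*.html "${DIR}/urls.txt"
    find "$DIR" -type f -size -"$MINSIZE" -delete

    # Enlarge images smaller than 640x640 (ImageMagick's '<' flag
    # resizes only images below the given geometry)
    mogrify -resize '640x640<' "${DIR}"/* 2>/dev/null
done < keywords.txt

Run it from the directory containing keywords.txt; each query ends up with its own directory of downloaded images.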
