
Wget Google image collector

Submitted on February 1, 2006 – 1:07 pm

Google Images is an extremely useful tool for webmasters, designers, editors, and just about anybody else who’s in a hurry to find just the right photo or clipart. However, this Google tool has a couple of annoying issues. First, the image index is refreshed only once in a while, so any set of search results will contain a bunch of dead links. Second, many Web sites prevent direct linking to images (hotlinking), so you cannot open just the image – you have to open the entire Web page where the image is located.

The idea behind the following BASH shell script is to address these two small problems, as well as to save you some time downloading images. The script reads from a list of keywords formatted exactly as you would format your Google search query. For example, to find a bunch of photos of the Red Square in Moscow, you would type something like: +”red square” +moscow. This will search for “red square” as a phrase plus the word “moscow”.

The search query is passed to Google as a URL along with the desired image size (icon, small, medium, large, xlarge, xxlarge). If you are looking only for the largest photos, you would specify the image size as “xxlarge”. However, if you are looking for image sizes from “large” and up, you would use the “pipe” character as a logical “OR”: large|xlarge|xxlarge. See the script below for an example.
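
To make the URL format concrete, here is a minimal sketch of how such a query URL can be assembled in the shell. The images.google.com endpoint and the imgsz/svnum parameters are the 2006-era interface used by the script below; Google has long since changed this format, so treat it as an illustration only:

KEYWORD='+"red square" +moscow'
SIZE="large|xlarge|xxlarge"
URL="http://images.google.com/images?q=${KEYWORD}&svnum=100&hl=en&imgsz=${SIZE}"
echo "$URL"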

The next step is for the script to count the number of search hits and download all the Google pages listing the search results. A results page may say something like: “found 250 matches, displaying images 1-20”. Once again, the script formats an appropriate URL and sends it to Google to get search results 21-40, 41-60, and so on, until it reaches 250.
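
Expressed as a loop, the pagination is nothing more than stepping the start parameter in increments of twenty until the total match count is reached. A simplified sketch, reusing the URL variable from the previous example (250 is just the figure from above):

MATCHES=250
START=20
while [ $START -lt $MATCHES ]
do
        echo "${URL}&start=${START}"
        (( START = START + 20 ))
done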

All the search results are parsed to extract only the image URLs pointing to the actual photos on various sites. These URLs are dumped into a file and fed to Wget. The direct-linking issue is no longer a concern, because each image URL is opened directly and the HTTP Referer header is blank. The downloaded images are written to a separate directory for each search query. A minimum file-size filter is applied to the downloaded photos: all images smaller than a threshold (about 10 KB in the script below) are deleted. Finally, ImageMagick is used to enlarge any image smaller than 640x640.
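
If you only want to experiment with the download-and-filter stage by itself, Wget can read URLs straight from a file with its -i option. This is a rough standalone sketch, not part of the script; the ./photos directory and the 30 KB threshold are arbitrary examples:

# Fetch every URL listed in image_urls.txt into ./photos
wget -U Mozilla -nd -t 1 -T 5 -i image_urls.txt -P ./photos
# Drop anything under 30 KB, then enlarge the survivors to at least 640x640
find ./photos -type f -size -30k -delete
for f in ./photos/*.jpg; do convert -filter Cubic -resize '640x640<' "$f" "$f"; done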

You will need to create a file called “keywords.txt” in the same directory as the script. The file should contain one search query per line. Here’s an example:

+"red square" +moscow
+"times square" +"new york"
+bush +iraq +wmd +"big fat liar"

And here’s the script:

#!/bin/bash

# Collect and resize images from Google Images based on a list of keywords.

#------------------------------
# VARIABLES
#------------------------------

DATETIME=$(date +'%Y-%m-%d')
HOMEDIR="/WD120GB_01/misc/google"
SIZE="large|xlarge|xxlarge"
#SIZE="icon|small|medium|large|xlarge|xxlarge"

#------------------------------
# CONFIGURATION
#------------------------------

if [ ! -d "$HOMEDIR" ]
then
        echo "Home directory $HOMEDIR not found. Exiting..."
        exit 1
fi

if [ ! -f "${HOMEDIR}/keywords.txt" ]
then
        echo "Keyword list not found. Exiting..."
        exit 1
fi

#------------------------------
# IMAGE SEARCH & DOWNLOAD
#------------------------------

cat "${HOMEDIR}/keywords.txt" | while read KEYWORD
do
        # Reset per-keyword state; otherwise one existing directory
        # would cause all subsequent keywords to be skipped
        STATUS=1
        i=0
        # Turn the keyword string into a safe directory name
        OUTDIR=$(echo "$KEYWORD" | sed -e 's/[ +]/_/g' -e 's/["*]//g' -e "s/'//g" -e 's/^_//' -e 's/__*/_/g')
        OUTDIR="${HOMEDIR}/${OUTDIR}_${DATETIME}"

        if [ ! -d "${OUTDIR}" ]
        then
                mkdir "${OUTDIR}"
        else
                echo "Directory ${OUTDIR} already exists. Skipping..."
                STATUS=0
        fi

        if [ $STATUS -eq 1 ]
        then
                URL="http://images.google.com/images?q=${KEYWORD}&svnum=100&hl=en&lr=&safe=off&sa=G&imgsz=${SIZE}"
                wget -U Mozilla -O "${OUTDIR}/results_${i}.txt" "$URL" -e robots=off

                cat "${OUTDIR}/results_${i}.txt" | sed 's/href/n/g' | grep imgurl | grep imgrefurl | sed 's/imgurl=/@/g' | sed 's/&imgrefurl/@/g' | awk -F'@' '{print $2}' > "${OUTDIR}/image_urls.txt"
                results=$(cat "${OUTDIR}/results_${i}.txt" | sed 's/border/\n/g' | fgrep '&start=' | fgrep -v '&start=0' | sort | uniq | wc -l | awk '{print $1}')

                i=1
                while [ $i -lt $results ]
                do
                        (( START = i * 20 ))
                        URL="http://images.google.com/images?q=${KEYWORD}&svnum=100&hl=en&lr=&safe=off&sa=G&imgsz=${SIZE}&start=${START}"
                        wget -U Mozilla -O "${OUTDIR}/results_${i}.txt" "$URL" -e robots=off
                        cat "${OUTDIR}/results_${i}.txt" | sed 's/href/n/g' | grep imgurl | grep imgrefurl | sed 's/imgurl=/@/g' | sed 's/&imgrefurl/@/g' | awk -F'@' '{print $2}' >> "${OUTDIR}/image_urls.txt"
                        (( i = i + 1 ))
                done

                find "$OUTDIR" -type f -name "results_*.txt" -exec rm {} ;
                cat "${OUTDIR}/image_urls.txt" | fgrep '.jpg' | sort | uniq > /tmp/google_image_collector.tmp
                mv /tmp/google_image_collector.tmp "${OUTDIR}/image_urls.txt"

                if [ -f "${OUTDIR}/image_urls.txt" ]
                then
                        clear
                        COUNT=$(cat "${OUTDIR}/image_urls.txt" | wc -l | awk '{print $1}')
                        echo "Found $COUNT images matching $KEYWORD"

                        j=1
                        cat "${OUTDIR}/image_urls.txt" | while read LINE
                        do
                                wget -U Mozilla -nd -t 1 -T 5 -O "${OUTDIR}/photo_${j}.jpg" "$LINE" -e robots=off

                                # Delete anything under ~10 KB: failed downloads and tiny thumbnails
                                if [ "$(wc -c < "${OUTDIR}/photo_${j}.jpg")" -lt 10000 ]
                                then
                                        rm "${OUTDIR}/photo_${j}.jpg"
                                else
                                        # The '<' geometry flag makes ImageMagick enlarge only
                                        # images smaller than 640x640
                                        convert -filter Cubic -resize '640x640<' "${OUTDIR}/photo_${j}.jpg" "${OUTDIR}/photo_${j}.jpg"
                                        (( j = j + 1 ))
                                fi
                        done
                fi
        fi
done
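
To run it, save the script under any name you like (google_image_collector.sh here is just an example), adjust HOMEDIR to your environment, make the file executable, and launch it; everything else is driven by keywords.txt:

chmod +x google_image_collector.sh
./google_image_collector.sh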

18 Comments

  • metalelf0 says:

    Good script. I modified it a little to only download the first X images for each row. Thank you for posting this! :)

  • ramadan says:

    Great script :-)

    How would you modify it to grab only the first good image?

    Thanks

  • Milo Lamar says:

    I love your website. It has some very useful and insightful scripts. That’s an understatement. They’re life changing! :) Definitely going to blog your site and reference it in my own forums, if that’s okay? I put my website in the line above; http://www.mdlwebsolutions.com/ is a new site that will aim to contain a lot of technical web development content.

  • jack` says:

    Sounds very interesting, but I can’t get it to work. Is there any value to change or adjust, except the keywords?
    How should I name the script file (what extension?), and how are you running it?

    I kind of understand the script, but I can’t understand the proper syntax yet.

    Thanks for the help

  • misha says:

    Thanks man, this works really well. Saved me a lot of time.

  • NiCloAy says:

    Thank you for the great job!
    It’s very helpful in my script for Anki cards.

  • doc jim says:

    Works great.
    Needs to be updated for new Google images url (easy).
    Best script to date.

  • filip says:

    Many thanks. I’ve been looking for exactly this kind of script for one of my projects. My thanks to you!

  • esaruoho says:

    I hope I’m allowed to be a bit thick and ask – does this already, or can this ever, capture the original largest possible image? I’ve just made the leap to create a little folder from which images are imported into iMovie and I’m trying to create as much script-based automation as possible into getting the required bits of images from the web to be able to immediately acquire keyword-based search results of images and directly import them into iMovie.

    This & Quartz Composer & Scripting are going to keep me busy for quite a few days!

  • Maxmillie says:

    This script creates directories and files that are locked. Then it terminates because it can’t overwrite those directories. This happens even when I sudo chmod 777 the directory, and it also happens when I sudo ./file.sh. This is frustrating; has anyone experienced the same issue?

  • omed says:

    Hi dears…
    I need to download 10000 Google images of type JPG for research purposes. What is the best way of doing this?
    Please, I need clear and well-stated instructions, as I am not familiar with Wget.

  • ryan says:

    can you update this script for use with the new google images url?

    I added
    &sout=1

    to the end of the string and it still doesn’t seem to work for me.

    any thoughts?

    thanks

  • Coffee t says:

    I was just wondering how Google can make a profit by providing us a free search system. We pay nothing and Google lets us search for free. How does it work??
    Thank you
    Thank you

  • nick s says:

    I have data in a Google spreadsheet that I would like to automatically import into my WordPress website. How do I do it?

  • SteveO says:

    Please, I really need help to get back to normal google. I want to get rid of google chrome immediately! I don’t want to just change my homepage back to google but the whole internet as well. Thanks!

  • Tyler H says:

    I really like using google, but every time I shut down, even though I change the default, and even remove the option of Yahoo!, I still end up getting yahoo search defaulted the next time I go onto google chrome. Please help?

  • nasty1 says:

    On my old computer, a MacBook, whenever I clicked on the shortcut to quit Google Chrome, it would make me hold it down; but when I got a new computer, an iMac, I installed the newer version of Google Chrome, and whenever I accidentally click on the shortcut, it just quits right away. I don’t like that, since I tend to hit the wrong keys most of the time. Is there any way I can make it behave like the older version?
