File Compression Testing
For some reason I haven’t used zip much on Linux, sticking to the standard tar/gzip combo. But zip seems to be a viable alternative. While not as space-efficient, it is definitely faster; the syntax is simple; and, if you need to share the archive with a Windows user, they don’t have to Google what on earth a “tar.gz” is.
Having said that, this post is not so much about zip and how it compares to more commonly used Linux CLI archiving tools. It is more about the somewhat clever use of Bash arrays holding PIDs, and of parallel, so that’s really what you want to pay attention to in the script below. (And you can also get it here.)
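The PID-collecting pattern in isolation looks like this (a minimal sketch; the sleep calls stand in for real background work):

```shell
#!/bin/bash
# Launch background jobs and record each PID in an array
pids=()
for i in 1 2 3
do
    sleep 0.1 &        # stand-in for real work
    pids+=($!)         # $! is the PID of the most recent background job
done

# Wait on each PID individually; wait also propagates the job's exit status
for pid in "${pids[@]}"
do
    wait "$pid"
done
echo "all ${#pids[@]} jobs finished"
```

Waiting on each PID explicitly (rather than a bare wait) lets you check per-job exit statuses if you need to.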
#!/bin/bash
echo "$(grep -c ^proc /proc/cpuinfo) x$(grep -m1 ^model.name /proc/cpuinfo | awk -F: '{print $2}')"
echo "Creating a base folder for our test and a temporary file"
d=/var/tmp/test
mkdir -p "$d"
cd "$d" || exit 1
f=$(mktemp)
echo "Downloading a large text file into the temporary file"
curl -k -s https://norvig.com/big.txt > "$f"
echo "Creating a folder structure populated with files, each containing 128KB of random text"
for i in $(seq -w 01 10)
do
mkdir -p dir_${i}
echo "Populating dir_${i}"
for j in $(seq -w 001 100)
do
{ head -c 128KB <(shuf -n 10000 $f) > ./dir_${i}/file_${j} & } 2>/dev/null 1>&2
pids+=($!)
done
done
for pid in "${pids[@]}"
do
wait ${pid} 2>/dev/null 1>&2
done
echo -n "Determine the number of parallel threads based on the available cores: "
p=$(grep -c ^processor /proc/cpuinfo)
echo $p
echo ""
echo "Running a test with zip"
echo "Before: $(du -s . | awk '{print $1}')"
find . -maxdepth 1 -mindepth 1 -type d -print0 | \
{ time parallel --will-cite --gnu --null -j $p 'zip -r -q {}{.zip,} && /bin/rm -r {}' >/dev/null; } 2>&1 | \
grep real | awk '{print "Time to compress: "$2}'
echo "After: $(du -s . | awk '{print $1}')"
ls *zip | \
{ time parallel --will-cite --gnu -j $p 'unzip -q {} && /bin/rm {}' >/dev/null; } 2>&1 | \
grep real | awk '{print "Time to uncompress: "$2}'
echo ""
echo "Running a test with tar/gzip"
echo "Before: $(du -s . | awk '{print $1}')"
find . -maxdepth 1 -mindepth 1 -type d -print0 | \
{ time parallel --will-cite --gnu --null -j $p 'GZIP=-9 tar cfz {}{.tgz,} && /bin/rm -r {}' >/dev/null; } 2>&1 | \
grep real | awk '{print "Time to compress: "$2}'
echo "After: $(du -s . | awk '{print $1}')"
ls *tgz | \
{ time parallel --will-cite --gnu -j $p 'tar xfz {} && /bin/rm {}' >/dev/null; } 2>&1 | \
grep real | awk '{print "Time to uncompress: "$2}'
echo ""
echo "Running a test with tar/bzip2"
echo "Before: $(du -s . | awk '{print $1}')"
find . -maxdepth 1 -mindepth 1 -type d -print0 | \
{ time parallel --will-cite --gnu --null -j $p 'BZIP=-9 tar cfj {}{.tbz,} && /bin/rm -r {}' >/dev/null; } 2>&1 | \
grep real | awk '{print "Time to compress: "$2}'
echo "After: $(du -s . | awk '{print $1}')"
ls *tbz | \
{ time parallel --will-cite --gnu -j $p 'tar xfj {} && /bin/rm {}' >/dev/null; } 2>&1 | \
grep real | awk '{print "Time to uncompress: "$2}'
echo ""
echo "Running a test with tar/pigz"
echo "Before: $(du -s . | awk '{print $1}')"
find . -maxdepth 1 -mindepth 1 -type d -print0 | \
{ time parallel --will-cite --gnu --null -j $p 'tar cf - {} | pigz -9 > {}.tar.gz && /bin/rm -r {}' >/dev/null; } 2>&1 | \
grep real | awk '{print "Time to compress: "$2}'
echo "After: $(du -s . | awk '{print $1}')"
ls *tar.gz | \
{ time parallel --will-cite --gnu -j $p 'tar xfz {} && /bin/rm {}' >/dev/null; } 2>&1 | \
grep real | awk '{print "Time to uncompress: "$2}'
echo ""
echo "Removing test folder"
cd /var/tmp
/bin/rm -r "$d"
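A note on the {}{.zip,} and {}{.tgz,} arguments in the script: parallel substitutes {} first, and the shell then brace-expands the result into the archive name followed by the directory name. The effect is easy to see with echo:

```shell
# After parallel replaces {} with dir_01, the shell sees:
#   zip -r -q dir_01{.zip,}
# which brace-expands to:
#   zip -r -q dir_01.zip dir_01
echo dir_01{.zip,}
# -> dir_01.zip dir_01
```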


I’m probably going to stick with “pigz” as my gzip / zip replacement, since it’s explicitly designed to use multiple cores.
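For reference, a standalone pigz tar pipeline looks like this (a sketch assuming pigz is installed; by default pigz uses all online cores, so -p is optional):

```shell
# Compress: tar streams the directory, pigz compresses on all cores
tar cf - dir_01 | pigz -9 > dir_01.tar.gz

# Decompress: pigz -dc writes the decompressed stream to stdout
pigz -dc dir_01.tar.gz | tar xf -
```

The resulting archives are ordinary gzip files, so plain tar xfz reads them too.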
One thing I’ve done with parallel is effectively replace xargs with xargs() { parallel --will-cite "$@"; }. I have my .bashrc test whether parallel is installed and, if it is, replace xargs with it.
It definitely speeds up things like “find . -type f -print0 | xargs -0 md5sum”.
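The .bashrc guard can be as simple as this (a sketch; command -v checks that parallel exists before shadowing xargs with a function):

```shell
# In ~/.bashrc: shadow xargs with parallel only when parallel is available
if command -v parallel >/dev/null 2>&1
then
    xargs() { parallel --will-cite "$@"; }
fi
```

Since parallel also understands -0/--null, pipelines like find … -print0 | xargs -0 … keep working unchanged.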