File Compression Testing
For some reason I haven’t used zip much on Linux, sticking to the standard tar/gzip combo. But zip seems to be a viable alternative. While not as space-efficient, it is definitely faster; the syntax is simple; and, if you need to share the archive with a Windows user, they don’t have to Google what on earth a “tar.gz” is.
Having said that, this post is not so much about zip and how it compares to more commonly used Linux CLI archiving tools. It is more about the somewhat clever use of Bash arrays holding PIDs, and of parallel, so that’s really what you want to pay attention to in the script below. (And you can also get it here.)
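The PID-collecting pattern in isolation looks like this (a minimal sketch; the sleep calls stand in for real background work):

```shell
#!/bin/bash
# Launch background jobs and record each PID in an array
pids=()
for i in 1 2 3
do
    sleep 0.1 &        # stand-in for real work
    pids+=($!)         # $! is the PID of the most recent background job
done

# Wait on each PID individually; wait also propagates the job's exit status
for pid in "${pids[@]}"
do
    wait "$pid"
done
echo "all ${#pids[@]} jobs finished"
```

Waiting on each PID explicitly (rather than a bare wait) lets you check per-job exit statuses if you need to.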
#!/bin/bash
echo "$(grep -c ^proc /proc/cpuinfo) x$(grep -m1 ^model.name /proc/cpuinfo | awk -F: '{print $2}')"
echo "Creating a base folder for our test and a temporary file"
d=/var/tmp/test
mkdir -p "$d"
cd "$d" || exit 1
f=$(mktemp)
echo "Downloading a large text file into the temporary file"
curl -k -s https://norvig.com/big.txt > "$f"
echo "Creating a folder structure populated with files, each containing 128KB of random text"
for i in $(seq -w 01 10)
do
mkdir -p dir_${i}
echo "Populating dir_${i}"
for j in $(seq -w 001 100)
do
{ head -c 128KB <(shuf -n 10000 $f) > ./dir_${i}/file_${j} & } 2>/dev/null 1>&2
pids+=($!)
done
done
for pid in "${pids[@]}"
do
wait ${pid} 2>/dev/null 1>&2
done
echo -n "Determine the number of parallel threads based on the available cores: "
p=$(grep -c ^processor /proc/cpuinfo)
echo $p
echo ""
echo "Running a test with zip"
echo "Before: $(du -s . | awk '{print $1}')"
find . -maxdepth 1 -mindepth 1 -type d -print0 | \
{ time parallel --will-cite --gnu --null -j $p 'zip -r -q {}{.zip,} && /bin/rm -r {}' >/dev/null; } 2>&1 | \
grep real | awk '{print "Time to compress: "$2}'
echo "After: $(du -s . | awk '{print $1}')"
ls *zip | \
{ time parallel --will-cite --gnu -j $p 'unzip -q {} && /bin/rm {}' >/dev/null; } 2>&1 | \
grep real | awk '{print "Time to uncompress: "$2}'
echo ""
echo "Running a test with tar/gzip"
echo "Before: $(du -s . | awk '{print $1}')"
find . -maxdepth 1 -mindepth 1 -type d -print0 | \
{ time parallel --will-cite --gnu --null -j $p 'GZIP=-9 tar cfz {}{.tgz,} && /bin/rm -r {}' >/dev/null; } 2>&1 | \
grep real | awk '{print "Time to compress: "$2}'
echo "After: $(du -s . | awk '{print $1}')"
ls *tgz | \
{ time parallel --will-cite --gnu -j $p 'tar xfz {} && /bin/rm {}' >/dev/null; } 2>&1 | \
grep real | awk '{print "Time to uncompress: "$2}'
echo ""
echo "Running a test with tar/bzip2"
echo "Before: $(du -s . | awk '{print $1}')"
find . -maxdepth 1 -mindepth 1 -type d -print0 | \
{ time parallel --will-cite --gnu --null -j $p 'BZIP=-9 tar cfj {}{.tbz,} && /bin/rm -r {}' >/dev/null; } 2>&1 | \
grep real | awk '{print "Time to compress: "$2}'
echo "After: $(du -s . | awk '{print $1}')"
ls *tbz | \
{ time parallel --will-cite --gnu -j $p 'tar xfj {} && /bin/rm {}' >/dev/null; } 2>&1 | \
grep real | awk '{print "Time to uncompress: "$2}'
echo ""
echo "Running a test with tar/pigz"
echo "Before: $(du -s . | awk '{print $1}')"
find . -maxdepth 1 -mindepth 1 -type d -print0 | \
{ time parallel --will-cite --gnu --null -j $p 'tar cf - {} | pigz -9 > {}.tar.gz && /bin/rm -r {}' >/dev/null; } 2>&1 | \
grep real | awk '{print "Time to compress: "$2}'
echo "After: $(du -s . | awk '{print $1}')"
ls *tar.gz | \
{ time parallel --will-cite --gnu -j $p 'tar xfz {} && /bin/rm {}' >/dev/null; } 2>&1 | \
grep real | awk '{print "Time to uncompress: "$2}'
echo ""
echo "Removing test folder"
cd /var/tmp
/bin/rm -r "$d"
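A note on the {}{.zip,} and {}{.tgz,} arguments in the script: parallel substitutes {} first, and the shell then brace-expands the result into the archive name followed by the directory name. The effect is easy to see with echo:

```shell
# After parallel replaces {} with dir_01, the shell sees:
#   zip -r -q dir_01{.zip,}
# which brace-expands to:
#   zip -r -q dir_01.zip dir_01
echo dir_01{.zip,}
# -> dir_01.zip dir_01
```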


I’m probably going to stick with “pigz” as my gzip / zip replacement, since it’s explicitly designed to use multiple cores.
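For reference, a standalone pigz tar pipeline looks like this (a sketch assuming pigz is installed; by default pigz uses all online cores, so -p is optional):

```shell
# Compress: tar streams the directory, pigz compresses on all cores
tar cf - dir_01 | pigz -9 > dir_01.tar.gz

# Decompress: pigz -dc writes the decompressed stream to stdout
pigz -dc dir_01.tar.gz | tar xf -
```

The resulting archives are ordinary gzip files, so plain tar xfz reads them too.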
One thing I’ve done with parallel is effectively replace xargs with xargs() { parallel --will-cite "$@"; }. I have my .bashrc test whether parallel is installed and, if it is, replace xargs with it.
It definitely speeds up things like “find . -type f -print0 | xargs -0 md5sum”.
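The .bashrc guard can be as simple as this (a sketch; command -v checks that parallel exists before shadowing xargs with a function):

```shell
# In ~/.bashrc: shadow xargs with parallel only when parallel is available
if command -v parallel >/dev/null 2>&1
then
    xargs() { parallel --will-cite "$@"; }
fi
```

Since parallel also understands -0/--null, pipelines like find … -print0 | xargs -0 … keep working unchanged.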