
Parallel Rsync

Submitted on June 10, 2017

This is an update of the script I originally wrote five years ago and used to migrate many terabytes of production data between two NAS systems. What’s new: more efficient subfolder crawling, a more effective way of launching rsync threads, and the ability to specify rsync options from the command line.

Here’s the problem with rsync: it is a single-threaded process that must crawl the source and destination directories in their entirety, build lists of folders and files, compare them, and only then start transferring the discovered items, one by one.
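You can see the cost of this crawl for yourself: a dry run with statistics builds and compares the complete file list without copying a byte (the paths here are the same example paths used later in this post):

# --dry-run (-n) with --stats shows how large the file list is and how
# long rsync spends building it before any data would move
time rsync -aKxn --stats /mnt/source/ /mnt/target/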

This is not an issue when files are few and large. But when the files are many and small and spread throughout a deep directory structure, rsync grinds to a virtual halt. You may have a 10-gigabit network and rsync appears busy moving files, yet your network utilization is a tiny fraction of the available bandwidth. The reason lies in rsync’s single-threaded nature.
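If you want to confirm this on a live transfer, sample the interface counters while rsync runs; this assumes the sysstat package is installed for sar:

# Report per-NIC throughput every 5 seconds; during a small-file rsync
# the rxkB/s and txkB/s columns sit far below a 10-gig link's line rate
sar -n DEV 5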

The workaround I am suggesting is, basically, to launch a separate rsync for every subfolder down to a certain depth, and then one more rsync to pick up whatever files were left above that level. Flow control built into the script checks how many cores your system has and keeps the number of rsyncs running at any given time to a reasonable limit, so as not to overwhelm your machine. A bare-bones sketch of the idea follows.
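The sketch below gets the same effect in a few lines with GNU find and xargs; the full script at the end of this post adds logging, directory-list pruning, and its own thread throttling. The paths and the branch-out depth are examples only:

#!/bin/bash
# Bare-bones illustration of the approach, not the full script below
src=/mnt/source ; dst=/mnt/target ; depth=2
cd "$src" || exit 1
# One rsync per subfolder at the branch-out depth, throttled by xargs
# to one concurrent process per CPU core
find . -mindepth "$depth" -maxdepth "$depth" -type d -print0 |
    xargs -0 -P "$(nproc)" -I{} rsync -aKx --relative {} "$dst/"
# One more rsync picks up the files living above the branch-out depth
find . -maxdepth "$depth" -type f > "/tmp/orphans.$$"
rsync -aKx --files-from="/tmp/orphans.$$" . "$dst/"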

The full script is at the end of this post. Save it and create a convenient symlink at /usr/bin/rsync-parallel. Here’s how to use it:

Syntax:
rsync-parallel -o <rsync options; default: -aKHAXx> -d <branch-out depth> -s <source_dir> -t <target_dir>

Example:
rsync-parallel -o "avKx --timeout=5" -d 2 -s /mnt/source -t /mnt/target

One thing to remember: the --delete option or any of its variations will not work with this script; its purpose is initial synchronization. However, you can use the following rsync syntax to delete items from the destination that have been removed from the source. This will only delete; it will not copy anything new:
rsync -avKx --delete --ignore-non-existing --ignore-existing <source> <target>
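Since this pass deletes files, a dry run first is cheap insurance; adding -n shows what would be removed without touching anything:
rsync -avKxn --delete --ignore-non-existing --ignore-existing <source> <target>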

As a test, I created a dummy folder structure (11111 folders) populated with 110200 small files using this script:
# created 110200 files in 11111 folders totalling 1.8GB
for i in `seq 1 10`; do
  echo "Top level $i"
  for j in `seq 1 10`; do
    for k in `seq 1 10`; do
      for l in `seq 1 10`; do
        mkdir -p /archive/source/dir_${i}/dir_${j}/dir_${k}/dir_${l}
        # ten 16K files at the deepest level
        for n in `seq 1 10`; do
          dd if=/dev/zero of=/archive/source/dir_${i}/dir_${j}/dir_${k}/dir_${l}/${RANDOM}_${RANDOM}.txt bs=16K count=1 >/dev/null 2>&1
        done
      done
      # ten files one level up
      for n in `seq 1 10`; do
        dd if=/dev/zero of=/archive/source/dir_${i}/dir_${j}/dir_${k}/${RANDOM}_${RANDOM}.txt bs=16K count=1 >/dev/null 2>&1
      done
    done
    # note: ${l} here keeps its final value (10) from the finished loop above
    for n in `seq 1 10`; do
      dd if=/dev/zero of=/archive/source/dir_${i}/dir_${l}/${RANDOM}_${RANDOM}.txt bs=16K count=1 >/dev/null 2>&1
    done
  done
  for n in `seq 1 10`; do
    dd if=/dev/zero of=/archive/source/dir_${i}/${RANDOM}_${RANDOM}.txt bs=16K count=1 >/dev/null 2>&1
  done
done

First, using just rsync:
time rsync -aKx /archive/source/ /archive/target/

real    0m14.998s

Making sure everything is there:
# for i in source target; do for j in f d; do echo -e "${i}\t${j}:\t`find $i -type $j |wc -l`"; done; done; /bin/rm -r ./target/*
source  f:      110200
source  d:      11111
target  f:      110200
target  d:      11111

Now I remove everything from the target and repeat the process, this time using the script. The script returns as soon as its rsync threads are launched, so the first timer measures the launch and the second measures a loop waiting for the background rsync processes to finish:
time rsync-parallel -d 2 -s /archive/source -t /archive/target && time while [ `ps -ef | grep -c "[r]sync"` -ne 0 ]; do sleep 1; done
Level max: 4
4 /archive/source/dir_9
4 /archive/source/dir_8
4 /archive/source/dir_7
4 /archive/source/dir_6
4 /archive/source/dir_5
4 /archive/source/dir_4
4 /archive/source/dir_3
4 /archive/source/dir_2
4 /archive/source/dir_10
4 /archive/source/dir_1

real    0m0.324s

real    0m1.265s

Again, to make sure everything is there:
for i in source target; do for j in f d; do echo -e "${i}\t${j}:\t`find $i -type $j |wc -l`"; done; done; /bin/rm -r ./target/*
source  f:      110200
source  d:      11111
target  f:      110200
target  d:      11111

So the script was about ten times faster than plain rsync. Keep in mind that in this example both the source and the target were local filesystems; if even one of them were NFS-mounted, the time advantage of the script would have been even greater. Here is the full script:
#!/bin/bash
#                                      |
#                                  ___/"\___
#                          __________/ o \__________
#                            (I) (G) \___/ (O) (R)
#                                Igor Oseledko
#                           igor@comradegeneral.com
#                                 2017-06-10
# ----------------------------------------------------------------------------
# A script to use rsync to copy complex directory structures, starting several
# levels below the parent source directory and running multiple rsync threads
# at the same time to utilize the available bandwidth.
# ----------------------------------------------------------------------------
# Split fields on newlines only, so paths with spaces survive the read loops
IFS=$(echo -en "\n\b")
usage() {
  cat << EOF
Syntax:
---------------------
rsync-parallel -o <rsync options; default: -aKHAXx> -d <branch-out depth> -s <source_dir> -t <target_dir>
Example:
---------------------
rsync-parallel -d 3 -s /mnt/source -t /mnt/target
EOF
  exit 1
}
while getopts ":s:t:d:o:" OPTION; do
  case "${OPTION}" in
    s) source_dir="${OPTARG}" ;;
    t) target_dir="${OPTARG}" ;;
    d) max_depth="${OPTARG}" ;;
    o) rsync_options="${OPTARG}" ;;
    \?) echo "Unknown option: -${OPTARG}" >&2; usage ;;
    :)  echo "Missing option argument for -${OPTARG}" >&2; usage ;;
    *)  echo "Unimplemented option: -${OPTARG}" >&2; usage ;;
  esac
done

if [ -z "${source_dir}" ]; then
  echo "Source directory must be specified"
  usage
fi

if [ -z "${target_dir}" ]; then
  echo "Target directory must be specified"
  usage
fi

if [ -z "${max_depth}" ]; then
  echo "Branch-out depth must be specified"
  usage
fi

if [ -z "${rsync_options}" ]; then
  rsync_options="aKHAXx"
fi
configure() {
  if [ "${source_dir}" == "${target_dir}" ]; then
    echo "Source and target directories must not be the same! Exiting..."
    exit 1
  fi
  if [ "${max_depth}" -lt 2 ]; then
    echo "Branch-out depth must be at least 2. Exiting..."
    exit 1
  fi
  # Allow up to 30 concurrent rsync processes per CPU core
  cpu_count=$(grep -c ^processor /proc/cpuinfo)
  max_threads=$((cpu_count * 30))
  sleep_time=3
  export RSYNC="/usr/bin/rsync -${rsync_options}"
  # Random suffix keeps the work files of concurrent runs apart
  randomnum=$(( (RANDOM * 32768 + RANDOM) % 1000000 + 1 ))
  logdir="/var/log/rsync"
  if [ ! -d "${logdir}" ]; then mkdir -p "${logdir}"; fi
  cd "${logdir}"
  filelist="${logdir}/filelist_${randomnum}"
  if [ -f "${filelist}" ]; then /bin/rm -f "${filelist}"; fi
  split_prefix="${logdir}/filelist_split_${randomnum}_"
  /bin/rm -f ${split_prefix}*
  dirlist="${logdir}/dirlist_${randomnum}"
  if [ -f "${dirlist}" ]; then /bin/rm -f "${dirlist}"; fi
  tmplist="${logdir}/tmplist_${randomnum}"
  if [ -f "${tmplist}" ]; then /bin/rm -f "${tmplist}"; fi
  # Depth is measured in slash-separated path fields
  level_min=$(echo "${source_dir}" | awk -F'/' '{print NF}')
  level_max=$((level_min + max_depth - 1))
  logfile="${logdir}/`echo ${source_dir} | awk -F'/' '{print $NF}'`_`date +'%Y-%m-%d'`_${randomnum}_log.txt"
  if [ -f "${logfile}" ]; then /bin/rm -f "${logfile}"; fi
  logfile_files="${logdir}/`echo ${source_dir} | awk -F'/' '{print $NF}'`_`date +'%Y-%m-%d'`_files_${randomnum}_log.txt"
  if [ -f "${logfile_files}" ]; then /bin/rm -f "${logfile_files}"; fi
}
build_dir_list() {
  echo "`date +'%Y-%m-%d %H:%M:%S'`    Looking for directories ${max_depth} levels deep from ${source_dir}" >> "${logfile}"
  find "${source_dir}" -maxdepth ${max_depth} -mindepth 1 -mount -type d > "${dirlist}"
  touch "${tmplist}"
  echo "`date +'%Y-%m-%d %H:%M:%S'`    Pruning directory list. This may take a while..." >> "${logfile}"
  echo "Level max: $level_max"
  # Keep each subtree exactly once: directories at the maximum depth are
  # always kept; a shallower directory is kept only if none of its
  # descendants made the list (the reverse sort puts descendants first)
  sort -r "${dirlist}" | while read -r dir; do
    level=$(echo "${dir}" | awk -F'/' '{print NF}')
    if [ ${level} -eq ${level_max} ]; then
      echo "$level ${dir}"
      echo "${dir}" >> "${tmplist}"
    elif [ ${level} -gt ${level_min} ] && [ ${level} -lt ${level_max} ] && [ `grep -c "^${dir}/" "${tmplist}"` -eq 0 ]; then
      echo "$level ${dir}"
      echo "${dir}" >> "${tmplist}"
    fi
  done
  # Re-express the pruned list relative to the source directory
  sed "s@${source_dir}/@@g" < "${tmplist}" | sort > "${dirlist}"
}
build_file_list() {
  # "Orphaned" files live above the branch-out depth, so no per-directory
  # rsync thread would pick them up
  echo "`date +'%Y-%m-%d %H:%M:%S'`    Looking for orphaned files" >> "${logfile}"
  exclude_list=$(grep -v "\/" "${dirlist}" | sed 's@ @\\s@g' | awk -F'/' '{print "-not -path */"$1"/*"}' | sort | uniq)
  max_depth_file=$(awk -F'/' '{print NF}' < "${dirlist}" | sort -n | tail -1)
  find "${source_dir}" -maxdepth ${max_depth_file} -mount -type f `eval echo ${exclude_list}` -prune 2>/dev/null | sed "s@${source_dir}@\.@g" > "${filelist}"
}
report() {
  dircount=$(grep -c . "${dirlist}")
  filecount=$(grep -c . "${filelist}")
  echo "`date +'%Y-%m-%d %H:%M:%S'`    Found ${dircount} directories ${max_depth} levels deep and ${filecount} orphaned files" >> "${logfile}"
}
copy_files() {
  if [ -f "${filelist}" ]; then
    if [ `grep -c . "${filelist}"` -gt 2000 ]; then
      # A large list is split into 20 chunks, each fed to its own
      # background rsync via --files-from
      lines=$((`grep -c . "${filelist}"` / 20))
      split -l ${lines} -a 10 -d "${filelist}" "${split_prefix}"
      k=1
      find "${logdir}" -mount -type f -path "${split_prefix}[0-9]*" | while read -r filelist_split; do
        echo "`date +'%Y-%m-%d %H:%M:%S'`    Copying `wc -l < "${filelist_split}"` orphaned files found in ${filelist_split}" >> "${logfile}"
        eval ${RSYNC} \
          --log-file="${logfile_files}_${k}" \
          --files-from="${filelist_split}" "${source_dir}/" "${target_dir}/" & disown
        (( k = k + 1 ))
      done
    elif [ `grep -c . "${filelist}"` -gt 0 ]; then
      echo "`date +'%Y-%m-%d %H:%M:%S'`    Copying `wc -l < "${filelist}"` orphaned files" >> "${logfile}"
      eval ${RSYNC} \
        --log-file="${logfile_files}" \
        --files-from="${filelist}" "${source_dir}/" "${target_dir}/" & disown
    fi
  fi
}
copy_directories() {
  threads=1
  i=1
  grep . "${dirlist}" | while read -r subfolder; do
    # Pre-create each target subfolder, copying ownership and permissions
    # from the source, so the rsync threads can run independently
    if [ ! -d "${target_dir}/${subfolder}" ]; then
      echo "Creating target subfolder: ${target_dir}/${subfolder}" >> "${logfile}"
      mkdir -p "${target_dir}/${subfolder}"
      chown --reference="${source_dir}/${subfolder}" "${target_dir}/${subfolder}"
      chmod --reference="${source_dir}/${subfolder}" "${target_dir}/${subfolder}"
    else
      echo "Target subfolder already exists: ${target_dir}/${subfolder}" >> "${logfile}"
    fi
    if [ ${threads} -le ${max_threads} ]; then
      echo "`date +'%Y-%m-%d %H:%M:%S'`    Processing ${i} of ${dircount}: ${subfolder}" >> "${logfile}"
      eval ${RSYNC} --exclude .etc/ \
        "${source_dir}/${subfolder}/" "${target_dir}/${subfolder}/" & disown
      threads=$((threads + 1))
    else
      # Flow control: wait for the running rsync count to drop back
      # under the limit before launching more threads
      while [ `/bin/ps -ef | grep -c "[r]sync "` -gt ${max_threads} ]; do
        sleep ${sleep_time}
      done
      threads=1
      echo "`date +'%Y-%m-%d %H:%M:%S'`    Processing ${i} of ${dircount}: ${subfolder}" >> "${logfile}"
      eval ${RSYNC} \
        "${source_dir}/${subfolder}/" "${target_dir}/${subfolder}/" & disown
      threads=$((threads + 1))
    fi
    i=$((i + 1))
  done
}
# RUNTIME
configure
build_dir_list
build_file_list
report
copy_files
copy_directories
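
A final note: the script disowns its rsync threads and exits as soon as the last one has been launched. To block until the copy actually finishes, you can reuse the wait loop from the test above:
while [ `ps -ef | grep -c "[r]sync"` -ne 0 ]; do sleep 1; done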

 
