Dealing with Runaway Processes

Submitted on July 14, 2015 – 10:46 pm

There is no official definition of a “runaway process”. Generally, it is a process that ignores its scheduled priority. It can also be a process that enters an infinite loop. Or it can be a process that spawns a large number of new processes, causing system overflow.

A runaway process is not necessarily a defective one. A more likely cause of such behavior is unexpected input or an unexpected volume of input. The exact cause can be identified later from the application log data. The purpose of the procedure described here is to help you catch potential runaway processes.

The major identifying characteristic of a runaway process is very high CPU utilization over an extended period of time. The script below looks at the CPU time of particularly busy processes and compares it to the time elapsed since the processes were started. Preference is given to processes with the highest ratio of CPU time to elapsed time.

A ratio greater than one indicates that the process is using more than one CPU core. Your options may include modifying application settings, adding CPU cores (on a VM), renicing the process, or setting CPU affinity (see below).
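
To make the ratio concrete, here is a minimal sketch of the arithmetic in bash. The helper names to_seconds and cpu_ratio are hypothetical and are not part of the script further down. The sketch assumes time strings in the [[DD-]hh:]mm:ss form that ps prints for its cputime and etime columns.

```shell
#!/bin/bash
# Hypothetical helpers for illustration only.

# to_seconds TIME
# Convert a ps time string ([[DD-]hh:]mm:ss) to seconds.
to_seconds() {
    local t=$1 days=0 h=0 m=0 s=0
    # Split off the optional day count ("2-03:04:05" -> days=2, t=03:04:05)
    if [[ $t == *-* ]]; then
        days=${t%%-*}
        t=${t#*-}
    fi
    local IFS=:
    local -a p
    read -r -a p <<< "$t"
    case ${#p[@]} in
        3) h=${p[0]} m=${p[1]} s=${p[2]} ;;
        2) m=${p[0]} s=${p[1]} ;;
        *) return 1 ;;
    esac
    # 10# forces base 10 so that "08" and "09" are not read as octal
    echo $(( ((10#$days * 24 + 10#$h) * 60 + 10#$m) * 60 + 10#$s ))
}

# cpu_ratio CPUTIME ELAPSED
# Print CPU time divided by elapsed time, with two decimals.
cpu_ratio() {
    local c e
    c=$(to_seconds "$1") || return 1
    e=$(to_seconds "$2") || return 1
    # A process that has just started has zero elapsed seconds; report 0
    awk -v c="$c" -v e="$e" 'BEGIN { printf "%.2f\n", (e > 0 ? c / e : 0) }'
}
```

For example, cpu_ratio 2-00:00:00 1-00:00:00 prints 2.00: two days of CPU time accumulated in one day of wall-clock time, which means roughly two cores kept busy. For a live process, the two inputs can be read with ps -o cputime= -o etime= -p ${pid}.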

As a starting point, you may want to try “htop” (yum -y install htop). Sort by CPU% and see how long the busiest processes have been running. You can also accomplish this with “ps”. The command below lists threads that have accumulated a day or more of CPU time, sorted by the number of days in the TIME column:

ps -eLfww | grep -E "\s[0-9]{1,2}-[0-9]{2}:[0-9]{2}:[0-9]{2}\s" | sort -k9,9n

In the end, identifying a “runaway process” is left to your best judgement. There is no rule of thumb. Unfortunately, you need a good understanding of your application, and the script below can only serve as a general guide.

Don’t rush to kill a suspected runaway process, unless you already know it is a problem. Try lowering its runtime priority with “renice” or confining it to particular CPU cores using the taskset command.

Example:

taskset -cp 30-31 ${pid}

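This will restrict process ${pid} to cores 30 and 31. Keep in mind that taskset -p changes only the task whose ID you pass: other threads of ${pid} and child processes that are already running are unaffected by this change. To cover the threads as well, run the taskset command for each thread ID in /proc/${pid}/task/* (or use the -a option of taskset). Child processes that are already running need a separate taskset call for each of their PIDs.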
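
The loop over /proc/${pid}/task can be sketched as follows. The function name pin_threads and the DRY_RUN variable are assumptions made for this illustration; only the taskset call itself comes from the text above. With DRY_RUN set, the function prints the commands instead of running them, which is a cheap way to check what would be touched.

```shell
#!/bin/bash
# pin_threads PID CPULIST
# Apply a CPU list to every task of PID, one taskset call per ID found
# under /proc/PID/task. Hypothetical helper, for illustration only.
pin_threads() {
    local pid=$1 cpus=$2 dir
    if [ ! -d "/proc/${pid}/task" ]; then
        echo "no such process: ${pid}" >&2
        return 1
    fi
    for dir in /proc/"${pid}"/task/*; do
        [ -d "${dir}" ] || continue    # the task may have exited meanwhile
        # ${DRY_RUN:+echo} expands to "echo" only when DRY_RUN is set
        ${DRY_RUN:+echo} taskset -cp "${cpus}" "${dir##*/}"
    done
}
```

A cautious sequence would be DRY_RUN=1 pin_threads ${pid} 30-31 first, then the same call without DRY_RUN. Changing the affinity of another user's process normally requires root.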

You can obtain the number of CPU cores (logical processors) by counting the processor entries in the /proc/cpuinfo file:

grep -c ^processor /proc/cpuinfo

Alternatively, you can set CPU affinity for all processes running under a particular UID. You can also start a process with CPU affinity defined beforehand.

The advantage of this approach is that any child process spawned under the parent PID will respect the parent’s CPU affinity setting. This can be easily added to the /etc/init.d startup script for the particular application. The same is true for setting the processes’ nice level.
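
As a rough sketch of both ideas, the two functions below start a command with its CPU affinity and nice level already set, and pin every process owned by one user. The names start_confined and pin_user, the DRY_RUN variable, and the example path are hypothetical. The sketch assumes taskset, nice, and pgrep are installed, and that pgrep -u (which matches the effective UID) is an acceptable way to list a user's processes.

```shell
#!/bin/bash
# Hypothetical helpers; set DRY_RUN=1 to print the commands instead of
# running them.

# start_confined CPULIST NICE COMMAND [ARGS...]
# Launch COMMAND confined to CPULIST at the given nice level. Children
# it spawns later inherit both settings.
start_confined() {
    local cpus=$1 level=$2
    shift 2
    ${DRY_RUN:+echo} taskset -c "${cpus}" nice -n "${level}" "$@"
}

# pin_user USERNAME CPULIST
# Set the CPU affinity of every running process owned by USERNAME,
# including all of their threads (taskset -a).
pin_user() {
    local user=$1 cpus=$2 pid
    for pid in $(pgrep -u "${user}"); do
        ${DRY_RUN:+echo} taskset -a -cp "${cpus}" "${pid}"
    done
}
```

In a startup script, the launch line would then look something like start_confined 30-31 10 /opt/app/bin/server --port 8080, where the path and arguments stand in for your application's real start command.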

To install the script, save it as /var/adm/bin/runaways.sh, make it executable by root, and create a soft link:

ln -s /var/adm/bin/runaways.sh /usr/bin/runaways

The basic syntax is:
runaways | sort -rn | more

The options are as follows:
runaways [OPTION]

  -a				Look at all processes by all users
  -u <username1,username2>	Exclude processes owned by the listed users
  -l				Exclude all UIDs below 100 (default)

The output fields are:
1  - CPU time/Elapsed time (the higher the worse)
2  - CPU time in seconds
3  - Elapsed time in seconds
4  - CPU %
5  - Memory %
6  - CPU time
7  - Elapsed time
8  - Start time
9  - PID
10 - State
11 - UID
12 - Username
13 - Command
14 - Command with path and options
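
Because the ratio is the first field, the output is easy to post-process. The filter below is a small sketch that keeps only the rows at or above a chosen ratio and prints the ratio, PID, username, and command (fields 1, 9, 12, and 13). The function name over_ratio and the default threshold of 0.5 are arbitrary choices for this example, not recommendations.

```shell
#!/bin/bash
# over_ratio [THRESHOLD]
# Read runaways output on stdin and print ratio, PID, username, and
# command for rows whose ratio is at least THRESHOLD (default 0.5).
# Hypothetical helper, for illustration only.
over_ratio() {
    awk -v min="${1:-0.5}" '$1 + 0 >= min + 0 { print $1, $9, $12, $13 }'
}
```

Typical use would be runaways -a | over_ratio 0.8.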

And here’s the script:
#!/bin/bash
#
#				      |
#		                  ___/"\___
#			  __________/ o \__________
#			    (I) (G) \___/ (O) (R)
#				   Igor Os
#                              krazyworks.com
# ----------------------------------------------------------------------------
# Identify potential runaway processes
#
# There is no official definition of a runaway process. Generally, it is a process that ignores its scheduled priority.
# It can also be a process that enters an infinite loop. Or it can be a process that spawns a large number of new processes,
# causing system overflow.
#
# A runaway process is not always a process that acts in an unexpected fashion. A more likely cause for such behavior is
# unexpected input or volume of input. The exact cause will be identified later based on the application log data. The
# purpose of this script is to help you catch potential runaway processes.
#
# The major identifying characteristic of a runaway process is very high CPU utilization over an extended period of time.
# This script looks at the CPU time of particularly busy processes and compares it to the time elapsed since each process
# was started. Preference is given to processes with the highest CPU time/Elapsed time ratio.
#
# Don't rush to kill a suspected runaway process, unless you already know it is a problem. Try lowering its runtime priority
# with "renice" or confining it to particular CPU cores using the "taskset" command.
#
# Examples:
#
# taskset -cp 30-31 ${pid}
# This will restrict process ${pid} to cores 30 and 31. Keep in mind that other threads of ${pid} and child processes that
# are already running will be unaffected by this change. To cover the threads as well, run the taskset command for each
# thread ID in /proc/${pid}/task/* (or use "taskset -a"). Running child processes need their own taskset call.
#
# ----------------------------------------------------------------------------

usage() {
cat << EOF
runaways [OPTION]

  -a                            Look at all processes by all users
  -u <username1,username2>      Exclude processes owned by the listed users
  -l                            Exclude all UIDs below 100 (default)

EOF
exit 0
}

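# The first argument selects the mode; the default is -l
if [ "${1}" ]
then
	o=$(echo "${1}" | grep -Eo "[a-z]{1}")
else
	o=l
fi

# -u requires a comma-separated list of existing usernames as the second argument
if [ "${o}" == "u" ]
then
	if [ ! "${2}" ]
	then
		usage
	else
		userlist="${2}"
		for i in $(echo "${userlist}" | sed 's/,/ /g')
		do
			if ! grep -q "^${i}:" /etc/passwd
			then
				echo "Specified username ${i} must be a valid user. Exiting..."
				exit 1
			fi
		done
	fi
fi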


configure() {
	n=10	# number of busiest processes to report
	fields="%cpu,%mem,cputime,etime,stime,pid,state,uid,user,comm,command"
}

# Read ps output on stdin, keep the ${n} busiest processes, and prepend
# the CPU/elapsed ratio, CPU seconds, and elapsed seconds to each line.
report() {
	head -n "${n}" | awk '
	# Convert [[DD-]hh:]mm:ss to seconds
	function tosec(t,    p, a, k, d, h) {
		k = split(t, p, ":"); d = 0; h = 0
		if (k == 3) {
			if (split(p[1], a, "-") == 2) { d = a[1]; h = a[2] } else { h = p[1] }
		}
		return ((d * 24 + h) * 60 + p[k-1]) * 60 + p[k]
	}
	{
		c = tosec($3); e = tosec($4)
		# Guard against division by zero for processes that have just started
		printf "%.2f %d %d %s\n", (e > 0 ? c / e : 0), c, e, $0
	}' | sort -rn -k1,1
}

all_processes() {
	ps -e --no-headers -o "${fields}" --sort -%cpu 2>/dev/null | report
}

exclude_specific_users() {
	ps -u "${userlist}" --deselect --no-headers -o "${fields}" --sort -%cpu 2>/dev/null | report
}

exclude_low_uids() {
	# Build a comma-separated list of local accounts with UIDs below 100
	lowuids=$(awk -F: '$3 < 100 {printf "%s%s", s, $1; s = ","}' /etc/passwd)
	ps -u "${lowuids}" --deselect --no-headers -o "${fields}" --sort -%cpu 2>/dev/null | report
}

# RUNTIME
configure
case ${o} in
	a) all_processes ;;
	u) exclude_specific_users ;;
	l) exclude_low_uids ;;
	*) usage ;;
esac

 
