Linux and High I/O Wait

Submitted on December 21, 2008 at 12:07 am

When you look at the CPU activity of your computer, one of the parameters is iowait. This value shows how much time your CPU wastes while it waits for I/O operations to complete. These include disk read/write operations, network, IPC, etc. Is this behavior a problem and, if so, what causes it and how do you fix it? On one of the popular Unix-related forums some “genius” wrote:

The iowait “problem” is funny. It’s like when people complain that Linux is “using all my memory”. Yeah, no shit. You should be upset if you are copying files and your computer is /not/ in 100% iowait.

In reality, 100% iowait indicates that there is a problem, and in most cases a big problem that may even lead to data loss. Essentially, there is a bottleneck somewhere in the system. Maybe one of your disks is getting ready to die; or, perhaps, the NIC firmware is having problems with the latest kernel upgrade you installed. The troubleshooting process starts with the potentially more serious possibility: a bad disk.
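Before digging in, it helps to confirm the number itself. Here is a minimal sketch that computes the iowait percentage from two samples of the aggregate "cpu" line in /proc/stat. It assumes the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal); the function name iowait_pct is made up for this example.

```shell
#!/bin/sh
# iowait_pct "<cpu line 1>" "<cpu line 2>"
# Prints the percentage of CPU time spent in iowait between two samples
# of the aggregate "cpu" line from /proc/stat.
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            # Sum user..steal only; guest time is already counted in user.
            for (i = 2; i <= NF && i <= 9; i++) total += $i
            t[NR] = total
            w[NR] = $6          # 5th counter after the "cpu" label = iowait
        }
        END {
            dt = t[2] - t[1]
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (w[2] - w[1]) / dt
        }'
}

# Usage on a live Linux system: two samples taken one second apart.
# s1=$(head -n 1 /proc/stat); sleep 1; s2=$(head -n 1 /proc/stat)
# iowait_pct "$s1" "$s2"
```

This should roughly match the "wa" figure that top reports over the same interval.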

Take a quick look at /var/log/messages, /var/log/dmesg, /var/log/boot.log and any other system log files. You are looking for disk I/O errors, failed read/write operations, bad sectors, anything that indicates a hardware problem with a disk. If you don’t find anything, look for IRQ and disk controller errors. Also look for memory errors and kernel panics. The three most likely culprits of high iowait are a bad disk, faulty memory and network problems.
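Reading logs by eye gets old fast, so here is a rough sketch of a log scan. The pattern list is illustrative only and certainly not exhaustive, the function name scan_log is hypothetical, and the log locations are the usual ones under /var/log, which may differ on your distribution.

```shell
#!/bin/sh
# scan_log FILE
# Prints lines (with line numbers) that look like disk, controller or
# memory trouble. The patterns are examples, not a complete list.
scan_log() {
    grep -n -i -E 'i/o error|bad sector|medium error|kernel panic|machine check' "$1"
}

# Usage (as root): check the usual log locations if they exist.
# for f in /var/log/messages /var/log/dmesg /var/log/boot.log; do
#     [ -r "$f" ] && scan_log "$f"
# done
```

No output does not prove the hardware is healthy; it only means none of these particular strings showed up.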

If you still see nothing relevant, it is time to test your system. If possible, kick all the users off the box and shut down the Web server, database and any other user applications. Log in from the command line and stop XDM.

Open three shell windows: run “top” in one, “iostat -x 1” in the second and “find /etc -type f -print” in the third. Make sure you can see all three windows at the same time. This is a simple test that should generate some I/O activity on the system disk. Repeat this process for other disks. If you see iowait hovering near 100%, chances are you have a problem, but we don’t know what it is yet. However, now we do know that the network is probably not the cause.

Next, let’s stress the CPU but not the disks. The command below will try to create an endless zip file in /dev/null. This generates no disk activity, but loads the CPU. Keep “top” and “iostat -x 1” running in the other two windows.
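One way to do this is sketched here, assuming gzip and the coreutils timeout command are installed. It compresses the endless stream from /dev/zero and discards the result. Unlike a truly endless run, this variant stops by itself after a set number of seconds; the function name cpu_burn is made up for the example.

```shell
#!/bin/sh
# cpu_burn [SECONDS]
# Compresses an endless stream of zeros and throws the result away.
# Reads come from /dev/zero and writes go to /dev/null, so no disk is
# touched. timeout(1) stops the run; exit status 124 means "time is up".
cpu_burn() {
    timeout "${1:-60}" gzip -c < /dev/zero > /dev/null
}

# Usage: load one CPU core for 60 seconds while watching top and iostat.
# cpu_burn 60
```

A single gzip keeps only one core busy, so on a multi-core box you may want to start one copy per core.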

If you see high CPU load but low iowait, we can eliminate CPU issues, IRQ conflicts, and faulty memory. Just to be on the safe side, let’s test memory anyway:

This server has 508644 kB of RAM. Use the corresponding value from your own system for the following test:

The three MD5 values above should be identical. If they are not, your system has a faulty RAM chip.
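A sketch of such a test, under a few assumptions: MemTotal in /proc/meminfo is taken as the RAM size, /tmp has enough free space for a file that large, and md5sum is available. The helper names and the /tmp/memtest path are hypothetical.

```shell
#!/bin/sh
# mem_total_kb: print MemTotal from /proc/meminfo, in kB.
mem_total_kb() {
    awk '/^MemTotal:/ { print $2 }' /proc/meminfo
}

# md5_three_times FILE: checksum the same file three times in a row.
# On healthy hardware all three lines are identical.
md5_three_times() {
    for pass in 1 2 3; do
        md5sum "$1" | awk '{ print $1 }'
    done
}

# Usage: write roughly as much random data as there is RAM, then compare.
# dd if=/dev/urandom of=/tmp/memtest bs=1024 count="$(mem_total_kb)"
# md5_three_times /tmp/memtest
# rm -f /tmp/memtest
```

The idea is that a file about the size of RAM forces the data through most of the memory on each pass, so a flaky chip has a chance to corrupt one of the reads.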

When you have eliminated hardware problems as possible causes of high iowait, the next step is to review firmware and drivers. You are particularly interested in disk controller firmware: unstable performance with no error messages is the sign of a firmware problem. Try really hard to remember if you made any system changes recently, especially something that required a reboot, like a kernel upgrade. If this is the case, roll back the upgrade or search for upgraded firmware. You should grab a copy of Sysinfo (free 30-day trial) to help you identify the makes and models of your disks, controllers, etc.

While your disks and controllers may be tip-top, you may have a problem with a filesystem. Even if you see high iowait when accessing any filesystem, you should still check the partition where /var is mounted and the swap partition: if there is a problem there, it will manifest itself regardless of what your system is doing. But here you will run into a little issue: fsck will not scan a mounted partition and you cannot unmount /var. Let’s say these are your partitions:
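To find out which partitions hold /var and swap on your own box, a small sketch like this may help. It assumes a Linux system with /proc/mounts and /proc/swaps; the function name device_for is hypothetical. The rest of this walkthrough uses /dev/hda2 for /var and /dev/hda1 for swap.

```shell
#!/bin/sh
# device_for MOUNTPOINT [MOUNTS_FILE]
# Prints the device mounted on MOUNTPOINT, reading /proc/mounts by default.
device_for() {
    awk -v mp="$1" '$2 == mp { dev = $1 } END { if (dev != "") print dev }' "${2:-/proc/mounts}"
}

# Usage on a live system:
# device_for /var        # the partition to fsck from a live CD
# cat /proc/swaps        # lists active swap areas
# df -h                  # overview of mounted filesystems
```

If /var is not a separate mount, device_for prints nothing and /var simply lives on the root filesystem.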

You need to fsck /dev/hda2 because this is where your /var is mounted. Download KNOPPIX or an Ubuntu LiveCD, boot from the CD (without installing) and run “fsck /dev/hda2” from there. If everything looks clean, shut down your system, take the CD out and boot normally. The next step is to check swap. If you just run fsck on the swap partition, it will fail, because swap space does not contain a filesystem.

You need to disable swap on /dev/hda1 before you can scan it. Before you can do this, you need to add another swap area: a system that relies on swap should not be left without any swap space. So, to add swap on the fly, create a swap file (1 GB in this example):
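A minimal sketch of this step, using dd to write a zero-filled file. The /swapfile path and the function name make_swap_file are assumptions for illustration; pick any location with enough free space.

```shell
#!/bin/sh
# make_swap_file PATH SIZE_MB
# Creates a zero-filled file of SIZE_MB megabytes, readable by root only.
make_swap_file() {
    dd if=/dev/zero of="$1" bs=1048576 count="$2" 2>/dev/null &&
    chmod 600 "$1"
}

# Usage (as root), for the 1 GB example; /swapfile is an assumed path.
# make_swap_file /swapfile 1024
```

The chmod matters: a swap file can contain pieces of other processes' memory, so nobody but root should be able to read it.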

Now you can set up and activate the new swap file:
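Something along these lines should do it, with mkswap writing the swap signature and swapon activating the file. Because these commands change a live system, the sketch prints them by default and only executes when DRY_RUN=0 is set; the run guard, the enable_swap_file name and the /swapfile path are all part of the sketch, not system tools.

```shell
#!/bin/sh
# run CMD...: print the command when DRY_RUN=1 (the default here),
# execute it otherwise. This guard is part of the sketch.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# enable_swap_file PATH: format PATH as swap and start using it.
enable_swap_file() {
    run mkswap "$1" &&
    run swapon "$1"
}

# Usage (as root): DRY_RUN=0 enable_swap_file /swapfile
# Check the result with: cat /proc/swaps
```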

Now we need to deactivate the original swap partition. This operation may take a couple of minutes to complete:
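The deactivation itself is a single swapoff call. The sketch below adds a small check, under the assumption that /proc/swaps lists active swap areas one per line after a header; the function name swap_is_active is hypothetical.

```shell
#!/bin/sh
# swap_is_active DEVICE [SWAPS_FILE]
# Succeeds if DEVICE is listed as an active swap area (default: /proc/swaps).
swap_is_active() {
    awk -v dev="$1" 'NR > 1 && $1 == dev { found = 1 } END { exit !found }' "${2:-/proc/swaps}"
}

# Usage (as root): turn the partition off, then confirm it is gone.
# swapoff /dev/hda1
# swap_is_active /dev/hda1 || echo "/dev/hda1 is no longer used for swap"
```

swapoff has to move everything stored on the partition back into RAM or onto the other swap area, which is why it can take a while on a busy box.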

The next step is to create a standard filesystem on the old swap partition, so that fsck has something to scan:
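Here is one possible way to do it, assuming ext2 via mke2fs counts as the “standard filesystem”. The -c flag makes mke2fs scan the device for bad blocks before creating the filesystem, and the follow-up e2fsck run is an optional extra. The commands are only printed unless DRY_RUN=0 is set, because mke2fs destroys whatever is on the device; the function name check_old_swap is made up.

```shell
#!/bin/sh
# check_old_swap DEVICE
# Prints the commands by default; set DRY_RUN=0 to really run them.
# WARNING: mke2fs destroys whatever is on DEVICE.
check_old_swap() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "mke2fs -c $1"
        echo "e2fsck -f -n $1"
    else
        mke2fs -c "$1" &&   # -c: scan for bad blocks before creating the fs
        e2fsck -f -n "$1"   # -f: force a check, -n: open read-only
    fi
}

# Usage (as root, swap on the device already turned off):
# DRY_RUN=0 check_old_swap /dev/hda1
```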

The previous operation already ran fsck and so, if you see no errors, you can now re-activate your original swap space and remove the temporary swap you created:
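A sketch of the cleanup, again printing the commands unless DRY_RUN=0 is set. It assumes the temporary swap lives in /swapfile and the original partition is /dev/hda1; the run guard and the restore_swap name are part of the sketch.

```shell
#!/bin/sh
# run CMD...: print the command unless DRY_RUN=0 (sketch-only safety guard).
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

# restore_swap DEVICE SWAPFILE
restore_swap() {
    run mkswap "$1" &&      # rewrite the swap signature on the partition
    run swapon "$1" &&      # original swap is active again
    run swapoff "$2" &&     # stop using the temporary file
    run rm -f "$2"          # and delete it
}

# Usage (as root): DRY_RUN=0 restore_swap /dev/hda1 /swapfile
```

The order matters: the partition goes back online before the file is switched off, so the system is never left without swap.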

Another command commonly used for analyzing system bottlenecks is vmstat. The following example runs vmstat five times at 2-second intervals:
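In vmstat the first argument is the delay in seconds and the second is the number of samples, so the call is “vmstat 2 5”. The sketch below adds a hypothetical helper, max_wa, that pulls the worst “wa” value out of the output. It finds the column by its header name, since newer vmstat versions print extra columns after it.

```shell
#!/bin/sh
# max_wa: read vmstat output on stdin, print the highest value seen in
# the "wa" column. The column is located by name in the header line.
max_wa() {
    awk '
        $1 == "r" { for (i = 1; i <= NF; i++) if ($i == "wa") col = i; next }
        col && $col ~ /^[0-9]+$/ { if ($col + 0 > max) max = $col + 0 }
        END { print max + 0 }'
}

# Usage: five samples, two seconds apart.
# vmstat 2 5
# vmstat 2 5 | max_wa
```

Keep in mind that the first line of vmstat output shows averages since boot, not the current interval.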

Explanation of vmstat columns:

(a) procs (process-related fields):

* r: The number of processes waiting for run time.
* b: The number of processes in uninterruptible sleep.

(b) memory (memory-related fields):

* swpd: The amount of virtual memory used.
* free: The amount of idle memory.
* buff: The amount of memory used as buffers.
* cache: The amount of memory used as cache.

(c) swap (swap-related fields):

* si: Amount of memory swapped in from disk (/s).
* so: Amount of memory swapped to disk (/s).

(d) io (I/O-related fields):

* bi: Blocks received from a block device (blocks/s).
* bo: Blocks sent to a block device (blocks/s).

(e) system (system-related fields):

* in: The number of interrupts per second, including the clock.
* cs: The number of context switches per second.

(f) cpu (CPU-related fields):

These are percentages of total CPU time.

* us: Time spent running non-kernel code. (user time, including nice time)
* sy: Time spent running kernel code. (system time)
* id: Time spent idle. Prior to Linux 2.5.41, this includes IO-wait time.
* wa: Time spent waiting for IO. Prior to Linux 2.5.41, shown as zero.

If you failed to identify the cause of the iowait problem, you should consider the possibility that there is no problem: perhaps your system is handling extra load and running short on resources. Take a look at the running processes and see what’s eating up memory. Perhaps you upgraded an application and now it is using more RAM, which leads to high swapping, which leads to high disk activity, which leads to high iowait.
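To see what is eating memory, you can rank processes by resident set size. This is a minimal sketch: the helper name rank_by_rss is hypothetical, and it expects “RSS PID COMMAND” lines such as those printed by “ps -eo rss,pid,comm”.

```shell
#!/bin/sh
# rank_by_rss [N]: read "RSS PID COMMAND" lines on stdin (header allowed)
# and print the N largest by resident set size, in kB.
rank_by_rss() {
    awk '$1 ~ /^[0-9]+$/' | sort -rn | head -n "${1:-10}"
}

# Usage: the five biggest memory consumers right now.
# ps -eo rss,pid,comm | rank_by_rss 5
```

If the same application sits at the top of this list while the si and so columns in vmstat stay busy, swapping is a likely source of the iowait.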

The solutions are simple:

1. Install more RAM
2. Move swap to another disk or, even better, to a disk on a separate controller.
3. Move user applications to another disk/controller and specify default log locations outside of the system disk.

Comments

  • cml21 says:

    Igor,

    Thanks so much for this! I’ve been troubleshooting a high “wa” percentage all weekend. I haven’t beaten it yet, but with your post, I now feel like I’m no longer taking shots in the dark!

    -CML
