
December 3, 2013

Linux Wait IO Problem

Filed under: Linux | admin @ 3:27 pm

One problem related to my latest post is the wait IO problem. If top or another status command shows a high wa value, the CPUs are spending their time waiting for IO to complete.
But how do you figure out what is causing it?

First of all, take a look at how much time your server spends waiting.
You can run vmstat to see what is happening.

vmstat 2
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 252848 1252264 2449400 20381888 0 0 25 77 3 1 6 2 58 29 0
0 1 252848 1241816 2449408 20381956 0 0 34 148 3102 2892 10 3 84 3 0
0 0 252848 1257000 2449408 20382036 0 0 0 468 1336 1201 3 1 33 58 0
3 0 252848 1243228 2449412 20382332 0 0 140 134 2278 1955 7 2 90 1 0
0 1 252848 1245444 2449420 20382460 0 0 68 304 2558 2255 7 2 40 42 0
1 1 252848 1231920 2449424 20382648 0 0 0 122 1708 1356 4 1 61 28 0

Here we are interested in the second-to-last column, wa. It is non-zero in every sample and reaches 58 in one of them, which means the server keeps waiting on IO. So now we know we have a problem.
But this output says nothing about what is causing it. It only confirms that you have an IO problem.

Please note that in this result the id (idle) and wa (wait IO) columns add up to almost 100, so the CPUs are either idle or waiting on IO. This usually points to a configuration problem. In most cases it comes from the ext4 journal.
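
To put a single number on the wa column, the vmstat output can be piped through a small filter. This is only a sketch: avg_wa is a made-up helper name, and it assumes the usual vmstat header row that starts with "r b" and contains a "wa" field.

```shell
# avg_wa (hypothetical helper): read vmstat output on stdin and
# print the average of the "wa" column over all data rows.
avg_wa() {
    awk '
        # Header row: remember which field is called "wa"
        $1 == "r" && $2 == "b" { for (i = 1; i <= NF; i++) if ($i == "wa") col = i; next }
        # Data rows start with a number (the "r" value)
        col && $1 ~ /^[0-9]+$/ { sum += $col; n++ }
        END { if (n) printf "%.1f\n", sum / n }
    '
}
```

For example: vmstat 2 6 | avg_wa. Keep in mind that the first row vmstat prints is an average since boot, so a longer run gives a more honest picture.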

How to find the cause?
We can check with the ps auxf command.

ps auxf |more
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 2 0.0 0.0 0 0 ? S Nov10 0:00 [kthreadd]
root 3 0.0 0.0 0 0 ? S Nov10 1:26 \_ [migration/0]
root 4 0.0 0.0 0 0 ? S Nov10 1:04 \_ [ksoftirqd/0]
root 5 0.0 0.0 0 0 ? S Nov10 0:00 \_ [migration/0]
root 6 0.0 0.0 0 0 ? S Nov10 0:01 \_ [watchdog/0]
root 7 0.0 0.0 0 0 ? S Nov10 0:07 \_ [migration/1]
root 8 0.0 0.0 0 0 ? S Nov10 0:00 \_ [migration/1]
root 9 0.0 0.0 0 0 ? S Nov10 3:33 \_ [ksoftirqd/1]
root 10 0.0 0.0 0 0 ? S Nov10 0:01 \_ [watchdog/1]
root 11 0.0 0.0 0 0 ? S Nov10 0:03 \_ [migration/2]
root 12 0.0 0.0 0 0 ? S Nov10 0:00 \_ [migration/2]
root 13 0.0 0.0 0 0 ? S Nov10 6:11 \_ [ksoftirqd/2]
root 14 0.0 0.0 0 0 ? S Nov10 0:01 \_ [watchdog/2]
root 15 0.0 0.0 0 0 ? S Nov10 0:02 \_ [migration/3]
root 16 0.0 0.0 0 0 ? S Nov10 0:00 \_ [migration/3]

Here we are interested in the STAT column.

STAT can have these values:
D Uninterruptible sleep (usually IO)
R Running or runnable (on run queue)
S Interruptible sleep (waiting for an event to complete)
T Stopped, either by a job control signal or because it is being traced.
W paging (not valid since the 2.6.xx kernel)
X dead (should never be seen)
Z Defunct (“zombie”) process, terminated but not reaped by its parent.

Additional characters may be displayed:
< high-priority (not nice to other users)
N low-priority (nice to other users)
L has pages locked into memory (for real-time and custom IO)
s is a session leader
l is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)
+ is in the foreground process group

So from here we deduce that a process with "D" in STAT is blocked, usually waiting on IO. To track down the processes that are stuck waiting, you can use this command, which prints every process in D state once per second:

while true; do date; ps auxf | awk '{if($8=="D") print $0;}'; sleep 1; done

Tue Dec 3 13:22:06 CET 2013
Tue Dec 3 13:22:07 CET 2013
Tue Dec 3 13:22:08 CET 2013
Tue Dec 3 13:22:09 CET 2013
Tue Dec 3 13:22:10 CET 2013
root 500 0.0 0.0 0 0 ? D Nov10 2:14 \_ [jbd2/sda6-8]
nobody 14675 0.1 0.1 134932 39620 ? D 13:20 0:00 \_ /usr/local/apache/bin/httpd -k start -DSSL
Tue Dec 3 13:22:11 CET 2013
Tue Dec 3 13:22:12 CET 2013
root 981 0.0 0.0 0 0 ? D Nov10 3:13 \_ [kjournald]
Tue Dec 3 13:22:13 CET 2013
Tue Dec 3 13:22:14 CET 2013

From the result you can see that [jbd2/sda6-8] and [kjournald], the filesystem journal threads, keep showing up in D state, so the journal is where the problem is.
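
The loop above runs until you press Ctrl-C and only matches a STAT that is exactly "D", so it misses values like "D+" or "Ds". A variation, as a sketch: d_state and sample_d_state are made-up names, and the filter assumes the ps auxf layout shown earlier, where STAT is the 8th field.

```shell
# d_state (hypothetical helper): read "ps auxf" style lines on stdin
# and print those whose STAT field (8th column) starts with D.
d_state() {
    awk '$8 ~ /^D/'
}

# sample_d_state (hypothetical helper): take N one-second samples
# (default 10) and print a timestamp plus the D-state processes each time.
sample_d_state() {
    samples=${1:-10}
    i=0
    while [ "$i" -lt "$samples" ]; do
        date
        ps auxf | d_state
        sleep 1
        i=$((i + 1))
    done
}
```

For example, sample_d_state 30 > dstate.log collects half a minute of samples into a file you can look at later.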

You can also use iotop or atop to see per-process IO statistics.

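You can watch only the journal-related processes with this command. The pattern below is for device-mapper volumes, so adjust it to your own device names (for the output above it would be jbd2/sda).

while true; do ps auxf | grep D | grep -E "(jbd2/dm|kdmflush)"; sleep 1; done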
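
If you do not know in advance which process to grep for, the collected samples can be tallied instead. A rough sketch: tally_commands is a made-up name, and it assumes ps auxf style lines where STAT is field 8 and the command starts at field 11.

```shell
# tally_commands (hypothetical helper): read sampled "ps auxf" lines on
# stdin and print how many times each command was seen in D state,
# most frequent first. Timestamp lines have fewer fields and are skipped.
tally_commands() {
    awk '
        NF >= 11 && $8 ~ /^D/ {
            cmd = $11
            for (i = 12; i <= NF; i++) cmd = cmd " " $i
            sub(/^[\\_| ]+/, "", cmd)   # drop the process tree decoration
            count[cmd]++
        }
        END { for (c in count) print count[c], c }
    ' | sort -rn
}
```

For example, save the output of the sampling loop to a file for a minute and then run tally_commands < dstate.log. The commands at the top of the list are the ones that block on IO most often.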

OK, if the problem is related to the ext4 journal, I think you should consider removing it. On a development server you don't need the journal, and in production you may be better off using RAID to keep your data in good condition.
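
Before touching the journal it may be worth confirming that the filesystem really has one. A minimal sketch, assuming e2fsprogs is installed: has_journal is a made-up helper name, and /dev/sda6 is used only because that device shows up in the jbd2 thread name above.

```shell
# has_journal (hypothetical helper): read "tune2fs -l" output on stdin
# and succeed if the has_journal feature is listed.
has_journal() {
    grep '^Filesystem features:' | grep -qw 'has_journal'
}
```

For example: tune2fs -l /dev/sda6 | has_journal && echo "journal enabled". As far as I know from the tune2fs man page, the journal itself is removed with tune2fs -O ^has_journal on a filesystem that is unmounted or mounted read-only, and running e2fsck -f afterwards is a sensible follow-up.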

Regards
