
November 20, 2014

How to get an md RAID array to rebuild even with read errors

Filed under: Linux — admin @ 3:14 pm

Well, I hope this will help someone like me, because I really couldn't find all of this information on the internet.

So... my situation was a software RAID 1. In smartctl both hard drives looked pretty messed up, so I decided to swap sdb first. I removed sdb from the raid and opened a ticket with the datacenter, but they said they wouldn't change the hard drive because it tested fine. So I put sdb back in place, but I got a big surprise: the rebuild failed, and sdb3 ended up in (S) spare state.

cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sdb4[2] sda4[0]
1851802943 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sdb3[2](S) sda3[0]
1073740664 blocks super 1.2 [2/1] [U_]

md1 : active raid1 sdb2[2] sda2[0]
524276 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sdb1[2] sda1[0]
4193268 blocks super 1.2 [2/2] [UU]

unused devices: <none>
When you have 1 TB, there is a lot of waiting. So panic was on me: why is sdb3 in (S) spare state and not getting back in place? When I looked in the dmesg log, surprise: because some sectors could not be read from sda, the rebuild failed :(. This situation is not good at all, because if both hard drives each have even a single bad sector, you can end up unable to replace either drive.
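You can also ask mdadm for the array detail, which shows the state of every member (active, faulty, spare) more clearly than /proc/mdstat:

mdadm --detail /dev/md2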

So what is the solution? Well, there is no magic flag like force or skip, or maybe there is, but I did not find it.
The dmesg output looked something like this:
[7646674.321121] end_request: I/O error, dev sda, sector 1251467520
[7646674.321148] raid1: sda3: rescheduling sector 1242024160
[7646691.820040] end_request: I/O error, dev sda, sector 1251467520
[7646691.841995] raid1:md2: read error corrected (8 sectors at 1242026240 on sda3)
[7646709.327315] end_request: I/O error, dev sda, sector 1251467528
[7646709.327923] raid1:md2: read error corrected (8 sectors at 1242026248 on sda3)
[7646726.817936] end_request: I/O error, dev sda, sector 1251467536
[7646726.818461] raid1:md2: read error corrected (8 sectors at 1242026256 on sda3)
[7646744.291904] end_request: I/O error, dev sda, sector 1251467544
[7646744.292431] raid1:md2: read error corrected (8 sectors at 1242026264 on sda3)
[7646762.473787] end_request: I/O error, dev sda, sector 1251467560
[7646762.474294] raid1:md2: read error corrected (8 sectors at 1242026280 on sda3)
[8337479.557401] end_request: I/O error, dev sda, sector 1251471424
[8337479.557427] raid1: sda3: rescheduling sector 1242028000
[8337506.426294] end_request: I/O error, dev sda, sector 1251471432
[8337506.480066] raid1:md2: read error corrected (8 sectors at 1242030152 on sda3)
[8337507.764690] raid1: sda3: redirecting sector 1242028000 to another mirror
[8769133.060807] end_request: I/O error, dev sda, sector 1251476840
[8769133.060833] raid1: sda3: rescheduling sector 1242033496
[8769150.443149] end_request: I/O error, dev sda, sector 1251476840
[8769150.443726] raid1:md2: read error corrected (8 sectors at 1242035560 on sda3)
[8769150.443772] raid1: sda3: redirecting sector 1242033496 to another mirror

This is after grepping the log for sectors. A modern hard drive can reallocate a bad sector, but for that to happen a write to that sector must occur. So I started using hdparm to check and then write those sectors.
First of all I grepped dmesg to see the failing sectors:
dmesg | grep -i sector
or something like this, to find out which sectors are affected.
For me the output looked something like this:
[9328180.676069] end_request: I/O error, dev sda, sector 1251477144
[9328249.705549] end_request: I/O error, dev sda, sector 1251477144
[9338442.525476] end_request: I/O error, dev sda, sector 1251477144
[9338482.362380] end_request: I/O error, dev sda, sector 1251477144
[9328232.631537] end_request: I/O error, dev sda, sector 1251471488
[9338452.461797] end_request: I/O error, dev sda, sector 1251471488
[9338499.453213] end_request: I/O error, dev sda, sector 1251471488
[9328169.224894] end_request: I/O error, dev sda, sector 1251471456
[9328215.499044] end_request: I/O error, dev sda, sector 1251471456
[9338465.239646] end_request: I/O error, dev sda, sector 1251471456
[9338520.575210] end_request: I/O error, dev sda, sector 1251471456
[9338456.717853] end_request: I/O error, dev sda, sector 1251470641
[9338503.592660] end_request: I/O error, dev sda, sector 1251470641
[9328154.873250] end_request: I/O error, dev sda, sector 1251470640
[9328198.474859] end_request: I/O error, dev sda, sector 1251470640

They were not quite in this order, but something like this (here it is after I sorted them by sector number).
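To pull the unique failing sectors out of dmesg in one go, something like this should work (it assumes the sector number is the last field on the line, as in the output above):

dmesg | grep 'I/O error, dev sda' | awk '{print $NF}' | sort -nu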
So, in order to check a sector you can do this:

hdparm --read-sector 1251470640 /dev/sda
/dev/sda: Input/Output error

PLEASE NOTE THAT WRITING THESE SECTORS IS AT YOUR OWN RISK AND IRREVOCABLE DATA LOSS MAY OCCUR.
I STRONGLY SUGGEST HAVING A BACKUP IN CASE NOTHING ELSE WORKS AND A REINSTALL OF THE OS IS NEEDED.

In order to write a sector you have to run this:
hdparm --write-sector 1251470640 --yes-i-know-what-i-am-doing /dev/sda
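After the write, reading the same sector back should now succeed and return zeroes, since --write-sector fills the sector with zeros:

hdparm --read-sector 1251470640 /dev/sda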

And indeed, after this the read returned no error. So I thought: OK, I mark them all like this and run the rebuild again. Before writing, it is also worth checking whether the drive can still reallocate sectors:
smartctl -a /dev/sda | grep -i reallocated
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0

If these are 0, you can write the sector and the hard drive will try to reallocate it.
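The Current_Pending_Sector attribute is also worth a look: it counts sectors the drive wants to reallocate but cannot until they get written:

smartctl -a /dev/sda | grep -i pending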
So after I reallocated all those sectors, I started the rebuild again, waited two hours, and another error appeared. I took another look at dmesg and saw new bad sectors right next to the ones I had fixed first.
So I started to wonder whether there were any more, and tried reading a whole range of sectors to see if errors appear. I ran something like this:
for i in $(seq 1251470640 1251478000); do hdparm --read-sector $i /dev/sda 2>&1 | grep -i error; done
If you see any line with an error, you have to find that sector and mark it. You can also turn this into a slightly more complex script that writes a sector whenever the read fails, like the one below.

A small script

#!/bin/bash
# Scan a range of sectors; any sector that fails to read
# gets overwritten with zeros so the drive can reallocate it.
for i in $(seq 3645583000 3646853936)
do
    # --read-sector exits non-zero when the read fails
    if ! hdparm --read-sector "$i" /dev/sdb > /dev/null 2>&1; then
        echo "$i"
        hdparm --write-sector "$i" --yes-i-know-what-i-am-doing /dev/sdb
    fi
done

After I had marked them all, I managed to rebuild the raid again.
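If the spare does not start syncing on its own after the sectors are fixed, removing and re-adding it should kick off the recovery (a sketch, using the md2/sdb3 names from above):

mdadm /dev/md2 --remove /dev/sdb3
mdadm /dev/md2 --add /dev/sdb3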


September 4, 2012

Replacing a defective drive in a RAID 1

Filed under: Linux — admin @ 10:51 am

Well, yesterday I received my daily e-mail report and saw that my raid was failing.

cat /proc/mdstat
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb1[1] sda1[0]
2102464 blocks [2/2] [UU]

md1 : active raid1 sdb2[1] sda2[0]
264960 blocks [2/2] [UU]

md2 : active raid1 sdb3[2](F) sda3[0]
1462766336 blocks [2/1] [U_]

So that means sdb3 is marked as a failed drive, and U_ means the raid is degraded.
From this point I removed sdb1 and sdb2 from the raid as well, but first I marked them as failed:

mdadm --manage /dev/md1 --fail /dev/sdb2
mdadm --manage /dev/md0 --fail /dev/sdb1

mdadm /dev/md0 -r /dev/sdb1
mdadm /dev/md1 -r /dev/sdb2
mdadm /dev/md2 -r /dev/sdb3
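Before the drive gets swapped, it also helps to note its serial number so the datacenter pulls the right disk (hdparm -I prints it in the identification section):

hdparm -I /dev/sdb | grep -i serial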

After the hard drive was replaced, I had to recreate the same partition table on the new sdb and add the partitions back to the raid:

sfdisk -d /dev/sda | sfdisk /dev/sdb
mdadm /dev/md0 -a /dev/sdb1
mdadm /dev/md1 -a /dev/sdb2
mdadm /dev/md2 -a /dev/sdb3
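One caveat: if the replacement disk was previously part of another md array, it can carry a stale superblock; in that case, wipe it before the add commands above (not something I needed here):

mdadm --zero-superblock /dev/sdb1
mdadm --zero-superblock /dev/sdb2
mdadm --zero-superblock /dev/sdb3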

Now watch how your raid is recovering
watch cat /proc/mdstat
Every 2.0s: cat /proc/mdstat Tue Sep 4 09:52:52 2012

Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb1[1] sda1[0]
2102464 blocks [2/2] [UU]

md1 : active raid1 sdb2[1] sda2[0]
264960 blocks [2/2] [UU]

md2 : active raid1 sdb3[2] sda3[0]
1462766336 blocks [2/1] [U_]
[===>.................] recovery = 16.0% (234480192/1462766336) finish=412.8min speed=49580K/sec

unused devices: <none>

However, the rebuild speed may be low, so how do we increase it?

cat /proc/sys/dev/raid/speed_limit_max
200000
cat /proc/sys/dev/raid/speed_limit_min
1000

Now I increase the min limit to 50000
echo 50000 >/proc/sys/dev/raid/speed_limit_min
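The same knob is also exposed via sysctl, which is handy if you want the setting to survive a reboot (assuming a standard /etc/sysctl.conf setup):

sysctl -w dev.raid.speed_limit_min=50000
echo 'dev.raid.speed_limit_min = 50000' >> /etc/sysctl.conf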

Now if you watch cat /proc/mdstat again, you will see that the speed has improved and the estimated finish time has gone down.
