Replacing a failed drive in a software raid without rebooting

October 22, 2008

I lost a drive in one of my servers this week. Well, I didn’t lose it completely, but SMART errors were piling up. The drive was a SATA disk on /dev/sda, the boot disk, holding a member of the RAID 1 /boot array, swap space, and a member of the RAID 5 / array. These are the steps I took to replace it with no reboot necessary.
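
Those SMART errors can be spotted with smartctl from smartmontools; on the live system that would be `smartctl -A /dev/sda`. Here's a sketch against sample output (the raw values are made up for illustration):

```shell
# Sample smartctl -A attribute lines; the raw values (214, 12) are
# invented for this example. On the live system: smartctl -A /dev/sda
smart_sample='  5 Reallocated_Sector_Ct   0x0033   095   095   036    Pre-fail  Always       -       214
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       12'

# A nonzero raw value on Reallocated_Sector_Ct or Current_Pending_Sector
# is a classic sign the drive is on its way out.
printf '%s\n' "$smart_sample" | awk '$2 == "Reallocated_Sector_Ct" {print $NF}'
# prints: 214
```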

If the drive gets pulled or becomes completely unavailable before certain steps are performed, a reboot is required: it may be impossible to mark a drive as failed once it is offline, and when you plug in the replacement it may come up as /dev/sde or /dev/sdx instead of /dev/sda. But if you remove the drive from the raid sets properly via mdadm while it is still attached, this shouldn’t be an issue. So don’t be too hasty in pulling the old drive.

Make sure grub is installed on /dev/sdb, in case a reboot is needed. (Actually, I always do this when I set up software raid, just in case I lose /dev/sda completely.)

# grub
grub> root (hd1,0)
grub> setup (hd1)
grub> quit

Turn off the swap space.

# swapoff /dev/sda2

Print out the current raid config. (I’ll refer to this many times while double-checking my commands; a typo can be disastrous.)

# cat /proc/mdstat

Personalities : [raid6] [raid5] [raid4] [raid1]
md0 : active raid1 sda1[0] sdb1[1] sdc1[2] sdd1[3]
      104320 blocks [4/4] [UUUU]

md1 : active raid5 sda3[0] sdb3[1] sdc3[2] sdd3[3]
      30723840 blocks level 5, 256k chunk, algorithm 2 [4/4] [UUUU]

Mark the failing disk’s partitions as failed in the raid arrays.

# mdadm --manage /dev/md0 --fail /dev/sda1
# mdadm --manage /dev/md1 --fail /dev/sda3
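
After failing the partitions, the status string in /proc/mdstat changes from [UUUU] to [_UUU] for each array, the underscore marking the missing member. A quick way to count degraded arrays (a sketch against sample output; on the live system you would grep /proc/mdstat directly):

```shell
# Sample of /proc/mdstat after sda's partitions have been failed;
# on the live system: grep '_' /proc/mdstat
mdstat_sample='md0 : active raid1 sdb1[1] sdc1[2] sdd1[3]
      104320 blocks [4/3] [_UUU]
md1 : active raid5 sdb3[1] sdc3[2] sdd3[3]
      30723840 blocks level 5, 256k chunk, algorithm 2 [4/3] [_UUU]'

# Count degraded arrays: each line with an underscore in its status string.
printf '%s\n' "$mdstat_sample" | grep -c '_UUU'
# prints: 2
```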

Remove the failed partitions from the raid arrays.

# mdadm --manage /dev/md0 --remove /dev/sda1
# mdadm --manage /dev/md1 --remove /dev/sda3

Pull the drive and insert the new one. Check dmesg and make sure the new drive came up as the same device node, /dev/sda. If not, now is the time to reboot and hope everything comes back up.
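
A sketch of that check, pulling the device name out of the kernel's hot-plug messages (the lines below are sample dmesg output; on the live system this would be `dmesg | tail`):

```shell
# Sample dmesg lines after hot-plugging the replacement drive;
# on the live system: dmesg | tail
dmesg_sample='sd 0:0:0:0: [sda] 62914560 512-byte logical blocks
sd 0:0:0:0: [sda] Attached SCSI disk'

# Every bracketed sdX token should be sda; anything else means the kernel
# picked a different node, and a reboot is the safer path.
printf '%s\n' "$dmesg_sample" | grep -o '\[sd[a-z]*\]' | sort -u
# prints: [sda]
```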

Partition the new drive. In my case the partition tables on /dev/sda and /dev/sdb should be identical, so I copy /dev/sdb’s over.

# sfdisk -d /dev/sdb | sfdisk /dev/sda
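
To double-check the copy, dump both tables and compare them with the device names normalized. A sketch using sample `sfdisk -d` output (the start/size numbers are made up; on the live system the dumps would come from sfdisk itself):

```shell
# Sample sfdisk -d dumps for the two disks (made-up geometry); after the
# copy they should be identical except for the device names.
sda_dump='/dev/sda1 : start=63, size=208782, Id=fd
/dev/sda2 : start=208845, size=2096482, Id=82'
sdb_dump='/dev/sdb1 : start=63, size=208782, Id=fd
/dev/sdb2 : start=208845, size=2096482, Id=82'

# Normalize the device names, then compare.
a=$(printf '%s\n' "$sda_dump" | sed 's|/dev/sda|DISK|')
b=$(printf '%s\n' "$sdb_dump" | sed 's|/dev/sdb|DISK|')
[ "$a" = "$b" ] && echo "partition tables match"
# prints: partition tables match
```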

Create the swap space and turn it on.

# mkswap /dev/sda2
# swapon /dev/sda2

Add the partitions back to the raid arrays.

# mdadm --manage /dev/md0 --add /dev/sda1
# mdadm --manage /dev/md1 --add /dev/sda3
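
Once the partitions are added back, md starts rebuilding onto the new disk and /proc/mdstat reports progress; I usually just keep an eye on it with `watch cat /proc/mdstat`. A sketch pulling the percentage out of a sample recovery line (the numbers are illustrative):

```shell
# Sample resync line from /proc/mdstat while md1 rebuilds (numbers are
# made up); on the live system: watch cat /proc/mdstat
resync_sample='[=>...................]  recovery =  7.4% (2280064/30723840) finish=12.3min speed=38460K/sec'

# Extract the percent complete from the recovery line.
printf '%s\n' "$resync_sample" | grep -o '[0-9.]*%'
# prints: 7.4%
```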


Filed under: Hardware, Linux
