I have a QNAP TS-453A NAS with four 6TB disks configured as a RAID5 storage pool, currently running firmware 4.4.2.1262.
Yesterday I discovered that one of the disks (disk 3) had failed, and that the storage pool was in a "Degraded" state.
I quickly got a new disk. The old disk was a WD Red 6TB (WD60EFRX); the new one is a WD Red 6TB (WD60EFAX). I hot-swapped the disks. According to the documentation, the new disk should be detected automatically and the storage pool should start rebuilding ("Rebuilding" state). But nothing happened.
I checked the Storage & Snapshots tool in the UI. The storage pool was still in a degraded state, but all four disks were now shown as green and healthy. However, disk 3 was listed as "not a member" of the storage pool. When I opened Manage for the pool, I could do nothing: the only action that was not disabled was "Rebuild RAID Group", but when I tried it there were no free disks to add to the RAID group.
So the problem appeared to be that disk 3 had been detected and was in use, yet it was still listed as "not a member" of the storage pool, and no actions were available in the UI to fix the situation. Pulling the disk out and inserting it again did not change anything. Googling for help showed that others have encountered similar situations, but none of the suggested solutions helped me.
I decided to have a look "under the hood" to see if I could figure out what was wrong.
ssh admin@mynas

[~] # cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath]
md1 : active raid5 sda3[0] sdd3[3] sdb3[1]
      17551701504 blocks super 1.0 level 5, 512k chunk, algorithm 2 [4/3] [UU_U]

md256 : active raid1 sdc2[2](S) sdd2[3](S) sdb2[1] sda2[0]
      530112 blocks super 1.0 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md13 : active raid1 sdc4[24] sda4[0] sdd4[3] sdb4[1]
      458880 blocks super 1.0 [24/4] [UUUU____________________]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md9 : active raid1 sdc1[24] sda1[0] sdd1[3] sdb1[1]
      530048 blocks super 1.0 [24/4] [UUUU____________________]
      bitmap: 1/1 pages [4KB], 65536KB chunk
OK, so /dev/md1 is my RAID5 storage pool. Only /dev/sda3, /dev/sdb3 and /dev/sdd3 are part of the group; /dev/sdc3 is missing. Let's check the group:
[~] # mdadm --misc --detail /dev/md1
/dev/md1:
        Version : 1.0
  Creation Time : Tue Aug 23 05:48:30 2016
     Raid Level : raid5
     Array Size : 17551701504 (16738.61 GiB 17972.94 GB)
  Used Dev Size : 5850567168 (5579.54 GiB 5990.98 GB)
   Raid Devices : 4
  Total Devices : 3
    Persistence : Superblock is persistent

    Update Time : Sat Apr  4 18:10:54 2020
          State : clean, degraded
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : 1
           UUID : f82504b7:2c60d9bd:5676ec84:0a5ba214
         Events : 27378

    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync   /dev/sda3
       1       8       19        1      active sync   /dev/sdb3
       4       0        0        4      removed
       3       8       51        3      active sync   /dev/sdd3
OK, so 3 active devices, and /dev/sdc3 is missing, as expected. Let's check if the disk exists and is partitioned like the other disks:
[~] # parted
GNU Parted 3.1
Using /dev/sdc
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) select /dev/sda
Using /dev/sda
(parted) print
Model: WDC WD60EFRX-68L0BN1 (scsi)
Disk /dev/sda: 6001GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system     Name     Flags
 1      20.5kB  543MB   543MB   ext3            primary
 2      543MB   1086MB  543MB   linux-swap(v1)  primary
 3      1086MB  5992GB  5991GB                  primary
 4      5992GB  5993GB  543MB   ext3            primary
 5      5993GB  6001GB  8554MB  linux-swap(v1)  primary

(parted) select /dev/sdc
Using /dev/sdc
(parted) print
Model: WDC WD60EFAX-68SHWN0 (scsi)
Disk /dev/sdc: 6001GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system  Name     Flags
 1      20.5kB  543MB   543MB   ext3         primary
 2      543MB   1086MB  543MB                primary
 3      1086MB  5992GB  5991GB               primary
 4      5992GB  5993GB  543MB   ext3         primary
 5      5993GB  6001GB  8554MB               primary

(parted) quit
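As an aside, before touching the array it can be worth checking whether the new partition already carries an md superblock from a previous array; mdadm can inspect that directly. A quick sketch, using the same device names as above:

[~] # mdadm --examine /dev/sdc3
[~] # mdadm --examine /dev/sda3

On a freshly partitioned replacement disk I would expect the first command to report that no md superblock is detected, while the second shows the metadata of an active member of /dev/md1, which would match the "not a member" status shown in the UI.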
OK, so the new disk seems to be partitioned correctly. Let's just try to add the missing partition to the RAID group:
[~] # mdadm --manage /dev/md1 --add /dev/sdc3
mdadm: added /dev/sdc3

[~] # mdadm --misc --detail /dev/md1
/dev/md1:
        Version : 1.0
  Creation Time : Tue Aug 23 05:48:30 2016
     Raid Level : raid5
     Array Size : 17551701504 (16738.61 GiB 17972.94 GB)
  Used Dev Size : 5850567168 (5579.54 GiB 5990.98 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Sat Apr  4 18:18:17 2020
          State : active, degraded, recovering
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 512K

 Rebuild Status : 0% complete

           Name : 1
           UUID : f82504b7:2c60d9bd:5676ec84:0a5ba214
         Events : 27846

    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync   /dev/sda3
       1       8       19        1      active sync   /dev/sdb3
       4       8       35        2      spare rebuilding   /dev/sdc3
       3       8       51        3      active sync   /dev/sdd3
Great! The RAID group is recovering! The NAS emitted two beeps and the status light started blinking red/green. In the UI, the storage pool state changed to "Rebuilding".
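The rebuild can also be followed from the shell rather than the UI; progress shows up both in /proc/mdstat and in the Rebuild Status line of the detail output, so something like this works:

[~] # cat /proc/mdstat
[~] # mdadm --misc --detail /dev/md1 | grep -i rebuild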
For some reason the NAS did not correctly add the /dev/sdc3 partition to the storage pool. The disk had been correctly partitioned and the partitions formatted, and the other RAID arrays had apparently recovered, but not /dev/md1. Adding /dev/sdc3 manually to /dev/md1 fixed the problem.
One more thing: it looks like /etc/config/mdadm.conf and /etc/config/raidtab are missing; /etc/mdadm.conf and /etc/raidtab existed, but only as symbolic links to those non-existent files. I'm not sure they are needed, but as a precaution I created them. I created mdadm.conf like this:
[~] # mdadm --detail --scan >> /etc/config/mdadm.conf
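For reference, mdadm --detail --scan writes one ARRAY line per array (so md1, md256, md13 and md9 in my case), identifying each by its UUID. For /dev/md1 the line should look roughly like this (metadata version, name and UUID taken from the detail output above):

ARRAY /dev/md1 metadata=1.0 name=1 UUID=f82504b7:2c60d9bd:5676ec84:0a5ba214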
and this is the content of raidtab:
[~] # cat /etc/config/raidtab
raiddev /dev/md1
        raid-level              5
        nr-raid-disks           4
        nr-spare-disks          0
        chunk-size              4
        persistent-superblock   1
        device  /dev/sda3
        raid-disk       0
        device  /dev/sdb3
        raid-disk       1
        device  /dev/sdc3
        raid-disk       2
        device  /dev/sdd3
        raid-disk       3
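Finally, since /etc/mdadm.conf and /etc/raidtab are just symbolic links into /etc/config, a quick sanity check is that they now resolve to the newly created files:

[~] # ls -l /etc/mdadm.conf /etc/raidtab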