A couple of weeks ago I wrote about some EqualLogic madness I was experiencing with a PS4000 array that we had installed for a customer earlier this year. Because they needed additional performance and capacity, the customer bought a second array, and once that was in the group we were able to evacuate the data from the first one (we’ll call it EQL-01) and bring it back to our office for testing.
When we brought the array back to the office, it needed to be re-initialized since it was no longer part of a group. Much to my consternation, I got the array initialized and found that all of the drives were either online or marked as spares. This was not what I was hoping to see. Fortunately, after running a few tests, we were able to reproduce the error. Here are the steps we took. Please note that it is never recommended that you reseat drives or pull controllers in the fashion described below. This was a blank array with known good disks. Don’t try this at home or in a production environment!
The array begins with drives 0-13 online in a RAID-50 configuration, running on controller 1. Drives 14 and 15 have been marked as hot spares. Drive 15 is the drive that had previously disappeared.
1. Remove drive 0 (active drive), causing the array to bring drive 14 (hot spare) online and leaving one hot spare (drive 15).
2. Remove drive 4 (active drive), causing the array to attempt to bring drive 15 (hot spare) online. Drive 15 vanishes from the group manager and the LEDs on the drive turn off.
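For contrast, here is a toy Python model of what those two steps should have produced on a healthy array. It is only a sketch: `remove_drive` is a made-up helper, and picking the lowest-numbered spare first is an assumption that happens to match the order we saw (14, then 15), not a statement about how the EqualLogic firmware actually chooses a spare.

```python
def remove_drive(online, spares, drive):
    """Pull an active drive and promote a hot spare in its place, if one is left.

    Assumption: the lowest-numbered spare is promoted first.
    """
    online = set(online) - {drive}
    spares = sorted(spares)
    if spares:
        online.add(spares.pop(0))
    return online, spares


online = set(range(14))  # drives 0-13 active in the RAID set
spares = [14, 15]        # drives 14 and 15 marked as hot spares

online, spares = remove_drive(online, spares, 0)  # step 1: drive 14 takes over
online, spares = remove_drive(online, spares, 4)  # step 2: drive 15 should take over

print(sorted(online), spares)
```

In the model, both spares end up online and the spare list is empty. On our array, step 2 never got that far because drive 15 dropped out of the group instead of being promoted.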
At this point, we attempted to reseat drive 15. Under normal circumstances, this would mark the drive as a hot spare, but nothing happened…except that the reporting logs indicated that the drive had been removed and then reinserted. Yet there was no drive there. We suspected the controller might be at fault, so we pulled controller 1, forcing an emergency failover to controller 0, but the drive didn’t come back, so we rebooted. Since controller 1 was not physically attached to the array, the unit was forced to boot with controller 0.
Once the array rebooted, drive 15 suddenly reappeared, ready to go as a hot spare. We reinserted controller 1 and rebooted the array again, which booted to controller 1 this time, and drive 15 vanished. Sounds like a problem with the controller. I got on the phone with Dell and was able to get a replacement controller (it turns out that when the case has been open for three weeks, they give you what you want pretty quickly). The controller showed up the next day, and after shutting down the array, replacing the controller (and remembering to move the flash memory from the original controller to the replacement controller), and booting the array back up, all the drives came back online.
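The reasoning that pointed us at the controller is just a matter of lining up which controller the array booted on against whether drive 15 showed up afterwards. A minimal Python sketch of that bookkeeping, using only the three post-reboot checks described above (the labels and the `suspects` helper are mine, not anything from the EqualLogic tooling):

```python
# (controller the array booted on, was drive 15 visible after that boot?)
observations = [
    ("controller 0", True),            # rebooted with controller 1 pulled
    ("controller 1", False),           # controller 1 reinserted, rebooted
    ("replacement controller", True),  # after swapping in the new controller
]


def suspects(observations):
    """Return the controllers under which the drive was missing after a boot."""
    return sorted({controller for controller, visible in observations if not visible})


print(suspects(observations))
```

With only three data points this is hardly proof, but the one boot where the drive was missing is the one on the original controller 1.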
Ultimately, it looks like this entire problem was the result of a controller going bad. Now if we can just stop dropping disks on the arrays we order (we’ve lost somewhere between 8 and 12 disks across 4 arrays now), we should be all set.