Help. Replaced unavail disk with new disk but still unavail - S11.3 zpool
Hi. We have a Solaris 11.3 system that was purchased many years ago. The company is no longer dealing with Solaris installs.
I am the sysadmin but know next to nothing about Solaris, though I do know a bit of Linux. Our Solaris unit has 36 SAS hard drives. One of them in tank1 has a red light (c15t1d30). I have replaced it with an identical, brand-new drive. Another slot (c15t1d8) went through the same issue about a year ago and is in the same situation, so I would like to fix that one as well.
My question is: can someone please help me bring c15t1d30 back online? I have followed the Oracle instructions on replacing a drive with a new one in the same slot, and nothing I do seems to bring it back. The drive remains in the unavail state with the red light on. Below is my 'zpool status -v' output and a list of the things I have tried:
Code:
# zpool status -v
What I have tried:
Code:
# zpool offline tank1 c15t1d30
Code:
# zpool replace tank1 c15t1d30
- zpool status -v showed no change at all, i.e. still unavail
Code:
# zpool clear tank1 c15t1d30
Code:
# zpool online tank1 c15t1d30
- put old drive back in, then:
Code:
# devfsadm -Cv
- replace with new hard drive again
Code:
# devfsadm -Cv
Code:
# zpool replace tank1 c15t1d30 (again)
Code:
# fmadm faulty
Code:
# fmadm repaired zfs://pool=64513e8f0e484ee2/vdev=ea0091c4caa611ec/pool_name=tank1/vdev_name=id1,sd@n600100404f361ca0a2900c8d00000000/a
Current state is the same as the original condition except "write errors 0":
Code:
# zpool status -v
So basically, can anyone help get this drive back online or shed any insight?
Thanks!
Jono |
IS THIS SERVER PERFORMING A CRITICAL FUNCTION?
Before taking ANY steps I would evaluate how much investment it is worth to keep this system going, and whether to recover it as it was or replace it entirely. Once past that:
Step one: be absolutely certain that you have everything about the system well documented and that you have multiple verified full backups of all critical data. One would hope you do this regularly anyway, but when hardware starts failing it becomes immediate and critical.
Step two: if the recovery and replacement steps are not working, there is almost certainly a good reason. If the cause is in hardware, there may not be a great replacement plan. I would verify the hardware (this requires a hardware engineer familiar with that platform). A field engineer may also have recovery advice and pointers to documentation that will help you.
Step three: while awaiting the engineer, get working on a full platform replacement plan. I do not think anyone can purchase that system new today, and I would bet it is long out of support. That means someone should have planned the replacement long ago. Since they did not, it now falls to you. I would investigate HP servers and consider RHEL or SUSE as solid, supported operating systems that should handle anything that generation of Solaris server could do, although not quite in the same ways. Also, while local storage has advantages, there are very fast SAN options that can be faster and more reliable than any local storage that is not SSD based. (If you go with local SSD storage, know that it only takes seven SSD drives to max out the channel bandwidth of a fast/wide SCSI controller and turn it into a choke point. If you need maximum performance and choose SSD, you will need to limit the active drives per controller to six. With that in mind, a SAN with rotational drives but LOTS of cache may be the better deal.) This plan will not be wasted. You should use it no matter what happens with the server, but if the engineer can give you a path to recovery on the old hardware, you can take longer to plan, budget, and sell the migration.
Step four: react to what you learn from the engineer. The state of the hardware and the available parts and options will dictate the direction of your next steps. |
Thanks so much for your reply.
The server is more of a storage server with the 'not so critical' data on it. I should be clear that it has not failed. All the data is safe, and I have backups as well. I just need to repair the 2 disks that have died in it. I have physically replaced the failed drives with brand-new, identical drives, and I thought I followed the instructions from Oracle correctly, but the 2 drives remain in an 'unavail' state. That is the bit I cannot figure out. It may just be related to this message, which I do not understand:
Code:
cannot label 'c15t1d30': try using fdisk(1M) and then provide a specific slice
Best,
Jono |
Is this an x86 machine?
If yes, then you have to run format and fdisk on the disk before Solaris can use it. Take a look at this link: https://docs.oracle.com/cd/E19683-01...qva/index.html |
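To make the format/fdisk suggestion above a bit more concrete, here is a minimal dry-run sketch of the sequence that may apply. It only prints the commands and runs nothing. It assumes an x86 host, that ZFS is given the whole disk, and that the new drive shows up as /dev/rdsk/<disk>p0; none of that is confirmed in this thread. plan_replace is a made-up helper name for illustration only.

```shell
#!/bin/sh
# Dry-run sketch: prints the relabel-and-replace sequence instead of running it.
# Assumptions (not confirmed in this thread): x86 host, whole disk given to ZFS,
# and the device node /dev/rdsk/<disk>p0 exists for the new drive.

# plan_replace is a hypothetical helper name used only for this illustration.
plan_replace() {
    pool=$1
    disk=$2
    # Write a default fdisk table with one Solaris partition spanning the disk.
    echo "fdisk -B /dev/rdsk/${disk}p0"
    # Retry the in-place replacement now that the disk can be labeled.
    echo "zpool replace ${pool} ${disk}"
    # Check resilver progress and whether the vdev has left the unavail state.
    echo "zpool status -v ${pool}"
}

plan_replace tank1 c15t1d30
plan_replace tank1 c15t1d8
```

If the printed lines look right for your system, they can be run by hand as root, one disk at a time, waiting for the resilver to finish before starting the second disk. If fdisk cannot open the device at all, the problem is probably below ZFS (controller or device node), and relabeling will not help.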