How to replace a failed Disk which is under Veritas Volume Manager Control.
Issue
How to replace a failed Disk which is under Volume Manager control.
Solution
There are several ways to replace a failed Disk.
· Disk failed, but came back online after some time.
· Disk failed, corrupt, needs to be replaced, and spare Disk is already available in the configuration.
· Disk failed, corrupt, needs to be replaced, and new Disk needs to be added to the configuration.
Disk failed, but came back online after some time.
Let's assume we have the following configuration.
Two Disks (c1t13d0, c1t14d0) in a Disk group (testdg), one mirrored Volume
(testvol01):
c1t13d0s2    auto:cdsdisk    disk01       testdg       online
c1t14d0s2    auto:cdsdisk    disk02       testdg       online
c1t15d0s2    auto:none       -            -            online invalid

dg testdg       default      default  24000    1223026544.30.jerome
dm disk01       c1t13d0s2    auto     65536    35774960 -
dm disk02       c1t14d0s2    auto     65536    35774960 -
v  testvol01    -            ENABLED  ACTIVE   2097152  SELECT    -        fsgen
pl testvol01-01 testvol01    ENABLED  ACTIVE   2097152  CONCAT    -        RW
sd disk01-01    testvol01-01 disk01   0        2097152  0         c1t13d0  ENA
pl testvol01-02 testvol01    ENABLED  ACTIVE   2097152  CONCAT    -        RW
sd disk02-02    testvol01-02 disk02   2097152  2097152  0         c1t14d0  ENA
Now, c1t13d0 went offline because of a SAN issue.
c1t13d0s2    auto            -            -            error
c1t14d0s2    auto:cdsdisk    disk02       testdg       online
c1t15d0s2    auto:none       -            -            online invalid
-            -               disk01       testdg       failed was:c1t13d0s2

dg testdg       default      default  24000    1223026544.30.jerome
dm disk01       -            -        -        -        NODEVICE
dm disk02       c1t14d0s2    auto     65536    35774960 -
v  testvol01    -            ENABLED  ACTIVE   2097152  SELECT    -        fsgen
pl testvol01-01 testvol01    DISABLED NODEVICE 2097152  CONCAT    -        RW
sd disk01-01    testvol01-01 disk01   0        2097152  0         -        RLOC
pl testvol01-02 testvol01    ENABLED  ACTIVE   2097152  CONCAT    -        RW
sd disk02-02    testvol01-02 disk02   2097152  2097152  0         c1t14d0  ENA
You can see that c1t13d0 is now showing as failed, and the Plex is disabled.
After some time, the SAN was back online, and the Disk was available again on
the System.
c1t13d0s2    auto:cdsdisk    -            (testdg)     online
c1t14d0s2    auto:cdsdisk    disk02       testdg       online
c1t15d0s2    auto:none       -            -            online invalid
-            -               disk01       testdg       failed was:c1t13d0s2
You can see it is now back online, but still showing as failed.
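If you want to spot this state from a script rather than by eye, the listing can be filtered for the records without a device. The following is only a minimal sketch: the function name failed_disks is made up for this example, and it assumes the five-column layout shown above. The sample input is copied from the listing; on a live system you would pipe the output of vxdisk list into the function instead.

```shell
#!/bin/sh
# Sketch: print disk media records that have lost their device.
# Reads "vxdisk list"-style text on stdin and prints
# "<disk> <group> <state> <old device>" for each failed or removed record.
# failed_disks is a hypothetical helper name, not a VxVM command.
failed_disks() {
    awk '$1 == "-" && ($5 == "failed" || $5 == "removed") {
        sub(/^was:/, "", $6)     # strip the "was:" prefix from the old device
        print $3, $4, $5, $6
    }'
}

# Sample input taken from the listing above.
failed_disks <<'EOF'
c1t13d0s2    auto:cdsdisk    -            (testdg)     online
c1t14d0s2    auto:cdsdisk    disk02       testdg       online
c1t15d0s2    auto:none       -            -            online invalid
-            -               disk01       testdg       failed was:c1t13d0s2
EOF
```

For the sample above this prints one line naming disk01, its Disk group, its state and the device it used to live on.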
First, run
# vxreattach -c c1t13d0s2
to check whether the Disk can be reattached.
If the check succeeds, run
# vxreattach -rb c1t13d0s2
This will reattach the Disk to its old Disk Media name inside its old Disk
group, and run a vxrecover in the background if needed.
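The check-then-reattach sequence can be wrapped in a small script. This is a sketch under assumptions, not a tested procedure: the run helper and the DRY_RUN variable are invented for this example, and it assumes that vxreattach -c returns a non-zero exit status when a reattach is not possible, which you should verify on your release. With DRY_RUN=1 (the default here) the commands are only printed.

```shell
#!/bin/sh
# Sketch: check first, reattach only if the check succeeds.
# DRY_RUN and run() are hypothetical helpers for this example.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

reattach_disk() {
    device=$1
    # -c only checks whether a reattach is possible; nothing is changed.
    # Assumption: a failed check is reported through the exit status.
    run vxreattach -c "$device" || return 1
    # -r recovers stale plexes, -b runs that recovery in the background.
    run vxreattach -rb "$device"
}

reattach_disk c1t13d0s2
```

Set DRY_RUN=0 only after reviewing the printed commands.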
We can check that with
# vxtask list
TASKID  PTID TYPE/STATE    PCT   PROGRESS
   206           PARENT/R  0.00% 1/0(1) VXRECOVER disk01 testdg
   207   207     ATCOPY/R 04.10% 0/2097152/86016 PLXATT testvol01 testvol01-01 testdg
We should be back to normal after the vxrecover is finished.
c1t13d0s2    auto:cdsdisk    disk01       testdg       online
c1t14d0s2    auto:cdsdisk    disk02       testdg       online
c1t15d0s2    auto:none       -            -            online invalid
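To know when the recovery has finished without rerunning vxtask list by hand, the task listing can be counted in a loop. The sketch below only shows the counting part on the sample output from above; count_tasks is a made-up helper name, and the polling loop in the comment is an untested suggestion. Note that vxtask list shows all tasks on the System, not only those of testdg.

```shell
#!/bin/sh
# Sketch: count the tasks in "vxtask list"-style output read from stdin.
# The header line starts with TASKID; every other non-empty line is a task.
# count_tasks is a hypothetical helper name.
count_tasks() {
    awk 'NF > 0 && $1 != "TASKID" { n++ } END { print n + 0 }'
}

# Possible polling loop on a live system (not executed here):
#   while [ "$(vxtask list | count_tasks)" -gt 0 ]; do sleep 30; done

# Sample input taken from the vxtask list output above.
count_tasks <<'EOF'
TASKID  PTID TYPE/STATE    PCT   PROGRESS
   206           PARENT/R  0.00% 1/0(1) VXRECOVER disk01 testdg
   207   207     ATCOPY/R 04.10% 0/2097152/86016 PLXATT testvol01 testvol01-01 testdg
EOF
```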
Disk failed, corrupt, needs to be replaced, and spare Disk is already
available in the configuration.
We are taking the same configuration as before, but this time the Disk is
corrupt and needs to be replaced.
c1t13d0s2    auto            -            -            error
c1t14d0s2    auto:cdsdisk    disk02       testdg       online
c1t15d0s2    auto:none       -            -            online invalid
-            -               disk01       testdg       failed was:c1t13d0s2
As we already have a disk available (c1t15d0), we can use that for replacement.
There are two ways to achieve this.
vxdiskadm, which will help you get the Disk replaced by running all the needed
commands in the background.
Or you can run the commands by yourself. I will show you both.
First the vxdiskadm way:
Run vxdiskadm and select option 5.
# vxdiskadm
Volume Manager Support Operations
Menu: VolumeManager/Disk
 1      Add or initialize one or more disks
 2      Encapsulate one or more disks
 3      Remove a disk
 4      Remove a disk for replacement
 5      Replace a failed or removed disk
 6      Mirror volumes on a disk
 7      Move volumes from a disk
 8      Enable access to (import) a disk group
 9      Remove access to (deport) a disk group
 10     Enable (online) a disk device
 11     Disable (offline) a disk device
 12     Mark a disk as a spare for a disk group
 13     Turn off the spare flag on a disk
 14     Unrelocate subdisks back to a disk
 15     Exclude a disk from hot-relocation use
 16     Make a disk available for hot-relocation use
 17     Prevent multipathing/Suppress devices from VxVM's view
 18     Allow multipathing/Unsuppress devices from VxVM's view
 19     List currently suppressed/non-multipathed devices
 20     Change the disk naming scheme
 21     Get the newly connected/zoned disks in VxVM view
 22     Change/Display the default disk layouts
 23     Mark a disk as allocator-reserved for a disk group
 24     Turn off the allocator-reserved flag on a disk
 list   List disk information
 ?      Display help about menu
 ??     Display help about the menuing system
 q      Exit from menus
Select an operation to perform: 5
On the next page we select the Disk to be replaced, and the Disk which we are
going to use for the replacement
Replace a failed or removed disk
Menu: VolumeManager/Disk/ReplaceDisk
  Use this menu operation to specify a replacement disk for a disk
  that you removed with the "Remove a disk for replacement" menu
  operation, or that failed during use.  You will be prompted for
  a disk name to replace and a disk device to use as a replacement.
  You can choose an uninitialized disk, in which case the disk will
  be initialized, or you can choose a disk that you have already
  initialized using the Add or initialize a disk menu operation.
Select a removed or failed disk [<disk>,list,q,?] list
Disk group: testdg
DM NAME         DEVICE       TYPE     PRIVLEN  PUBLEN   STATE
dm disk01       -            -        -        -        NODEVICE
Select a removed or failed disk [<disk>,list,q,?] disk01
  The following devices are available as possible replacements after being
  initialized (or reinitiliazed):
        c0t0d0 c1t15d0
  You can choose one of these devices to replace disk01.
  Choose "none" to abort the replacement of disk01.
Choose a device, or select "none"
[<device>,none,q,?] (default: c0t0d0) c1t15d0
  VxVM INFO V-5-2-378
  The requested operation is to initialize disk device c1t15d0 and
  to then use that device to replace the removed or failed disk
  disk01 in disk group testdg.
Continue with operation? [y,n,q,?] (default: y) y
Use FMR for plex resync? [y,n,q,?] (default: n) n
  VxVM INFO V-5-2-282
  Replacement of disk disk01 in group testdg with disk device
  c1t15d0 completed successfully.
Replace another disk? [y,n,q,?] (default: n) n
Now, what it does is initialize the Disk, add it to the Disk group with the old
Disk Media name, and run a vxrecover on the Volume.
c1t13d0s2    auto            -            -            error
c1t14d0s2    auto:cdsdisk    disk02       testdg       online
c1t15d0s2    auto:cdsdisk    disk01       testdg       online

dg testdg       default      default  24000    1223026544.30.jerome
dm disk01       c1t15d0s2    auto     65536    35774960 -
dm disk02       c1t14d0s2    auto     65536    35774960 -
v  testvol01    -            ENABLED  ACTIVE   2097152  SELECT    -        fsgen
pl testvol01-01 testvol01    ENABLED  ACTIVE   2097152  CONCAT    -        RW
sd disk01-01    testvol01-01 disk01   0        2097152  0         c1t15d0  ENA
pl testvol01-02 testvol01    ENABLED  ACTIVE   2097152  CONCAT    -        RW
sd disk02-02    testvol01-02 disk02   2097152  2097152  0         c1t14d0  ENA
The only thing left here is to remove the failed Disk (c1t13d0) so that it can be physically replaced.
You can do this with option 3 in vxdiskadm.
I will show you how to add a new Disk in the next scenario.
Now, the same can be done manually on the command line.
If we go back to our problem
c1t13d0s2    auto            -            -            error
c1t14d0s2    auto:cdsdisk    disk02       testdg       online
c1t15d0s2    auto:none       -            -            online invalid
-            -               disk01       testdg       failed was:c1t13d0s2
Here are the steps needed to replace the Disk on the command line.
We will initialize the Disk, add it to the Disk group with the same Media Name,
and then run a recovery in the background
# vxdisksetup -i c1t15d0 format=cdsdisk
# vxdg -g testdg -k adddisk disk01=c1t15d0s2
# vxrecover -b testvol01
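If you replace Disks regularly, it can help to generate these three commands from the names involved and review them before running anything. The function below is a sketch only: replace_plan is a name made up for this example, it prints the commands instead of executing them, and it assumes the Solaris-style s2 suffix used throughout this article.

```shell
#!/bin/sh
# Sketch: print the command sequence for replacing a failed disk with a spare.
# Nothing is executed; review the plan, then run the lines by hand.
# replace_plan is a hypothetical helper name.
replace_plan() {
    dg=$1        # Disk group, e.g. testdg
    dm=$2        # Disk Media name to keep, e.g. disk01
    device=$3    # replacement device without slice, e.g. c1t15d0
    volume=$4    # Volume to recover, e.g. testvol01
    echo "vxdisksetup -i $device format=cdsdisk"
    # Assumption: the device is addressed with the s2 suffix, as in this article.
    echo "vxdg -g $dg -k adddisk $dm=${device}s2"
    echo "vxrecover -b $volume"
}

replace_plan testdg disk01 c1t15d0 testvol01
```

Called with the names from this example, it prints exactly the three commands shown above.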
After the recovery is done, the configuration looks like this:
c1t13d0s2    auto            -            -            error
c1t14d0s2    auto:cdsdisk    disk02       testdg       online
c1t15d0s2    auto:cdsdisk    disk01       testdg       online

dg testdg       default      default  24000    1223026544.30.jerome
dm disk01       c1t15d0s2    auto     65536    35774960 -
dm disk02       c1t14d0s2    auto     65536    35774960 -
v  testvol01    -            ENABLED  ACTIVE   2097152  SELECT    -        fsgen
pl testvol01-01 testvol01    ENABLED  ACTIVE   2097152  CONCAT    -        RW
sd disk01-01    testvol01-01 disk01   0        2097152  0         c1t15d0  ENA
pl testvol01-02 testvol01    ENABLED  ACTIVE   2097152  CONCAT    -        RW
sd disk02-02    testvol01-02 disk02   2097152  2097152  0         c1t14d0  ENA
Finally, we need to remove the corrupt Disk from Volume Manager so that it can be physically replaced.
# vxdisk rm c1t13d0s2
I will show you how to add a new Disk in the next scenario.
Disk failed, corrupt, needs to be replaced, and new Disk needs to be
added to the configuration.
In the last two scenarios we either replaced the failed Disk with the same
Disk, or one which was already added to the configuration.
Now I will show you how to replace a failed Disk by physically removing the
failed one, and get it replaced by a new Disk.
Here is our configuration:
c1t13d0s2    auto:cdsdisk    disk01       testdg       online
c1t14d0s2    auto:cdsdisk    disk02       testdg       online

dg testdg       default      default  24000    1223026544.30.jerome
dm disk01       c1t13d0s2    auto     65536    35774960 -
dm disk02       c1t14d0s2    auto     65536    35774960 -
v  testvol01    -            ENABLED  ACTIVE   2097152  SELECT    -        fsgen
pl testvol01-01 testvol01    ENABLED  ACTIVE   2097152  CONCAT    -        RW
sd disk01-01    testvol01-01 disk01   0        2097152  0         c1t13d0  ENA
pl testvol01-02 testvol01    ENABLED  ACTIVE   2097152  CONCAT    -        RW
sd disk02-02    testvol01-02 disk02   2097152  2097152  0         c1t14d0  ENA
Now, again, c1t13d0 fails.
c1t13d0s2    auto            -            -            error
c1t14d0s2    auto:cdsdisk    disk02       testdg       online
-            -               disk01       testdg       failed was:c1t13d0s2
We first need to remove the failed disk for replacement.
This is option 4 in vxdiskadm
Volume Manager Support Operations
Menu: VolumeManager/Disk
 1      Add or initialize one or more disks
 2      Encapsulate one or more disks
 3      Remove a disk
 4      Remove a disk for replacement
 5      Replace a failed or removed disk
 6      Mirror volumes on a disk
 7      Move volumes from a disk
 8      Enable access to (import) a disk group
 9      Remove access to (deport) a disk group
 10     Enable (online) a disk device
 11     Disable (offline) a disk device
 12     Mark a disk as a spare for a disk group
 13     Turn off the spare flag on a disk
 14     Unrelocate subdisks back to a disk
 15     Exclude a disk from hot-relocation use
 16     Make a disk available for hot-relocation use
 17     Prevent multipathing/Suppress devices from VxVM's view
 18     Allow multipathing/Unsuppress devices from VxVM's view
 19     List currently suppressed/non-multipathed devices
 20     Change the disk naming scheme
 21     Get the newly connected/zoned disks in VxVM view
 22     Change/Display the default disk layouts
 23     Mark a disk as allocator-reserved for a disk group
 24     Turn off the allocator-reserved flag on a disk
 list   List disk information
 ?      Display help about menu
 ??     Display help about the menuing system
 q      Exit from menus
Select an operation to perform: 4
Then we select the failed disk to be removed.
Remove a disk for replacement
Menu: VolumeManager/Disk/RemoveForReplace
  Use this menu operation to remove a physical disk from a disk
  group, while retaining the disk name.  This changes the state
  for the disk name to a "removed" disk.  If there are any
  initialized disks that are not part of a disk group, you will be
  given the option of using one of these disks as a replacement.
Enter disk name [<disk>,list,q,?] list
Disk group: testdg
DM NAME         DEVICE       TYPE     PRIVLEN  PUBLEN   STATE
dm disk01       -            -        -        -        NODEVICE
dm disk02       c1t14d0s2    auto     65536    35774960 -
Enter disk name [<disk>,list,q,?] disk01
  VxVM NOTICE V-5-2-371
  The following volumes will lose mirrors as a result of this
  operation:
        testvol01
  No data on these volumes will be lost.
  VxVM NOTICE V-5-2-381
  The requested operation is to remove disk disk01 from disk group
  testdg.  The disk name will be kept, along with any volumes using
  the disk, allowing replacement of the disk.
  Select "Replace a failed or removed disk" from the main menu
  when you wish to replace the disk.
Continue with operation? [y,n,q,?] (default: y) y
VxVM INFO V-5-2-265 Removal of disk disk01 completed successfully.
Remove another disk? [y,n,q,?] (default: n) n
After this we should see the following output:
c1t13d0s2    auto            -            -            error
c1t14d0s2    auto:cdsdisk    disk02       testdg       online
-            -               disk01       testdg       removed was:c1t13d0s2
Now you can remove the disk and replace it with another one.
Once you have replaced it, we need to let Volume Manager know that there is a
new disk.
Run
# vxdiskconfig
VxVM INFO V-5-2-1401 This command may take a few minutes to complete execution
Executing Solaris command: devfsadm (part 1 of 2) at 10:35:05 BST
Executing VxVM command: vxdctl enable (part 2 of 2) at 10:35:18 BST
Command completed at 10:35:21 BST
It will use devfsadm to check the OS for new devices, and after that it will
run vxdctl enable to add them to Volume Manager.
Now we should see a new disk.
c1t13d0s2    auto            -            -            error
c1t14d0s2    auto:cdsdisk    disk02       testdg       online
c1t15d0s2    auto:none       -            -            online invalid
-            -               disk01       testdg       removed was:c1t13d0s2
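On a System with many devices it is easy to miss which one is new. One way to find it is to save the disk listing before and after the rescan and compare the two. The sketch below does that on the two listings from this scenario; new_devices and the temporary file names are made up for this example, and on a live system the two files would hold the output of vxdisk list taken before and after vxdiskconfig.

```shell
#!/bin/sh
# Sketch: print device names present in the second listing but not the first.
# Both arguments are files holding "vxdisk list"-style output.
# new_devices is a hypothetical helper name.
new_devices() {
    before=$1
    after=$2
    awk 'FILENAME == ARGV[1] { seen[$1] = 1; next }
         $1 != "-" && $1 != "DEVICE" && !($1 in seen) { print $1 }' \
        "$before" "$after"
}

tmp=$(mktemp -d)
# Listing before the rescan (from the output further above).
cat > "$tmp/before" <<'EOF'
c1t13d0s2    auto            -            -            error
c1t14d0s2    auto:cdsdisk    disk02       testdg       online
-            -               disk01       testdg       removed was:c1t13d0s2
EOF
# Listing after the rescan.
cat > "$tmp/after" <<'EOF'
c1t13d0s2    auto            -            -            error
c1t14d0s2    auto:cdsdisk    disk02       testdg       online
c1t15d0s2    auto:none       -            -            online invalid
-            -               disk01       testdg       removed was:c1t13d0s2
EOF
new_devices "$tmp/before" "$tmp/after"
rm -rf "$tmp"
```

For these two listings the only device reported is the newly visible c1t15d0s2.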
Run vxdiskadm again, and select option 5 to replace the removed disk
Volume Manager Support Operations
Menu: VolumeManager/Disk
 1      Add or initialize one or more disks
 2      Encapsulate one or more disks
 3      Remove a disk
 4      Remove a disk for replacement
 5      Replace a failed or removed disk
 6      Mirror volumes on a disk
 7      Move volumes from a disk
 8      Enable access to (import) a disk group
 9      Remove access to (deport) a disk group
 10     Enable (online) a disk device
 11     Disable (offline) a disk device
 12     Mark a disk as a spare for a disk group
 13     Turn off the spare flag on a disk
 14     Unrelocate subdisks back to a disk
 15     Exclude a disk from hot-relocation use
 16     Make a disk available for hot-relocation use
 17     Prevent multipathing/Suppress devices from VxVM's view
 18     Allow multipathing/Unsuppress devices from VxVM's view
 19     List currently suppressed/non-multipathed devices
 20     Change the disk naming scheme
 21     Get the newly connected/zoned disks in VxVM view
 22     Change/Display the default disk layouts
 23     Mark a disk as allocator-reserved for a disk group
 24     Turn off the allocator-reserved flag on a disk
 list   List disk information
 ?      Display help about menu
 ??     Display help about the menuing system
 q      Exit from menus
Select an operation to perform: 5
Now select the removed disk, and the new one to be used as a replacement
Replace a failed or removed disk
Menu: VolumeManager/Disk/ReplaceDisk
  Use this menu operation to specify a replacement disk for a disk
  that you removed with the "Remove a disk for replacement" menu
  operation, or that failed during use.  You will be prompted for
  a disk name to replace and a disk device to use as a replacement.
  You can choose an uninitialized disk, in which case the disk will
  be initialized, or you can choose a disk that you have already
  initialized using the Add or initialize a disk menu operation.
Select a removed or failed disk [<disk>,list,q,?] list
Disk group: testdg
DM NAME         DEVICE       TYPE     PRIVLEN  PUBLEN   STATE
dm disk01       -            -        -        -        REMOVED
Select a removed or failed disk [<disk>,list,q,?] disk01
  The following devices are available as possible replacements after being
  initialized (or reinitiliazed):
        c0t0d0 c1t15d0
  You can choose one of these devices to replace disk01.
  Choose "none" to abort the replacement of disk01.
Choose a device, or select "none"
[<device>,none,q,?] (default: c0t0d0) c1t15d0
  VxVM INFO V-5-2-378
  The requested operation is to initialize disk device c1t15d0 and
  to then use that device to replace the removed or failed disk
  disk01 in disk group testdg.
Continue with operation? [y,n,q,?] (default: y) y
Use FMR for plex resync? [y,n,q,?] (default: n) n
  VxVM INFO V-5-2-282
  Replacement of disk disk01 in group testdg with disk device
  c1t15d0 completed successfully.
Replace another disk? [y,n,q,?] (default: n) n
It will automatically run a vxrecover in the background.
Once this is done, we can remove the old disk entry.
# vxdisk rm c1t13d0s2
And then we should be back to normal
c1t14d0s2    auto:cdsdisk    disk02       testdg       online
c1t15d0s2    auto:cdsdisk    disk01       testdg       online

dg testdg       default      default  24000    1223026544.30.jerome
dm disk01       c1t15d0s2    auto     65536    35774960 -
dm disk02       c1t14d0s2    auto     65536    35774960 -
v  testvol01    -            ENABLED  ACTIVE   2097152  SELECT    -        fsgen
pl testvol01-01 testvol01    ENABLED  ACTIVE   2097152  CONCAT    -        RW
sd disk01-01    testvol01-01 disk01   0        2097152  0         c1t15d0  ENA
pl testvol01-02 testvol01    ENABLED  ACTIVE   2097152  CONCAT    -        RW
sd disk02-02    testvol01-02 disk02   2097152  2097152  0         c1t14d0  ENA
You can do the same from the command line.
Instead of using option 4 in vxdiskadm, you use
# vxdg -g testdg -k rmdisk disk01
to remove the disk, then run
# vxdiskconfig
# vxdisksetup -i c1t15d0 format=cdsdisk
to get the OS to scan for the new disk, add it to Volume Manager, and initialize
it for use.
Then you run
# vxdg -g testdg -k adddisk disk01=c1t15d0s2
# vxrecover -b testvol01
# vxdisk rm c1t13d0s2
to add the new disk with the old Disk Media name, recover the mirror and remove
the old disk entry from Volume Manager.
Once the recovery is done, we should see the following:
c1t14d0s2    auto:cdsdisk    disk02       testdg       online
c1t15d0s2    auto:cdsdisk    disk01       testdg       online

dg testdg       default      default  24000    1223026544.30.jerome
dm disk01       c1t15d0s2    auto     65536    35774960 -
dm disk02       c1t14d0s2    auto     65536    35774960 -
v  testvol01    -            ENABLED  ACTIVE   2097152  SELECT    -        fsgen
pl testvol01-01 testvol01    ENABLED  ACTIVE   2097152  CONCAT    -        RW
sd disk01-01    testvol01-01 disk01   0        2097152  0         c1t15d0  ENA
pl testvol01-02 testvol01    ENABLED  ACTIVE   2097152  CONCAT    -        RW
sd disk02-02    testvol01-02 disk02   2097152  2097152  0         c1t14d0  ENA
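Whichever way you replaced the Disk, the final check is the same: both Plexes of the Volume must be ENABLED and ACTIVE again. As a sketch, this check can be scripted against the listing. The helper name unhealthy_plexes is made up for this example, and the column positions are assumed from the output shown in this article; on a live system the input would be the same kind of listing taken for testdg.

```shell
#!/bin/sh
# Sketch: print every plex (stdin: listing as shown above) that is not
# ENABLED/ACTIVE. Exit status is 0 when all plexes are healthy, 1 otherwise.
# unhealthy_plexes is a hypothetical helper name.
unhealthy_plexes() {
    awk '$1 == "pl" && !($4 == "ENABLED" && $5 == "ACTIVE") {
             print $2, $4, $5    # plex name, kernel state, state
             bad = 1
         }
         END { exit bad + 0 }'
}

# Sample input taken from the final listing above.
unhealthy_plexes <<'EOF' && echo "all plexes ENABLED/ACTIVE"
v  testvol01    -            ENABLED  ACTIVE   2097152  SELECT    -        fsgen
pl testvol01-01 testvol01    ENABLED  ACTIVE   2097152  CONCAT    -        RW
sd disk01-01    testvol01-01 disk01   0        2097152  0         c1t15d0  ENA
pl testvol01-02 testvol01    ENABLED  ACTIVE   2097152  CONCAT    -        RW
sd disk02-02    testvol01-02 disk02   2097152  2097152  0         c1t14d0  ENA
EOF
```

Fed with the listing from the failure state earlier in this article, it would instead name testvol01-01 with its DISABLED and NODEVICE states and return a non-zero status.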