LVM VG Metadata Corruption with ‘Checksum error’

The Problem

“Checksum error” messages are reported when running LVM commands such as vgs, lvs and pvs on a CentOS/RHEL server.

# vgs
 /dev/mapper/cx0009_lun45: Checksum error
 /dev/mapper/cx0009_lun48: Checksum error
 VG #PV #LV #SN Attr VSize VFree 
 vg00 1 7 0 wz--n- 279.12G 159.12G
 vgcommrmandb 1 6 0 wz--n- 20.00G 44.00M
 vgcotsoracle 1 1 0 wz--n- 20.00G 4.00M
 vgcotsorapit 1 1 0 wz--n- 50.00G 4.00M
...
# lvs
 /dev/mapper/cx0009_lun45: Checksum error
 /dev/mapper/cx0009_lun48: Checksum error
 LV VG Attr LSize Origin Snap% Move Log Copy% Convert
 crashvol vg00 -wi-ao 64.00G 
 homevol vg00 -wi-ao 4.00G 
 oemagentvol vg00 -wi-ao 10.00G 
 rootvol vg00 -wi-ao 10.00G 
 swapvol vg00 -wi-ao 16.00G 
 tmpvol vg00 -wi-ao 8.00G 
...
# pvs
 /dev/mapper/cx0009_lun45: Checksum error
 /dev/mapper/cx0009_lun48: Checksum error
 PV VG Fmt Attr PSize PFree 
 /dev/cciss/c0d0p2 vg00 lvm2 a-- 279.12G 159.12G
 /dev/mapper/cx0008_lun37 vgeflxwmq lvm2 a-- 5.00G 1.00G
 /dev/mapper/cx0009_lun30 vgeflxjvastb lvm2 a-- 40.00G 8.04G
 /dev/mapper/cx0009_lun31 vgeflxhdb1arch lvm2 a-- 60.00G 20.00M

Solution

LVM2 stores a checksum with its metadata so that corruption can be detected before any data is actually damaged. The error is reported when the stored checksum does not match the checksum that LVM calculates when it processes the metadata.
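To illustrate the idea only, the sketch below uses the POSIX cksum utility as a stand-in. It is not the checksum routine that LVM2 uses internally, and the sample metadata text and the function names crc_of and verify are made up for this example. A checksum recorded at write time is compared with one recomputed at read time, and a change in the text makes the two differ.

```shell
#!/bin/sh
# Illustration only: cksum stands in for LVM2's internal checksum.

# Print the CRC of the string passed as the first argument.
crc_of() {
    printf '%s' "$1" | cksum | cut -d' ' -f1
}

# Compare a stored checksum with one recomputed from the current text.
verify() {
    stored=$1
    text=$2
    if [ "$(crc_of "$text")" = "$stored" ]; then
        echo "OK"
    else
        echo "Checksum error"
    fi
}

original='vg_os { seqno = 7 }'   # hypothetical metadata text
stored=$(crc_of "$original")     # checksum recorded at write time

verify "$stored" "$original"             # text unchanged since the checksum was stored
verify "$stored" 'vg_os { seqno = 8 }'   # text altered after the checksum was stored
```

The first call reports OK and the second reports a checksum error, which is the same kind of mismatch the LVM commands above are warning about.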

A checksum error can have several causes, including the following:

  1. Two hosts independently attempt to update the LVM2 metadata at the same time (for example, in a cluster) and clustered LVM (clvm) is not in use.
  2. I/O errors occur while the metadata is being updated (LVM2 updates are not journaled, so an interrupted update can cause corruption).
  3. There is a problem in the SAN environment, if the underlying paths come from a SAN.

To resolve this error, follow the steps given below:

1. Back up all the data on the logical volumes.

2. Stop all services that use LVM resources, so that the volumes can be unmounted and the volume groups deactivated. If the error is reported on a cluster, the service must not be running on any node in the cluster.
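Once the services are stopped, the unmount and deactivate part of this step might look like the sketch below. The helper names run and release_vg, the DRY_RUN variable and the mount point /mnt/data are assumptions made for illustration; vg_os is the volume group name used in the examples in this article. By default the script only prints what it would do.

```shell
#!/bin/sh
# Sketch: release a volume group before restoring its metadata.
# DRY_RUN=1 (the default here) only prints the commands.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Unmount each given mount point, then deactivate the volume group.
release_vg() {
    vg=$1
    shift
    for mnt in "$@"; do
        run umount "$mnt"
    done
    run vgchange -an "$vg"
}

# /mnt/data is a hypothetical mount point; substitute your own.
release_vg vg_os /mnt/data
```

With the default settings this prints "would run: umount /mnt/data" followed by "would run: vgchange -an vg_os". Setting DRY_RUN=0 as root executes the commands instead. How the services themselves are stopped depends on the site and is not covered here.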

3. Restore the metadata using the ‘vgcfgrestore’ command. LVM metadata backup files are stored in /etc/lvm/backup and /etc/lvm/archive, and by default vgcfgrestore uses the backup file in /etc/lvm/backup. For example,

# vgcfgrestore vg_os
/dev/mapper/cx0009_lun45: Checksum error
/dev/mapper/cx0009_lun48: Checksum error
Restored volume group vg_os
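Before running the real restore, it can help to see which backups exist and to rehearse the operation. The sketch below assumes your version of vgcfgrestore supports the --list and --test options (check the man page on your system). The function name preview_restore and the VGCFGRESTORE override variable are hypothetical and exist only so the calls can be previewed.

```shell
#!/bin/sh
# Sketch: inspect the available metadata backups and rehearse a restore.
# VGCFGRESTORE can be overridden (for example with "echo") to preview the calls.
VGCFGRESTORE=${VGCFGRESTORE:-vgcfgrestore}

preview_restore() {
    vg=$1
    # List the backup and archive files known for this volume group.
    $VGCFGRESTORE --list "$vg" || return 1
    # Run the restore in test mode, which should not update the metadata.
    $VGCFGRESTORE --test "$vg"
}
```

A possible invocation, run as root, is `preview_restore vg_os`. Running it with VGCFGRESTORE=echo set first shows the two command lines without touching LVM.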

4. Activate the volume group.

# vgchange -ay vg_os
1 logical volume(s) in volume group "vg_os" now active

5. Run the “pvscan” command to verify that the “checksum errors” are no longer reported.

# pvscan
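The check can also be scripted. The sketch below is one way to do it, using a hypothetical helper named check_lvm_output that reads command output on standard input and looks for the message text shown earlier in this article.

```shell
#!/bin/sh
# Sketch: report whether LVM output still mentions a checksum error.
# Reads the text to check from standard input.
check_lvm_output() {
    if grep -q 'Checksum error'; then
        echo "checksum errors still present"
        return 1
    fi
    echo "no checksum errors reported"
}
```

A possible use is `pvscan 2>&1 | check_lvm_output`. Standard error is merged in here on the assumption that the error lines may not be written to standard output.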

6. Re-enable any services that were stopped prior to the vgcfgrestore.

Conclusion

The vgcfgrestore command can restore a backup of the LVM metadata, taken before the corruption occurred, to the LVM physical volumes. By default it restores the metadata from the backup file in /etc/lvm/backup. If you have a backup file in another location, you can specify it with the -f option as shown below.

# vgcfgrestore -f /path/to/backup/file vgname
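When the file to pass to -f comes from /etc/lvm/archive, a small helper can show the most recently modified archive for a volume group. This is a sketch under assumptions: the function name newest_archive is hypothetical, and it assumes archive files are named with the volume group name followed by an underscore and ending in .vg. The newest archive is not necessarily the right one, so inspect the file before restoring from it.

```shell
#!/bin/sh
# Sketch: print the most recently modified archive file for a volume group.
# Assumes archive files are named <vgname>_<something>.vg (an assumption).
newest_archive() {
    dir=$1
    vg=$2
    # ls -t sorts newest first; head keeps only the first match.
    ls -t "$dir"/"$vg"_*.vg 2>/dev/null | head -n 1
}
```

A possible use is `newest_archive /etc/lvm/archive vg_os`, which prints a path that can then be reviewed and passed to vgcfgrestore -f. Note that the pattern would also match archives of a volume group whose name merely begins with the same text followed by an underscore.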