

Assess and Recover XFS File System Corruption

Applies to CA6000 and CA6300 appliances

Problem: On restart of the appliance, a kernel panic preceded by an XFS code call stack similar to the following is displayed:

RIP [<ffffffff883cf607>] :xfs:xfs_error_report+0xf/0x58
RSP <ffff81028c817c28>
CR2: 0000000000000118
<0> Kernel panic - not syncing - Fatal exception

Resolution: We recommend recovering a corrupted XFS file system as soon as possible. Typically, XFS file system corruption causes a Linux kernel panic and system halt similar to the above.
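Because a halted appliance cannot run commands, the check below is only a sketch for console text captured elsewhere, for example from a serial console log. The function name `has_xfs_panic_signature` and the file name `console.log` are hypothetical; the two strings it looks for are taken from the sample call stack above.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the console text read from stdin
# contains both an XFS error report and a kernel panic line.
has_xfs_panic_signature() {
    # Buffer stdin so it can be scanned twice.
    console_text=$(cat)
    printf '%s\n' "$console_text" | grep -q 'xfs_error_report' || return 1
    printf '%s\n' "$console_text" | grep -q 'Kernel panic' || return 1
    return 0
}

# Demonstration using two lines from the sample output in this topic.
sample='RIP [<ffffffff883cf607>] :xfs:xfs_error_report+0xf/0x58
<0> Kernel panic - not syncing - Fatal exception'

if printf '%s\n' "$sample" | has_xfs_panic_signature; then
    echo "XFS panic signature found"
fi
```

With a captured log, the equivalent call would be `has_xfs_panic_signature < console.log`. A match only suggests that XFS was involved in the panic; it does not identify the affected partition.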

The CA Multi-Port Monitor appliance uses the high-performance Linux XFS file system on two partitions: /nqxfs and /data.

XFS file system corruption typically occurs when the appliance experiences a power outage or hardware hang.
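To confirm which mounted partitions use XFS, the mount table can be filtered by file system type. This is a minimal sketch, assuming the standard Linux `/proc/mounts` layout (device, mount point, type, options); the function name `list_xfs_mounts` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads mount-table lines from stdin and prints the
# mount point of every entry whose file system type (third field) is xfs.
list_xfs_mounts() {
    awk '$3 == "xfs" { print $2 }'
}

# On a running Linux system, list the XFS mount points currently in use.
if [ -r /proc/mounts ]; then
    list_xfs_mounts < /proc/mounts
fi
```

On an appliance that matches this topic, the list would be expected to include /nqxfs and /data.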

The Linux kernel panic is most likely to occur on the /nqxfs partition shortly after you restart the appliance, when the Vertica metrics database starts. In the following example, the terminal display shows an XFS call stack and kernel panic. The affected partition may not be displayed, but you can safely run xfs_repair on both XFS partitions (/nqxfs and /data) to ensure that all XFS file system corruption is repaired:

NetQoS--MTP--Recover XFS File System Corruption

Repair an XFS file system to resolve corruption on that file system. If the corruption occurred on the /nqxfs partition, where the Vertica metrics database resides, also recreate the Vertica metrics database.

More information:

Shut Down or Restart the Appliance

Repair XFS File System Corruption

Applies to CA6000 and CA6300 appliances

Repair a damaged or corrupt XFS file system by running the xfs_repair command on the affected partition. After you repair XFS file system corruption on the /nqxfs partition, recreate the Vertica metrics database.

Estimated time to complete XFS repair: 30-60 minutes

Follow these steps:

  1. If the Multi-Port Monitor terminal displays a kernel panic and system halt message, and is unresponsive, shut down the appliance by holding down the Power button for several seconds. Otherwise, shut down the appliance normally.
  2. Press the Power button to start the appliance.
  3. After the BIOS scans complete, the initial CentOS boot screen appears. Press any key before the countdown reaches zero to enter the boot menu.
  4. The default boot kernel will already be selected. Press a to modify kernel boot parameters.
  5. The cursor will be at the end of the line of kernel parameters. Add the parameter single to the end of the line, as shown in the example below, and press Enter:

    NetQoS--MTP--Add Single Parameter

  6. When the kernel finishes booting, a command prompt will be displayed. There is no login prompt as the system is running in single user mode.

    Note: In single user mode, the appliance can only be accessed from the terminal display.

  7. Unmount the affected partition and run xfs_repair on its block device. Use the commands that match your appliance model and partition:

    /nqxfs partition for CA6300:

        umount /nqxfs
        xfs_repair /dev/sdb1

    /data partition for CA6300:

        umount /data
        xfs_repair /dev/sdb2

    /nqxfs partition for CA6000:

        umount /nqxfs
        xfs_repair /dev/sda4

    /data partition for CA6000:

        umount /data
        xfs_repair /dev/sdb1

  8. In each case, a successful repair produces output similar to the following:
    Phase 1 - find and verify superblock...
    Phase 2 - zero log...
            - scan file system freespace and inode maps...
            - found root inode chunk
    Phase 3 - for each AG...
            - scan and clear agi unlinked lists...
            - process known inodes and perform inode discovery...
            - agno = 0
            - agno = 1
            ...
            - process newly discovered inodes...
    Phase 4 - check for duplicate blocks...
            - setting up duplicate extent list...
            - clear lost+found (if it exists) ...
            - clearing existing "lost+found" inode
            - deleting existing "lost+found" entry
            - check for inodes claiming duplicate blocks...
            - agno = 0
    imap claims in-use inode 242000 is free, correcting imap
            - agno = 1
            - agno = 2
            ...
    Phase 5 - rebuild AG headers and trees...
            - reset superblock counters...
    Phase 6 - check inode connectivity...
            - ensuring existence of lost+found directory
            - traversing file system starting at / ... 
            - traversal finished ... 
            - traversing all unattached subtrees ... 
            - traversals finished ... 
            - moving disconnected inodes to lost+found ... 
    disconnected inode 242000, moving to lost+found	
    Phase 7 - verify and correct link counts...
    Done
    
    
  9. Enter reboot to leave single user mode and restart the appliance.
  10. Assess whether the XFS repair has returned the partition to normal operations.

    When restarting the appliance, the partition should no longer trigger a Linux kernel panic.

  11. If you repaired the /nqxfs partition, you must also recreate the Vertica metrics database, which is hosted on that partition.
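The model and partition combinations in step 7 can be sketched as a small lookup. The function names `xfs_block_device` and `print_repair_commands` are hypothetical; the device names are copied from step 7 and are not detected from the hardware. The sketch only prints the commands, so that the operator can review them before typing them at the single user mode prompt.

```shell
#!/bin/sh
# Hypothetical helper: prints the block device that step 7 lists for an
# appliance model ($1) and XFS mount point ($2).
xfs_block_device() {
    case "$1:$2" in
        CA6300:/nqxfs) echo /dev/sdb1 ;;
        CA6300:/data)  echo /dev/sdb2 ;;
        CA6000:/nqxfs) echo /dev/sda4 ;;
        CA6000:/data)  echo /dev/sdb1 ;;
        *)
            echo "unknown model or partition: $1 $2" >&2
            return 1
            ;;
    esac
}

# Hypothetical helper: prints (does not run) the step 7 commands for a
# model and mount point.
print_repair_commands() {
    device=$(xfs_block_device "$1" "$2") || return 1
    echo "umount $2"
    echo "xfs_repair $device"
}

# Example: show the commands for the /nqxfs partition on a CA6300.
print_repair_commands CA6300 /nqxfs
```

Printing rather than executing is a deliberate choice here: a wrong model argument would otherwise point xfs_repair at the wrong device.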

More information:

Shut Down or Restart the Appliance

Assess and Recover XFS File System Corruption