SkyCluster Health Checker

The SkyCluster Health Checker tool checks multiple points across storage, network, OS, and other components to identify errors, misconfigurations, and risk items. It can be run on any node of the cluster and performs the checks on all nodes in the cluster. Running the tool is non-disruptive, so it can be used on a live cluster.

# skycluster-health-check -h
usage: skycluster-health-check [-h] [--version] [command] ...

FlashGrid HealthCheck CLI

optional arguments:
  -h, --help           show this help message and exit
  --version            show program's version number and exit

Commands:
  [command]              Default: show
    show                 Show cluster status
    reset-rpm-list       Reset rpm list
    reset-services-list  Reset services list

SkyCluster Health Checker performs the following checks:

  • ASM DiskGroup status - checks mount status, redundancy, total and free space (MiB), offline and lost disks, resync and readlocal status, voting file info
  • Available memory - checks that available memory is above 20%: $ cat /proc/meminfo | grep Available
  • Flashgrid CLAN check - verifies the status of the flashgrid-clan service
  • Flashgrid logs check - examines alerts in the node_monitor log from the last week
  • Free system disk space - checks that free disk space is above 30% on the / and /u01 (if present) mounts on all nodes
  • Kernel taint check - checks for suspicious errors and warnings (Oops, "process stuck for 120 seconds", etc.) in various logs from the last week or since the last reboot
  • SF node status - checks whether the flashgrid-node status is good
  • Storage Fabric cluster verification status - shows the status of the flashgrid-cluster verify command
  • Swap disabled - checks that swap is disabled
  • System config file modifications - detects changes to critical configuration files since install or since the last boot
  • System services - checks for and lists failed system services
  • Unexpected or 3rd party RPMs installed - checks whether non-standard RPMs were installed after the tool was installed or the RPM list was last reset.
    • # skycluster-health-check reset-rpm-list regenerates the list of installed RPMs
  • Unexpected or 3rd party services enabled - checks whether non-standard services were enabled after the tool was installed or the services list was last reset.
    • # skycluster-health-check reset-services-list regenerates the list of enabled services
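
To reproduce the available memory figure by hand on a single node, the percentage can be derived from the MemTotal and MemAvailable fields of /proc/meminfo. The following is a minimal sketch, not the tool's own implementation: the function names avail_mem_pct and mem_status are hypothetical, and the assumption that the tool computes the percentage as MemAvailable divided by MemTotal is ours. Only the 20% threshold comes from the list above.

```shell
# Hypothetical helpers for illustration; not part of skycluster-health-check.

# Print MemAvailable as a percentage of MemTotal with one decimal place.
# Reads /proc/meminfo unless another file is given as the first argument.
avail_mem_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%.1f\n", avail * 100 / total }' \
        "${1:-/proc/meminfo}"
}

# Print OK or WARNING using the "above 20%" threshold from the check list.
mem_status() {
    pct=$(avail_mem_pct "$1")
    if awk -v p="$pct" 'BEGIN { exit !(p > 20) }'; then
        echo "OK : avail mem: ${pct}%"
    else
        echo "WARNING : avail mem: ${pct}%"
    fi
}
```

Calling mem_status with no argument evaluates the local node. Passing a file name evaluates a saved copy of /proc/meminfo instead.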

Sample report from a two-node cluster (rac1 and rac2) with a quorum node (racq):

# skycluster-health-check

HealthCheck 20.2.28.65974 #6de6f20c8bdf0379d365d9e17331bf5d8fd0a059
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Check: ASM DiskGroup status
    rac1: OK
        --------------------------------------------------------------------------------------------------------
        GroupName  Status  Mounted   Type    TotalMiB  FreeMiB  OfflineDisks  LostDisks  Resync  ReadLocal  Vote
        --------------------------------------------------------------------------------------------------------
        DATA       Good    AllNodes  NORMAL  40960     22840    0             0          No      Enabled    None
        GRID       Good    AllNodes  NORMAL  10240     9464     0             0          No      Enabled    3/3
        --------------------------------------------------------------------------------------------------------
    rac2: OK
        --------------------------------------------------------------------------------------------------------
        GroupName  Status  Mounted   Type    TotalMiB  FreeMiB  OfflineDisks  LostDisks  Resync  ReadLocal  Vote
        --------------------------------------------------------------------------------------------------------
        DATA       Good    AllNodes  NORMAL  40960     22840    0             0          No      Enabled    None
        GRID       Good    AllNodes  NORMAL  10240     9464     0             0          No      Enabled    3/3
        --------------------------------------------------------------------------------------------------------
    racq: OK
        --------------------------------------------------------------------------------------------------------
        GroupName  Status  Mounted   Type    TotalMiB  FreeMiB  OfflineDisks  LostDisks  Resync  ReadLocal  Vote
        --------------------------------------------------------------------------------------------------------
        DATA       Good    AllNodes  NORMAL  40960     22840    0             0          No      Enabled    None
        GRID       Good    AllNodes  NORMAL  10240     9464     0             0          No      Enabled    3/3
        --------------------------------------------------------------------------------------------------------

Check: Available memory
    rac1: OK : avail mem: 20.1%
    rac2: WARNING : avail mem: 19.1%
    racq: OK : avail mem: 69.9%

Check: Flashgrid CLAN check
    rac1: OK
    rac2: OK
    racq: OK

Check: Flashgrid logs check
    rac1: WARNING : /opt/flashgrid-diags/log/node_monitor-error.log: 16 alerts
    rac2: OK
    racq: WARNING : /opt/flashgrid-diags/log/node_monitor-error.log: 8 alerts

Check: Free system disk space
    rac1: OK : /u01: avail 67%, /: avail 88%
    rac2: OK : /u01: avail 67%, /: avail 88%
    racq: OK : /: avail 88%

Check: Kernel taint check
    rac1: OK
    rac2: OK
    racq: OK

Check: SF node status
    rac1: OK
    rac2: OK
    racq: OK

Check: Storage Fabric cluster verification status
    rac1: OK
    rac2: OK
    racq: OK

Check: Swap disabled
    rac1: OK : Swap disabled
    rac2: OK : Swap disabled
    racq: OK : Swap disabled

Check: System config file modifications
    rac1: WARNING
        /etc/resolv.conf    modified since last boot
        /etc/racdns modified since install
    rac2: OK
    racq: WARNING : /etc/racdns modified since install

Check: System services
    rac1: OK
    rac2: OK
    racq: OK

Check: Unexpected or 3rd party RPMs installed
    rac1: OK
    rac2: OK
    racq: OK

Check: Unexpected or 3rd party services enabled
    rac1: OK
    rac2: OK
    racq: OK
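
On larger clusters it can help to reduce the report to the results that need attention. The following is a minimal sketch of such a filter, assuming the report layout matches the sample above ("Check:" headings at the start of a line, per-node results indented by four spaces, detail lines indented further). The function name non_ok_checks is hypothetical and the filter is not part of the tool.

```shell
# Hypothetical filter for illustration; assumes the sample report layout.
# Reads a report on stdin and prints "check | node | status" for every
# node result whose status is not OK.
non_ok_checks() {
    awk '
        # Remember the name of the check currently being reported.
        /^Check: / { check = substr($0, 8); next }
        # Node result lines are indented by exactly four spaces.
        /^    [^ ]/ {
            node = $1
            sub(/:$/, "", node)
            if ($2 != "OK") printf "%s | %s | %s\n", check, node, $2
        }
    '
}
```

A typical invocation would be: # skycluster-health-check | non_ok_checks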