SkyCluster Health Checker

The SkyCluster Health Checker tool checks multiple points across storage, network, OS, and other components to identify errors, misconfigurations, and risk items. The tool can be run on any node of the cluster and checks all nodes in the cluster. Running the tool is non-disruptive, so it can be used on a live cluster.

# skycluster-health-check -h
usage: skycluster-health-check [-h] [--version] [command] ...

FlashGrid HealthCheck CLI

optional arguments:
  -h, --help           show this help message and exit
  --version            show program's version number and exit

Commands:
  [command]               Default: show
    show                  Show cluster status
    reset-rpm-list        Reset rpm list
    reset-cfg-list        Reset list of cfg files
    reset-services-list   Reset services list
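
Because the default show command prints a plain-text report, its output can be piped to standard text tools. The following is a minimal sketch, not part of the tool, that lists every check and node that reported WARNING. It assumes the layout seen in the sample report in this document ("Check: <name>" headers followed by indented "<node>: <STATUS>" lines); list_warnings is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: list "check -> node" pairs that reported WARNING in a report.
# Assumes "Check: <name>" header lines followed by indented
# "<node>: <STATUS>" lines, as in the sample report.

# list_warnings reads a report on stdin and prints one line per warning.
list_warnings() {
    awk '
        # Remember the name of the check currently being read.
        /^Check: / { check = substr($0, 8); next }
        # Indented "<node>: WARNING" lines mark a failed check on that node.
        /^[ \t]+[^ \t:]+: WARNING/ {
            node = $1
            sub(/:$/, "", node)
            print check " -> " node
        }
    '
}
```

For example, running skycluster-health-check | list_warnings on a cluster node would print one "check -> node" line per warning, which may be convenient for a quick overview of a long report.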

SkyCluster Health Checker performs the following checks:

  • ASM DiskGroup status - checks mount status, redundancy, total and free MiB, offline and lost disks, resync and readlocal status, and voting file info
  • Alerts in Storage Fabric logs in the last 7 days - checks for alerts in log files under the /opt/flashgrid/log and /opt/flashgrid-diags/log directories
  • Available memory - checks that available memory is >20% (to see the current value: $ cat /proc/meminfo | grep Available)
  • Check db memory settings - shows database memory-related parameters, such as memory_max_target, memory_target, sga_max_size, pga_aggregate_target, and pga_aggregate_limit (db v12.1 or higher). Also checks the total memory allocation across all databases.
  • Check local_listener for each db - checks that LOCAL_LISTENER = 'NodeFQDN'
  • Check tnsnames.ora - confirms that the tnsnames.ora entries for the DONOTDELETE,NODEFQDN alias are correct (unmodified).
  • Flashgrid CLAN check - verifies the status of the flashgrid-clan service
  • Free system disk space - checks that free disk space is >30% on the / and /u01 (if it exists) mounts on all nodes
  • Kernel taint check - checks for suspicious errors/warnings (Oops, "process stuck for 120 seconds", etc.) in various logs (over the last week or since the last reboot)
  • SF node status - checks that the flashgrid-node status is good
  • Storage Fabric cluster verification status - shows the status of the flashgrid-cluster verify command
  • Swap disabled - checks that swap is disabled
  • System config file modifications - detects changes in critical cfg files since install or last boot
    • # skycluster-health-check reset-cfg-list resets the list of cfg files
  • System services - checks for and lists failed system services
  • Unexpected or 3rd party RPMs installed - checks whether non-standard RPMs were installed after the tool was installed or the list was reset.
    • # skycluster-health-check reset-rpm-list regenerates the list of installed RPMs
  • Unexpected or 3rd party services enabled - checks for non-standard services enabled after the tool was installed or the list was reset.
    • # skycluster-health-check reset-services-list regenerates the list of enabled services
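
To spot-check a single node by hand, three of the simpler checks above can be approximated with standard Linux tools. The sketch below is not the tool's implementation; only the 20% and 30% thresholds come from the check descriptions. The helper names (avail_mem_pct, free_disk_pct, swap_disabled) are hypothetical, and the use of df -P and /proc/swaps as data sources is an assumption.

```shell
#!/bin/sh
# Approximate stand-ins for three of the checks listed above.
# Each helper reads its input on stdin so it can be tried on saved output.

# Available memory as a percentage of total, from /proc/meminfo-style input.
avail_mem_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%.1f\n", avail * 100 / total }'
}

# Free space percentage for one mount, from `df -P <mount>` output.
# Column 5 of the data line is the used percentage, e.g. "41%".
free_disk_pct() {
    awk 'NR == 2 { gsub(/%/, "", $5); print 100 - $5 }'
}

# Succeeds when no swap area is active, from /proc/swaps-style input
# (one header line followed by one line per active swap area).
swap_disabled() {
    awk 'NR > 1 && NF > 0 { active++ } END { exit (active > 0 ? 1 : 0) }'
}
```

Typical use on a node would be avail_mem_pct < /proc/meminfo (compare the result against 20), df -P /u01 | free_disk_pct (compare against 30), and swap_disabled < /proc/swaps && echo "Swap disabled". The percentages may differ slightly from what the tool reports, since df rounds its used percentage.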

Sample report from a two-node cluster (rac1 and rac2) with a quorum node (racq):

# skycluster-health-check
HealthCheck 20.4.33.71909 #4077668bfae738b12c9ddc900f2262693c85c566
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Check: ASM DiskGroup status
    rac1: WARNING
        ---------------------------------------------------------------------------------------------------------
        GroupName  Status   Mounted   Type    TotalMiB  FreeMiB  OfflineDisks  LostDisks  Resync  ReadLocal  Vote
        ---------------------------------------------------------------------------------------------------------
        DATA       Good     AllNodes  NORMAL  40960     27224    0             0          No      Enabled    None
        FRA        Warning  AllNodes  NORMAL  30720     30381    0             0          Yes     Enabled    None
        GRID       Good     AllNodes  NORMAL  10240     9288     0             0          No      Enabled    3/3
        ---------------------------------------------------------------------------------------------------------
    rac2: WARNING
        ---------------------------------------------------------------------------------------------------------
        GroupName  Status   Mounted   Type    TotalMiB  FreeMiB  OfflineDisks  LostDisks  Resync  ReadLocal  Vote
        ---------------------------------------------------------------------------------------------------------
        DATA       Good     AllNodes  NORMAL  40960     27224    0             0          No      Enabled    None
        FRA        Warning  AllNodes  NORMAL  30720     30381    0             0          Yes     Enabled    None
        GRID       Good     AllNodes  NORMAL  10240     9288     0             0          No      Enabled    3/3
        ---------------------------------------------------------------------------------------------------------

Check: Alerts in Storage Fabric logs in the last 7 days
    rac1: WARNING : /opt/flashgrid/log/fg-cluster-error.log: 53 alerts
    rac2: WARNING : /opt/flashgrid/log/fg-cluster-error.log: 54 alerts
    racq: WARNING : /opt/flashgrid/log/fg-cluster-error.log: 53 alerts

Check: Available memory
    rac1: WARNING : avail mem: 15.4%
    rac2: OK : avail mem: 27.7%
    racq: OK : avail mem: 75.5%

Check: Check db memory settings
    rac1: WARNING
        All DBs: sum(pga_aggregate_limit) + max(HugePages, sum(sga_max_size)) >= TotalMemory - 12 GiB
               :   sum(pga_aggregate_limit) = 4 GiB
               :   HugePages = 17 GiB
               :   sum(sga_max_size) = 0 GiB
               :   TotalMemory = 31 GiB
    rac2: WARNING : Failed to query the db instance 'orcl2'. Check that it is running.

Check: Check local_listener for each db
    rac1: OK
    rac2: WARNING : Failed to query the db instance 'orcl2'. Check that it is running.

Check: Check tnsnames.ora
    rac1: OK : Warning: Multiple listener endpoints detected. Skipping tnsnames.ora check.
    rac2: OK : Warning: Multiple listener endpoints detected. Skipping tnsnames.ora check.

Check: Flashgrid CLAN check
    rac1: OK
    rac2: OK
    racq: OK

Check: Free system disk space
    rac1: OK : /u01: avail 59%, /: avail 85%
    rac2: OK : /u01: avail 60%, /: avail 89%
    racq: OK : /: avail 89%

Check: Kernel taint check
    rac1: OK
    rac2: OK
    racq: OK

Check: SF node status
    rac1: OK
    rac2: OK
    racq: OK

Check: Storage Fabric cluster verification status
    rac1: OK
    rac2: OK
    racq: OK

Check: Swap disabled
    rac1: OK : Swap disabled
    rac2: OK : Swap disabled
    racq: OK : Swap disabled

Check: System config file modifications
    rac1: WARNING
        Checksum file not found, using fg_setup.log modification time instead.
        /etc/dnsmasq.conf   modified since install
    rac2: WARNING
        Checksum file not found, using fg_setup.log modification time instead.
        /etc/sysconfig/iptables modified since install
    racq: WARNING
        Checksum file not found, using fg_setup.log modification time instead.
        /etc/ssh/sshd_config    modified since install

Check: System services
    rac1: OK
    rac2: OK
    racq: OK

Check: Unexpected or 3rd party RPMs installed
    rac1: OK
    rac2: OK
    racq: WARNING : telnet

Check: Unexpected or 3rd party services enabled
    rac1: OK
    rac2: OK
    racq: OK
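
The rac1 warning under "Check db memory settings" in the report above can be reproduced with simple arithmetic. The sketch below reads the printed inequality as the condition that triggers the warning, which is an interpretation of the sample output rather than a documented rule. The function name db_memory_status is hypothetical, and all values are whole GiB as shown in the report.

```shell
#!/bin/sh
# Worked arithmetic for the rac1 "Check db memory settings" warning.
# Condition (as printed in the report):
#   sum(pga_aggregate_limit) + max(HugePages, sum(sga_max_size))
#       >= TotalMemory - 12 GiB

# Usage: db_memory_status PGA_LIMIT_SUM HUGEPAGES SGA_MAX_SUM TOTAL_MEMORY
# The body runs in a subshell so its variables stay local.
db_memory_status() (
    pga=$1 hugepages=$2 sga=$3 total=$4
    reserve=12    # the "12 GiB" term from the report

    # max(HugePages, sum(sga_max_size))
    if [ "$hugepages" -gt "$sga" ]; then shared=$hugepages; else shared=$sga; fi

    used=$((pga + shared))
    limit=$((total - reserve))
    if [ "$used" -ge "$limit" ]; then
        echo "WARNING: $used GiB >= $limit GiB"
    else
        echo "OK: $used GiB < $limit GiB"
    fi
)

# Values reported for rac1: 4 GiB PGA limit, 17 GiB HugePages,
# 0 GiB SGA, 31 GiB total memory.
db_memory_status 4 17 0 31    # prints "WARNING: 21 GiB >= 19 GiB"
```

With the rac1 values, 4 + max(17, 0) = 21 GiB, which is not below 31 - 12 = 19 GiB, so the node is flagged.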