The Geek Diary

HowTos | Basics | Concepts

Troubleshooting Oracle RAC Node Evictions (Reboots) [ 11.2 and above ]

By admin

This post provides a reference for troubleshooting Clusterware node evictions in versions 11.2 and above.

NODE EVICTION OVERVIEW

The Oracle Clusterware is designed to perform a node eviction by removing one or more nodes from the cluster if some critical problem is detected. A critical problem could be a node not responding via a network heartbeat, a node not responding via a disk heartbeat, a hung or severely degraded machine, or a hung ocssd.bin process. The purpose of this node eviction is to maintain the overall health of the cluster by removing bad members.

Starting with 11.2.0.2 (or on Exadata), a node eviction may not actually reboot the machine. This is called a rebootless restart: most of the Clusterware stack is restarted to see if that fixes the unhealthy node.

PROCESS ROLES FOR REBOOTS

OCSSD (aka CSS daemon) – This process is spawned by the cssdagent process. It runs in both vendor clusterware and non-vendor clusterware environments. OCSSD’s primary job is internode health monitoring and RDBMS instance endpoint discovery. The health monitoring includes a network heartbeat and a disk heartbeat (to the voting files). OCSSD can also evict a node after escalation of a member kill from a client (such as a database LMON process). This is a multi-threaded process that runs at an elevated priority and runs as the Oracle user.

Startup sequence: 
INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdagent --> ocssd --> ocssd.bin

CSSDAGENT – This process is spawned by OHASD and is responsible for spawning the OCSSD process, monitoring for node hangs (via oprocd functionality), monitoring the OCSSD process for hangs (via oclsomon functionality), and monitoring vendor clusterware (via vmon functionality). This is a multi-threaded process that runs at an elevated priority and runs as the root user.

Startup sequence: 
INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdagent

CSSDMONITOR – This process also monitors for node hangs (via oprocd functionality), monitors the OCSSD process for hangs (via oclsomon functionality), and monitors vendor clusterware (via vmon functionality). This is a multi-threaded process that runs at an elevated priority and runs as the root user.

Startup sequence: 
INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdmonitor
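
The startup chains above can be sanity-checked on a live node. A minimal sketch (the `ps` filter and fallback message are illustrative, not from the original post):

```shell
# Hedged sketch: list any of the daemons named above with their parent PIDs,
# to confirm the INIT -> ohasd -> agent -> ocssd chain on a live node.
# On a host without clusterware, a fallback message is printed instead.
daemons=$(ps -eo pid,ppid,user,comm | grep -E 'ohasd|ocssd|cssdagent|cssdmonitor' || true)
msg=${daemons:-"no clusterware daemons running on this host"}
printf '%s\n' "$msg"
```

Following the PPID column back through the output reproduces the sequence shown above.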

Determining which process is responsible for the reboot

Important files to review:

  • The Clusterware alert log
  • The cssdagent log(s)
  • The cssdmonitor log(s)
  • The ocssd log(s)
  • The lastgasp log(s) in /etc/oracle/lastgasp or /var/opt/oracle/lastgasp
  • CHM or OS Watcher data
  • 'opatch lsinventory -detail' output for the GRID home
  • Messages files

Messages file locations:

  • Linux: /var/log/messages
  • Sun: /var/adm/messages
  • HP-UX: /var/adm/syslog/syslog.log
  • IBM: /bin/errpt -a > messages.out
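
The platform-to-file mapping above can be expressed as a small helper. A sketch, where the `messages_path` function name is hypothetical (AIX has no single messages file, so the errpt command from the list is echoed instead):

```shell
# Hypothetical helper: map the platform reported by uname -s to the
# messages file locations listed above.
messages_path() {
  case "$1" in
    Linux)  echo /var/log/messages ;;
    SunOS)  echo /var/adm/messages ;;
    HP-UX)  echo /var/adm/syslog/syslog.log ;;
    AIX)    echo "use: /bin/errpt -a > messages.out" ;;
    *)      echo unknown ;;
  esac
}
messages_path "$(uname -s)"
```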

11.2 Clusterware evictions should, in most cases, have some kind of meaningful error in the clusterware alert log. This can be used to determine which process is responsible for the reboot. Example message from a clusterware alert log:

[ohasd(11243)]CRS-8011:reboot advisory message from host: sta00129, component: cssagent, with timestamp: L-2009-05-05-10:03:25.340
[ohasd(11243)]CRS-8013:reboot advisory message text: Rebooting after limit 28500 exceeded; disk timeout 27630, network timeout 28500, last heartbeat from CSSD at epoch seconds 1241543005.340, 4294967295 milliseconds ago based on invariant clock value of 93235653

This particular eviction occurred because the network timeout was hit. CSSD exited and the cssdagent took action to evict. The cssdagent knows the information in the error message from local heartbeats made by CSSD.
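
A quick way to locate such messages is to grep the alert log for the CRS-8011/CRS-8013 reboot advisory codes. A sketch; the log path here is a temporary copy seeded with the sample line above, so substitute your node's actual clusterware alert log:

```shell
# Sketch: search an alert log for reboot advisory messages to identify the
# component that initiated the action. /tmp/alert_sample.log is illustrative.
alert=/tmp/alert_sample.log
cat > "$alert" <<'EOF'
[ohasd(11243)]CRS-8011:reboot advisory message from host: sta00129, component: cssagent, with timestamp: L-2009-05-05-10:03:25.340
EOF
hits=$(grep -cE 'CRS-801[13]' "$alert")
echo "reboot advisory messages found: $hits"
```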

If no message is in the evicted node’s clusterware alert log, check the lastgasp logs on the local node and/or the clusterware alert logs of other nodes.

Troubleshooting OCSSD evictions

If you have encountered an OCSSD eviction, review the common causes in the section below.

COMMON CAUSES OF OCSSD EVICTIONS

  • Network failure or latency between nodes. By default, it takes 30 consecutive missed check-ins (determined by the CSS misscount setting) to cause a node eviction.
  • Problems writing to or reading from the CSS voting disk. If the node cannot perform a disk heartbeat to the majority of its voting files, then the node will be evicted.
  • A member kill escalation. For example, the database LMON process may request CSS to remove an instance from the cluster via the instance eviction mechanism. If this times out, it can escalate to a node kill.
  • An unexpected failure or hang of the OCSSD process; this can be caused by any of the above issues or by something else.
  • An Oracle bug.

Files to review and gather for OCSSD evictions

All files from the section “Determining which process is responsible for the reboot”, from all cluster nodes. More data may be required.

Example of eviction due to “loss of voting disk”

CSS log:

2012-03-27 22:05:48.693: [ CSSD][1100548416](:CSSNM00018:)clssnmvDiskCheck: Aborting, 0 of 3 configured voting disks available, need 2
2012-03-27 22:05:48.693: [ CSSD][1100548416]###################################
2012-03-27 22:05:48.693: [ CSSD][1100548416]clssscExit: CSSD aborting from thread clssnmvDiskPingMonitorThread

OS messages:

Mar 27 22:03:58 choldbr132p kernel: Error:Mpx:All paths to Symm 000190104720 vol 0c71 are dead.
Mar 27 22:03:58 choldbr132p kernel: Error:Mpx:Symm 000190104720 vol 0c71 is dead.
Mar 27 22:03:58 choldbr132p kernel: Buffer I/O error on device sdbig, logical block 0
...
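
When the CSS log reports lost voting disks as above, their current status can be checked with `crsctl query css votedisk` (a standard 11.2+ command). A hedged sketch, guarded because crsctl is present only on a cluster node:

```shell
# Hedged sketch: check voting file availability, which the CSSNM00018
# message above reports as lost.
if command -v crsctl >/dev/null 2>&1; then
  vdisks=$(crsctl query css votedisk)
else
  vdisks="crsctl not found - run this on a cluster node"
fi
printf '%s\n' "$vdisks"
```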

Troubleshooting CSSDAGENT or CSSDMONITOR evictions

If you have encountered a CSSDAGENT or CSSDMONITOR eviction, review the common causes in the section below.

Common causes of CSSDAGENT or CSSDMONITOR evictions

  • An OS scheduler problem. For example, the OS may be locked up in a driver or in hardware, or there may be excessive load on the machine (at or near 100% CPU utilization), preventing the scheduler from behaving reasonably.
  • One or more threads within the CSS daemon are hung.
  • An Oracle bug.
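
The sustained near-100% CPU condition described in the first cause can be spotted by comparing the load average with the CPU count. A sketch, assuming Linux paths (/proc/loadavg):

```shell
# Sketch: flag load at or above the CPU count, the condition that can
# starve the scheduler and trigger a CSSDAGENT/CSSDMONITOR eviction.
cpus=$(getconf _NPROCESSORS_ONLN)
load1=$(awk '{print $1}' /proc/loadavg)
echo "cpus=$cpus load1=$load1"
if awk -v l="$load1" -v c="$cpus" 'BEGIN { exit !(l >= c) }'; then
  echo "WARNING: load at or above CPU count"
fi
```

CHM or OS Watcher data (listed earlier) gives the same picture historically, which matters because the node is often unreachable during the event itself.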

Files to review and gather for CSSDAGENT or CSSDMONITOR evictions

All files from the section “Determining which process is responsible for the reboot”, from all cluster nodes. More data may be required.

Filed Under: oracle, RAC

© 2019 · The Geek Diary