Pacemaker Resource showing as "unrunnable start (blocked)" for each cluster resource

tooreply

OS and Pacemaker Versions:

PRETTY_NAME="SUSE Linux Enterprise Server 12 SP4"
CPE_NAME="cpe:/o:suse:sles_sap:12:sp4"

pacemakerd --version
Pacemaker 1.1.19+20181105.ccd6b5b10-3.16.1

corosync -v
Corosync Cluster Engine, version '2.3.6'

When the cluster resources are running on glbgvldbies01, everything is fine. But when glbgvldbies01 is rebooted or goes down for some reason, the resources are not able to start on glbgvldbies02. We get the following error:

Resource showing as "unrunnable start (blocked)"

CRM CONFIG:

node 1: glbgvldbies02

node 2: glbgvldbies01 \
    attributes maintenance=off

primitive nfs_filesystem Filesystem \
    params device="10.10.10.15:/Database" directory="/usr/sap/IAV" fstype=nfs options="rw,vers=4,minorversion=1,hard,timeo=600,rsize=262144,wsize=262144,intr,noatime,lock,_netdev,sec=sys" \
    op stop timeout=30s interval=0 \
    op monitor interval=10s timeout=20s \
    op start timeout=30s interval=0 \
    meta target-role=Started migration-threshold=1

primitive azure_lb_health_probe azure-lb \
    params port=61000

primitive pri-ip_vip_10.4 IPaddr2 \
    params ip=10.10.10.4 cidr_netmask=24 nic=eth0 \
    op monitor interval=0

primitive pri-javid_systemd systemd:javid \
    op start timeout=60 interval=0 \
    op stop timeout=60 interval=0 \
    op monitor timeout=60 interval=10

primitive stonith-sbd stonith:external/sbd \
    params pcmk_delay_max=30s \
    meta target-role=Started \
    op start interval=0

group grp-javid_systemd-azure_lb-vip_63 pri-javid_systemd azure_lb_health_probe pri-ip_vip_63 \
    meta target-role=Started migration-threshold=1

colocation col_grp-sqpiq_nfs-filesystem inf: grp-javid_systemd-azure_lb-vip_63 nfs_filesystem

order ord_grp-sqpiq_nfs-filesystem inf: nfs_filesystem grp-javid_systemd-azure_lb-vip_63

property cib-bootstrap-options: \
    have-watchdog=false \
    dc-version="1.1.19+20181105.ccd6b5b10-3.16.1-1.1.19+20181105.ccd6b5b10" \
    cluster-infrastructure=corosync \
    cluster-name=cls_iqdb_iav \
    stonith-enabled=true \
    last-lrm-refresh=1589290714 \
    no-quorum-policy=ignore

rsc_defaults rsc-options: \
    resource-stickiness=1000

op_defaults op-options: \
    timeout=600 \
    record-pending=true
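
One thing that stands out in the configuration as posted: the group lists a member named pri-ip_vip_63, while the only IPaddr2 primitive shown is pri-ip_vip_10.4. This may simply be an artifact of how the names were edited before posting. A minimal sketch for checking such mismatches is below; the function names sample_config and undefined_group_members are made up for illustration, and the sample input is the posted configuration reduced to resource names.

```shell
#!/bin/sh
# Resource definitions from the post, reduced to names (parameters omitted for brevity).
sample_config() {
cat <<'EOF'
primitive nfs_filesystem Filesystem
primitive azure_lb_health_probe azure-lb
primitive pri-ip_vip_10.4 IPaddr2
primitive pri-javid_systemd systemd:javid
primitive stonith-sbd stonith:external/sbd
group grp-javid_systemd-azure_lb-vip_63 pri-javid_systemd azure_lb_health_probe pri-ip_vip_63 \ meta target-role=Started migration-threshold=1
EOF
}

# Print "group: member" for every group member that has no "primitive" definition.
undefined_group_members() {
  awk '
    $1 == "primitive" { defined[$2] = 1 }
    $1 == "group" {
      # Members follow the group name; stop at a line continuation or attribute keyword.
      for (i = 3; i <= NF; i++) {
        if ($i == "\\" || $i == "meta" || $i == "params") break
        members[$i] = $2
      }
    }
    END {
      for (m in members) if (!(m in defined)) print members[m] ": " m
    }
  '
}

sample_config | undefined_group_members
```

On a cluster node the same function could presumably be fed from "crm configure show" instead of the sample text. That usage is an assumption and has not been checked against this cluster.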

Pacemaker Log:

https://drive.google.com/open?id=1qfRkr0708JKJlc8Nv7TDsNjLyVZ1f7TI
 
Today we observed a few more things during testing. The steps were as follows:

Step 1: When we simulate a kernel panic on Node01, the node where the resources are running, with the command "echo 'b' > /proc/sysrq-trigger" or "echo 'c' > /proc/sysrq-trigger", the cluster detects the change but is unable to start any resources (except SBD) on the other active node.

Step 2: The logs show the following messages:

pengine: info: LogActions: Leave stonith-sbd (Started node02)

pengine: notice: LogAction: * Start pri-javaiq (node02 ) due to unrunnable nfs_filesystem start (blocked)

pengine: notice: LogAction: * Start lb_health_probe (node02 ) due to unrunnable nfs_filesystem start (blocked)

pengine: notice: LogAction: * Start pri-ip_vip (node02 ) due to unrunnable nfs_filesystem start (blocked)

pengine: notice: LogAction: * Start nfs_filesystem (node02 ) blocked
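
To make the dependency chain in these messages easier to see, here is a small sketch that lists each blocked Start action together with the reason the policy engine gives. The function names sample_log and blocked_starts are made up for illustration, and the input is just the five lines quoted above.

```shell
#!/bin/sh
# Sample lines copied from the post; on a real node the input would be the Pacemaker log.
sample_log() {
cat <<'EOF'
pengine: info: LogActions: Leave stonith-sbd (Started node02)
pengine: notice: LogAction: * Start pri-javaiq (node02 ) due to unrunnable nfs_filesystem start (blocked)
pengine: notice: LogAction: * Start lb_health_probe (node02 ) due to unrunnable nfs_filesystem start (blocked)
pengine: notice: LogAction: * Start pri-ip_vip (node02 ) due to unrunnable nfs_filesystem start (blocked)
pengine: notice: LogAction: * Start nfs_filesystem (node02 ) blocked
EOF
}

# Print "resource<TAB>reason" for every Start action that is marked as blocked.
blocked_starts() {
  awk '/LogAction: \* Start / && /blocked/ {
    # The resource name is the field right after the word "Start".
    for (i = 1; i <= NF; i++) if ($i == "Start") { rsc = $(i + 1); break }
    reason = "blocked directly"
    # Keep the text after "due to " when the block is inherited from another resource.
    if (match($0, /due to unrunnable [^ ]+ start/)) {
      reason = substr($0, RSTART + 7, RLENGTH - 7)
    }
    print rsc "\t" reason
  }'
}

sample_log | blocked_starts
```

For the lines above, this shows three resources waiting on nfs_filesystem, and nfs_filesystem itself blocked with no further reason given in the log. On a cluster node the function could read the real log instead, for example "blocked_starts < /var/log/pacemaker.log"; that path is an assumption and differs between setups.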

Step 3: But when we execute "init 6" on the node where we simulated the kernel panic, the resources surprisingly start and run successfully on the other node.
 

