One of the nodes in my 19c cluster crashed, and afterwards the clusterware failed to start with this error:
[root@fzppon05vs1n ~]# crsctl start crs
CRS-41053: checking Oracle Grid Infrastructure for file permission issues
PRVG-11960 : Set user ID bit is not set for file "/u01/grid/12.2.0.3/bin/extjob" on node "fzppon05vs1n".
PRVG-2031 : Owner of file "/u01/grid/12.2.0.3/bin/extjob" did not match the expected value on node "fzppon05vs1n". [Expected = "root(0)" ; Found = "oracle(54321)"]
CRS-4124: Oracle High Availability Services startup failed.
CRS-4000: Command Start failed, or completed with errors.
That's weird, because the file mentioned in the error message already has the right ownership: it is supposed to be owned by the Grid owner (oracle in my setup), and it shouldn't be owned by root as the error message suggests:
[root@fzppon05vs1n ~]# ll /u01/grid/12.2.0.3/bin/extjob
-rwxr-xr-x 1 oracle oinstall 2.9M Mar 4 11:42 /u01/grid/12.2.0.3/bin/extjob
The permissions and ownership are the same on the other RAC node as well:
[oracle@fzppon06vs1n ~]$ ls -l /u01/grid/12.2.0.3/bin/extjob
-rwxr-xr-x 1 oracle oinstall 2.9M Mar 4 12:57 /u01/grid/12.2.0.3/bin/extjob
I tried stopping the clusterware on this node with the force option and starting it again, but this didn't help.
Before restarting the OS, I thought I'd check the clusterware background processes, and here is the catch:
[root@fzppon05vs1n ~]# ps -ef | grep -v grep| grep '\.bin'
root 19786 1 1 06:18 ? 00:00:39 /u01/grid/12.2.0.3/bin/ohasd.bin reboot
root 19788 1 0 06:18 ? 00:00:00 /u01/grid/12.2.0.3/bin/ohasd.bin reboot
root 19850 1 0 06:18 ? 00:00:13 /u01/grid/12.2.0.3/bin/orarootagent.bin
root 19958 1 0 06:18 ? 00:00:14 /u01/grid/12.2.0.3/bin/oraagent.bin
...
There is more than one ohasd.bin process running, while there is supposed to be only one.
Checking all ohasd-related processes:
[root@fzppon05vs1n ~]# ps -ef | grep -v grep | grep ohasd
root 1900 1 0 06:17 ? 00:00:00 /bin/sh /etc/init.d/init.ohasd run>/dev/null 2>&1 </dev/null
root 1947 1900 0 06:17 ? 00:00:00 /bin/sh /etc/init.d/init.ohasd run>/dev/null 2>&1 </dev/null
root 19786 1 1 06:18 ? 00:00:00 /u01/grid/12.2.0.3/bin/ohasd.bin reboot
root 19788 1 0 06:18 ? 00:00:00 /u01/grid/12.2.0.3/bin/ohasd.bin reboot
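The duplicate check can also be scripted. The following is only a minimal sketch, not an Oracle-provided tool: the function names count_ohasd_bin and check_ohasd are made up for illustration, and it assumes the input looks like the output of `ps -e -o pid= -o args=` (PID first, then the command line).

```shell
#!/bin/sh
# Hypothetical helpers (not part of Oracle Grid Infrastructure).
# Both read `ps -e -o pid= -o args=` style lines from stdin:
#   <pid> <executable path> [arguments...]

# Print how many lines have an executable path ending in /ohasd.bin.
count_ohasd_bin() {
    awk '$2 ~ /\/ohasd\.bin$/ { n++ } END { print n + 0 }'
}

# Warn (and return non-zero) when more than one ohasd.bin is present.
check_ohasd() {
    count=$(count_ohasd_bin)
    if [ "$count" -gt 1 ]; then
        echo "WARNING: $count ohasd.bin processes found, expected 1"
        return 1
    fi
    echo "OK: $count ohasd.bin process(es) found"
}

# On a live node, as root, it could be fed like this:
#   ps -e -o pid= -o args= | check_ohasd
```

Matching on the executable path rather than grepping the whole line keeps the init.ohasd wrapper scripts and the grep itself out of the count.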
Now, let's kill all ohasd processes and give it a try:
[root@fzppon05vs1n ~]# kill -9 1900 1947 19786 19788
Starting the clusterware again:
[root@fzppon05vs1n ~]# crsctl start crs
CRS-4123: Oracle High Availability Services has been started.
Voilà! Started up.
Conclusion:
The above error message may look vague, I know. Moreover, it may name a different file than extjob.
Don't rush to change the file's ownership as the error message suggests. First check for any redundant clusterware background processes and kill them, then try to start the clusterware. If this doesn't help, restart the node and check again for any redundant processes.
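To avoid typing PIDs by hand, the kill command can be built from the process list and reviewed before it is run. This is just a sketch under my own assumptions: ohasd_pids and print_kill_command are hypothetical names, the input is expected in `ps -e -o pid= -o args=` format, and the script only prints the command and never kills anything itself.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Oracle Grid Infrastructure).
# Both read `ps -e -o pid= -o args=` style lines from stdin.

# Print the PIDs of init.ohasd wrappers and ohasd.bin processes,
# space-separated on a single line (prints nothing if none match).
ohasd_pids() {
    awk '/init\.ohasd|\/ohasd\.bin/ { printf "%s%s", sep, $1; sep = " " }
         END { if (sep != "") print "" }'
}

# Dry run: print the kill command for review instead of executing it.
print_kill_command() {
    pids=$(ohasd_pids)
    if [ -n "$pids" ]; then
        echo "kill -9 $pids"
    else
        echo "no ohasd processes found"
    fi
}

# On a live node, as root, it could be fed like this:
#   ps -e -o pid= -o args= | print_kill_command
```

Once the printed command has been checked against the ps output, it can be run manually, followed by crsctl start crs.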