Troubleshooting RBD Mount Problems with Rook Ceph

In this troubleshooting blog, we will discuss mount problems in a Rook Ceph cluster. Our customer Deasil Works was facing an issue where some of their client pods were failing to mount RBD volumes after a major network disruption in their Rook Ceph cluster, and we helped them investigate and solve it. There are several situations in which an application pod can have trouble mounting or accessing a block storage volume on the client side, including:
- The environment faced a network disruption and is still recovering from it.
- The Ceph cluster is not healthy.
- The CSI node plugin is unable to connect to the Ceph cluster.
- Commands are getting stuck in the Ceph CSI node plugin pods.
Troubleshooting and Investigation
We began our troubleshooting and investigation with the following steps:
Step 1: Run the following command to describe the client pod that is failing to mount the volume:
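```bash
# Describe the affected pod; <pod-name> and <namespace> are placeholders
kubectl describe pod <pod-name> -n <namespace>
```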
Step 2: Check the Events: section in the output of the above command.
Here’s an example from the issue we were investigating with Deasil Works:
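An event of this shape is what to look for (pool, image, and volume names elided):

```
Events:
  Type     Reason       Age  From     Message
  ----     ------       ---  ----     -------
  Warning  FailedMount  2m   kubelet  MountVolume.MountDevice failed for volume "pvc-..." : rpc error: code = Internal desc = rbd image <pool>/csi-vol-... is still being used
```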
Step 3: Identify the error message. We found the following error message:
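Stripped of the event wrapper, this is the standard ceph-csi error reported when an RBD image still has an active watcher:

```
rpc error: code = Internal desc = rbd image <pool>/<image-name> is still being used
```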
The PVC's mount point on the client pod was failing and was reported as still in use: the transient network issue had disrupted the client while the volume was still mounted, leaving a stale watcher holding the RBD image.
Resolution
After finding that out, we solved the issue the following way (a command sketch follows these steps):
- Find the corresponding RBD image that is still being used.
- Unmount and unmap the RBD image if it is mapped.
- If the above step is unsuccessful, remove the watcher by blocklisting its ip+nonce:
```
ceph osd blocklist add <WATCHER-IP>
```
- Wait a couple of minutes for the pod to recover.

The client application pod should then recover successfully.
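For reference, here is a minimal sketch of those steps, assuming a ceph-csi RBD volume; the PV, pool, and image names are placeholders, and the exact volumeAttributes fields can vary by ceph-csi version:

```bash
# 1. Find the RBD image backing the PVC (ceph-csi records it in the PV's volumeAttributes)
kubectl get pv <pv-name> -o jsonpath='{.spec.csi.volumeAttributes.imageName}'

# 2. On the node where the pod was running, unmap the image if it is still mapped
rbd unmap <pool>/<image-name>

# 3. From the toolbox pod, list any watchers still holding the image
rbd status <pool>/<image-name>

# 4. If a stale watcher remains, blocklist the address it reports (ip:port/nonce)
ceph osd blocklist add <WATCHER-IP>
```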
There’s a similar issue, where the error reports that the CSI volume already exists, that can be solved the same way.
The root cause is typically in the Ceph cluster or in network connectivity. If the issue is in provisioning the PVC, restarting the provisioner pods helps: for CephFS, restart the csi-cephfsplugin-provisioner-xxxxxx pod; for RBD, restart the csi-rbdplugin-provisioner-xxxxxx pod. If the issue is in mounting the PVC, restart the csi-rbdplugin-xxxxx pod for RBD, or the csi-cephfsplugin-xxxxx pod for CephFS.
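A minimal sketch of those restarts, assuming the default rook-ceph namespace and the standard Rook labels on the CSI pods:

```bash
# Restart the RBD provisioner pods (for provisioning issues)
kubectl -n rook-ceph delete pod -l app=csi-rbdplugin-provisioner

# Restart the RBD node plugin pods (for mounting issues)
kubectl -n rook-ceph delete pod -l app=csi-rbdplugin
```

The deleted pods are recreated automatically by their Deployment and DaemonSet.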
Run the following commands from the toolbox pod to check the Ceph cluster's health:
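For example:

```bash
# Overall cluster health, with details on any warnings or errors
ceph status
ceph health detail

# Per-OSD state, useful after a network disruption
ceph osd status
```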
Further, it’s always helpful to refer to the Rook CSI common issues page for more information on troubleshooting these issues.
We received great feedback from Jeff Masud at Deasil Works, Inc., who was very happy with the quick support we provided.
Thanks for reading! Please contact us if you need any help with Rook Ceph!