Troubleshooting RBD Mount problems with Rook Ceph

In this troubleshooting blog, we discuss mount problems in a Rook Ceph cluster. Our customer Deasil Works was facing an issue where some of their client pods were failing to mount RBD volumes after a recent major network disruption in their Rook Ceph cluster, and we helped them investigate and solve it. There are several situations in which an application pod can face disruption in mounting or accessing a block storage volume on the client side, for example:

  • The environment faced a network disruption and is still recovering from it.
  • The Ceph cluster might not be healthy.
  • The CSI node plugin is not able to connect to the Ceph cluster.
  • Commands are getting stuck in the Ceph CSI node plugin pods.
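
As a quick first check, you can verify that the Ceph CSI node plugin pods are up and look at their logs. The following is a minimal sketch assuming Rook’s default rook-ceph namespace and the standard app labels and container names on the CSI pods:

# List the RBD CSI node plugin pods and confirm they are Running
kubectl -n rook-ceph get pods -l app=csi-rbdplugin

# Check the logs of the csi-rbdplugin container in one of those pods
kubectl -n rook-ceph logs <csi-rbdplugin-pod> -c csi-rbdplugin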

Troubleshooting and Investigation

We began our troubleshooting and investigation by following the steps below:

Step 1: To begin troubleshooting this issue, run the following command to describe the client pod that is having the issue:

kubectl describe pod <pod> -n <namespace>

Step 2: Check the Events: section in the output of the above command. Here’s an example from the issue we were investigating with Deasil Works:

Events:
  Type     Reason       Age                     From     Message
  ----     ------       ----                    ----     -------
  Warning  FailedMount  51m (x235 over 2d3h)    kubelet  Unable to attach or mount volumes: unmounted volumes=[rxtx-sale-data-volume], unattached volumes=[linkerd-identity-end-entity linkerd-identity-token rxtx-sale-data-volume linkerd-proxy-init-xtables-lock]: timed out waiting for the condition
  Warning  FailedMount  31m (x605 over 2d3h)    kubelet  Unable to attach or mount volumes: unmounted volumes=[rxtx-sale-data-volume], unattached volumes=[linkerd-proxy-init-xtables-lock linkerd-identity-end-entity linkerd-identity-token rxtx-sale-data-volume]: timed out waiting for the condition
  Warning  FailedMount  17m (x250 over 2d3h)    kubelet  Unable to attach or mount volumes: unmounted volumes=[rxtx-sale-data-volume], unattached volumes=[rxtx-sale-data-volume linkerd-proxy-init-xtables-lock linkerd-identity-end-entity linkerd-identity-token]: timed out waiting for the condition
  Warning  FailedMount  13m (x265 over 2d3h)    kubelet  Unable to attach or mount volumes: unmounted volumes=[rxtx-sale-data-volume], unattached volumes=[linkerd-identity-token rxtx-sale-data-volume linkerd-proxy-init-xtables-lock linkerd-identity-end-entity]: timed out waiting for the condition
  Warning  FailedMount  2m3s (x1077 over 2d3h)  kubelet  MountVolume.MountDevice failed for volume "pvc-881d8d43-6126-4bfc-9f83-fdca5a51a9f3" : rpc error: code = Internal desc = rbd image replicapool/csi-vol-11384a6a-3f88-11ed-bea7-0298a448c4b2 is still being used

Step 3: Identify the error message. We found the following error message:

MountVolume.MountDevice failed for volume "pvc-<id>" : rpc error: code = Internal desc = rbd image replicapool/csi-vol-<id> is still being used

The mount of the PVC on the client pod was failing and the RBD image was reported as still in use: the previous mount had been disrupted by a transient network issue while the image was still mapped, leaving a stale watcher behind.
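
To confirm this, you can look up the RBD image backing the PVC and check whether it still has a watcher. The following is a rough sketch, assuming a Ceph CSI provisioned volume and that the rbd command is run from the Rook toolbox pod:

# Find the RBD image name backing the PV (Ceph CSI stores it in the volume attributes)
kubectl get pv pvc-<id> -o jsonpath='{.spec.csi.volumeAttributes.imageName}'

# From the toolbox pod: show the status of that image, including its watchers
rbd status replicapool/csi-vol-<id>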

Resolution

After finding that out, we resolved the issue in the following way:

  • Find the corresponding RBD image which is still being used.

  • Unmount and unmap the RBD image on the node if it is mapped (see the sketch after this list).

  • If the above step is unsuccessful, remove the watcher by blocklisting its address (the ip+nonce shown for the watcher):

    ceph osd blocklist add <WATCHER-IP>

  • Wait for the pod to recover in a couple of minutes.
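
Putting the above steps together, the manual cleanup could look roughly like the sketch below. The device path and watcher address are placeholders and come from the output of rbd showmapped and rbd status respectively:

# On the node where the image is still mapped: list mapped RBD devices
rbd showmapped

# Unmount and unmap the stale device (use the device shown by rbd showmapped)
umount /dev/rbdX
rbd unmap /dev/rbdX

# If unmapping fails, blocklist the stale watcher from the toolbox pod and verify
ceph osd blocklist add <WATCHER-IP>
ceph osd blocklist ls
rbd status replicapool/csi-vol-<id>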

The client application pod should recover successfully.

There’s a similar issue that can be solved in the same way, where the error shows that an operation for the given CSI volume already exists:

GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID
0001-0009-rook-ceph-0000000000000001-8d0ba728-0e17-11eb-a680-ce6eecc894de already exists.

The root cause typically lies in the Ceph cluster or in network connectivity. If the issue is in provisioning the PVC, restarting the provisioner pods helps: for a CephFS issue, restart the csi-cephfsplugin-provisioner-xxxxxx pod; for RBD, restart the csi-rbdplugin-provisioner-xxxxxx pod. If the issue is in mounting the PVC, restart the csi-rbdplugin-xxxxx pod (for RBD) or the csi-cephfsplugin-xxxxx pod (for CephFS).
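
For example, assuming Rook’s default rook-ceph namespace and the standard app labels on the CSI pods, restarting them could look like this:

# Provisioning issues: restart the RBD provisioner pods
kubectl -n rook-ceph delete pod -l app=csi-rbdplugin-provisioner

# Mounting issues: restart the RBD node plugin pods
kubectl -n rook-ceph delete pod -l app=csi-rbdplugin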

Run the following commands from the toolbox pod to check the Ceph cluster health details:

ceph -s
ceph health detail
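
If you are not already inside the toolbox, you can open a shell in it with the following command (assuming the toolbox was deployed with its default name in the rook-ceph namespace):

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash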

Further, it’s always helpful to refer to the Rook CSI common issues page for more information on troubleshooting these issues.

We received great feedback from our customer Jeff Masud from Deasil Works, Inc., who was really happy with the quick support we provided.

Thanks for reading! Please contact us if you need any help with Rook Ceph!

Gaurav Sitlani March 20, 2023