iSCSI: failed to connect

DISCLAIMER: This post assumes you have some knowledge of OpenShift and OpenStack.

We run an OpenShift cluster on top of an OpenStack cluster, which in turn runs on top of bare metal machines.

Pods are ephemeral, so persistent storage has to be provided through volumes. In our case, volumes are dynamically provisioned by Cinder (the OpenStack block storage service), which in turn, depending on your configuration, may rely on the iSCSI protocol.

The iSCSI protocol can fail for different reasons, and we did not investigate them in depth; suffice it to say that in our case the problem arose after a yum -y upgrade.

Basically, OpenShift could no longer deploy pods, since Cinder could not provision volumes and Nova (the OpenStack compute service) could not attach them to the VM where the pods were being deployed.

On OpenShift the error manifested as pod has unbound PersistentVolumeClaims, while /var/log/nova/nova-compute.log showed the following:

ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/os_brick/initiator/connectors/iscsi.py", line 587, in _connect_single_volume
ERROR oslo_messaging.rpc.server     raise exception.VolumeDeviceNotFound(device='')
ERROR oslo_messaging.rpc.server os_brick.exception.VolumeDeviceNotFound: Volume device not found at .
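If you want to check quickly whether a compute node is hitting this same error, a small filter over the log can help. The following is only a minimal sketch: the function name find_volume_device_errors is made up for illustration, and it assumes the exception name appears in the log exactly as in the traceback above.

```python
import re
from typing import Iterable, List

# Assumption: the exception is logged with its fully qualified name,
# as in the last line of the traceback quoted in this post.
PATTERN = re.compile(r"os_brick\.exception\.VolumeDeviceNotFound")


def find_volume_device_errors(lines: Iterable[str]) -> List[str]:
    """Return the log lines that report VolumeDeviceNotFound (hypothetical helper)."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]


# Two lines taken from the traceback in this post, used as sample input.
sample = [
    "ERROR oslo_messaging.rpc.server     raise exception.VolumeDeviceNotFound(device='')",
    "ERROR oslo_messaging.rpc.server os_brick.exception.VolumeDeviceNotFound: Volume device not found at .",
]
for hit in find_volume_device_errors(sample):
    print(hit)
```

On a real compute node you would pass an open file handle for /var/log/nova/nova-compute.log instead of the sample list.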

Now, in iSCSI, a node (which can be either a target or an initiator) is identified by a unique name, so that storage can be managed regardless of its address. iSCSI names are formatted in two different ways, EUI and IQN; we're only interested in the second. An IQN (iSCSI Qualified Name) takes the form iqn.yyyy-mm.naming-authority:unique-name, where:

iqn – a fixed prefix.
yyyy-mm – the year and month when the naming authority was established. For example: 1992-08.
naming-authority – the organizational naming authority string, usually the reversed Internet domain name of the naming authority. For example: com.vmware.
unique-name – any name you want to use, such as the name of your host. For example: host-1.
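To make the IQN format described above concrete, here is a minimal sketch that splits an IQN into its parts. The names IQN and parse_iqn are hypothetical, and the regular expression is a simplification that only checks the overall shape (it does not validate the allowed character set). It also treats the ":unique-name" suffix as optional, which is an assumption beyond what this post describes.

```python
import re
from typing import NamedTuple, Optional


class IQN(NamedTuple):
    date: str                   # yyyy-mm
    naming_authority: str       # e.g. com.vmware
    unique_name: Optional[str]  # e.g. host-1; None if the suffix is absent


# Simplified shape check: iqn.yyyy-mm.naming-authority[:unique-name]
IQN_RE = re.compile(r"^iqn\.(\d{4}-\d{2})\.([^:]+)(?::(.+))?$")


def parse_iqn(name: str) -> IQN:
    """Split an IQN string into its components, or raise ValueError."""
    match = IQN_RE.match(name)
    if match is None:
        raise ValueError(f"not an IQN-formatted name: {name!r}")
    return IQN(match.group(1), match.group(2), match.group(3))


# Built from the example values used in the list above.
print(parse_iqn("iqn.1992-08.com.vmware:host-1"))
```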

If you get the status of the iSCSI service by running sudo systemctl status iscsid.service, you may notice that it tries to connect to an iSCSI node and the connection is refused. This, in turn, implies that all subsequent connections get refused and the errors cascade up to OpenShift.

By simply removing the culprit node folder in /var/lib/iscsi/nodes/<IQN> (the folder takes the name of the IQN you see in the iscsi status output) and restarting the iscsid.service, the problem is solved and everything works fine again.
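The cleanup step can be sketched in a few lines. This is not a hardened tool, just an illustration: remove_node_record is a hypothetical helper, the default path is the one mentioned in this post, and the function deletes the folder recursively, so double-check the IQN before running anything like it as root.

```python
import shutil
from pathlib import Path

# Path taken from this post; it may differ on other distributions.
DEFAULT_NODES_DIR = Path("/var/lib/iscsi/nodes")


def remove_node_record(iqn: str, nodes_dir: Path = DEFAULT_NODES_DIR) -> bool:
    """Delete the folder named after the given IQN (hypothetical helper).

    Returns True if a folder was removed, False if there was none.
    """
    # Refuse names that could escape nodes_dir or point at nodes_dir itself.
    if not iqn or "/" in iqn or iqn in (".", ".."):
        raise ValueError(f"suspicious node name: {iqn!r}")
    target = nodes_dir / iqn
    if not target.is_dir():
        return False
    shutil.rmtree(target)
    return True
```

After a successful removal, the service still has to be restarted, for example with sudo systemctl restart iscsid.service.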

P.S.

What really amazes me in these complex platforms is how an error can manifest itself at the top of the vertical stack and yet have its root cause at the bottom of the same stack. I think that, beyond systems for horizontal tracing, we also need systems for vertical tracing. That could simplify debugging such platforms. What do you all think?

P.P.S.

Many thanks to lorenzo.goglia@unisannio.it for the nice troubleshooting together 😉

EDIT: a previous version mentioned iscsi, but the service is iscsid

Woland

an experience in an OpenShift/OpenStack virtualization platform