ESXi reboot may take a long time and the host may remain unresponsive

You may see the following symptoms:

  • vCenter may show the host in a "not responding" state.
  • During reboot, loading the ESXi modules takes longer than expected.
  • After a successful reboot, the host UI page (https://host/ui) shows the error “503 Service Unavailable (Failed to connect to endpoint: [N7Vmacore4Http16LocalServiceSpecE:0x1f0a1118] _serverNamespace = / _isRedirect = false _port = 8309)”

The hostd service reports that it is running:

[root@ESXiHost1:~] /etc/init.d/hostd status
hostd is running.

You may see the following lines in vmkernel.log:

cpu7:38726)ALERT: hostd detected to be non-responsive
cpu6:32797)ScsiDeviceIO: 2652: Cmd(0x43bdc075e040) 0x1a, CmdSN 0x43d6 from world 0 to dev "naa.60050768018205118800000000000114" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
cpu2:33185)NMP: nmp_ResetDeviceLogThrottling:3343: Error status H:0x0 D:0x18 P:0x0 Sense Data: 0x0 0x0 0x0 from dev "naa.60050768018205118800000000000114" occurred 2451 times(of 2466 commands)
ScsiDeviceIO: 2652: Cmd(0x43bdc075e040) 0x1a, CmdSN 0x43da from world 0 to dev "naa.60050768018205118800000000000114" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
cpu1:33185)NMP: nmp_ResetDeviceLogThrottling:3343: Error status H:0x0 D:0x18 P:0x0 Sense Data: 0x0 0x0 0x0 from dev "naa.60050768018205118800000000000114" occurred 2457 times(of 2461 commands)
cpu3:33185)NMP: nmp_ResetDeviceLogThrottling:3343: Error status H:0x0 D:0x18 P:0x0 Sense Data: 0x5 0x20 0x0 from dev "naa.60050768018205118800000000000114" occurred 2416 times(of 2431 commands)
cpu3:33185)NMP: nmp_ResetDeviceLogThrottling:3343: Error status H:0x0 D:0x18 P:0x0 Sense Data: 0x5 0x20 0x0 from dev "naa.60050768018205118800000000000114" occurred 2460 times(of 2463 commands)
cpu6:32797)ScsiDeviceIO: 2652: Cmd(0x43bdc07dd740) 0x1a, CmdSN 0x4435 from world 0 to dev "naa.60050768018205118800000000000114" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
cpu1:33185)NMP: nmp_ResetDeviceLogThrottling:3343: Error status H:0x0 D:0x18 P:0x0 Sense Data: 0x5 0x20 0x0 from dev "naa.60050768018205118800000000000114" occurred 2461 times(of 2466 commands)
cpu7:38892)ALERT: hostd detected to be non-responsive
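
When many LUNs are involved, it can help to pull the affected device IDs out of the log rather than reading it by eye. The following is a minimal sketch, assuming the NMP lines have the same layout as the excerpt above; the function name conflict_devices is hypothetical, and the log path in the usage comment is the usual ESXi location but may differ on your host.

```shell
# Print the unique device IDs that logged Device Status 0x18
# in NMP "Error status" lines read from stdin.
conflict_devices() {
    grep 'D:0x18' \
        | sed -n 's/.*from dev "\(naa\.[0-9a-fA-F]*\)".*/\1/p' \
        | sort -u
}

# Example usage on the host (log path is an assumption):
#   conflict_devices < /var/log/vmkernel.log
```

Lines that report a different device status, such as the ScsiDeviceIO entries with D:0x0, are filtered out by the grep step.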

Decoding the device status (D:0x18) from the lines above with a SCSI sense code decoder tool gives the reason:

Device Status [0x18]
Name RESERVATION CONFLICT
Description This status is returned when a LUN is in a Reserved status and commands from initiators that did not place that SCSI reservation attempt to issue commands to it.
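
For quick triage without a decoder tool, the D: field can be mapped to its name with a small lookup. This is only a sketch covering a few common SCSI device status codes; the function name scsi_device_status_name is hypothetical, and the host (H:) and plugin (P:) fields are not handled.

```shell
# Map a SCSI device status code (the D: field) to its standard name.
# Only a handful of common codes are listed; anything else prints UNKNOWN.
scsi_device_status_name() {
    case "$1" in
        0x0|0x00) echo "GOOD" ;;
        0x2|0x02) echo "CHECK CONDITION" ;;
        0x8|0x08) echo "BUSY" ;;
        0x18)     echo "RESERVATION CONFLICT" ;;
        0x28)     echo "TASK SET FULL" ;;
        *)        echo "UNKNOWN" ;;
    esac
}

# Example usage:
#   scsi_device_status_name 0x18
```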

Open an SSH session to the ESXi host and run the following command to check the current configuration of the device:

Syntax:

esxcli storage core device list -d naa.id

Example:

[root@ESXiHost1:~] esxcli storage core device list -d naa.60050768018205118800000000000114
naa.60050768018205118800000000000114
Display Name: IBM Fibre Channel Disk (naa.60050768018205118800000000000114)
Has Settable Display Name: true
Size: 104448
Device Type: Direct-Access
Multipath Plugin: NMP
Devfs Path: /vmfs/devices/disks/naa.60050768018205118800000000000114
Vendor: IBM
Model: 2145
Revision: 0000
SCSI Level: 6
Is Pseudo: false
Status: on
Is RDM Capable: true
Is Local: false
Is Removable: false
Is SSD: false
Is VVOL PE: false
Is Offline: false
Is Perennially Reserved: false
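
The field of interest in this output is "Is Perennially Reserved". As a rough sketch, it can be extracted from the command output with a one-line filter; the function name perennially_reserved_value is hypothetical, and the field label is assumed to match the example output above.

```shell
# Read `esxcli storage core device list -d <id>` output on stdin and
# print the value of the "Is Perennially Reserved" field.
perennially_reserved_value() {
    sed -n 's/^[[:space:]]*Is Perennially Reserved:[[:space:]]*//p'
}

# Example usage on the host:
#   esxcli storage core device list -d naa.60050768018205118800000000000114 \
#       | perennially_reserved_value
```

A value of false, as in the example above, means the device has not yet been marked.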

To fix the issue, use the following esxcli command to mark the device as perennially reserved:

Syntax:

esxcli storage core device setconfig -d naa.id --perennially-reserved=true

Example:

[root@ESXiHost1:~] esxcli storage core device setconfig -d naa.60050768018205118800000000000114 --perennially-reserved=true
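
The command has to be run once per RDM LUN, so with several shared disks it may be convenient to generate the commands from a list of device IDs. The sketch below only prints the commands and does not execute anything, so they can be reviewed first; the function name setconfig_commands and the file name rdm_devices.txt are hypothetical.

```shell
# Read device IDs (one per line) on stdin and print one setconfig
# command per ID. Lines that do not start with "naa." are skipped.
# Nothing is executed here; review the output before running it.
setconfig_commands() {
    while IFS= read -r dev; do
        case "$dev" in
            naa.*)
                printf 'esxcli storage core device setconfig -d %s --perennially-reserved=true\n' "$dev"
                ;;
        esac
    done
}

# Example usage (rdm_devices.txt is a hypothetical file of device IDs):
#   setconfig_commands < rdm_devices.txt
```

After applying the commands, rerunning esxcli storage core device list -d for each device should show "Is Perennially Reserved: true".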

This issue occurs on ESXi hosts that run virtual machines with RDMs mapped. It is typically seen when an application-level cluster such as MSCS (Microsoft Cluster Service) or VCS (Veritas Cluster Server) is configured and the virtual machines share access to disks, which are usually Raw Device Mappings (RDMs). These RDMs are LUNs presented directly to the virtual machines. While a host reboots, the active nodes of the cluster keep using these LUNs and hold SCSI reservations on the shared disks. This slows down the ESXi boot process, because the host tries to interrogate each of these disks during storage discovery and its commands fail with a reservation conflict.
