VMware Error: NMP nmp_CompleteCommandForPath

Over the past couple of days, while debugging the connection between a VMware host and its storage, I found the following error in /var/log/vmkernel:

May 16 21:01:17 sh-myhost vmkernel: 30:08:48:10.811 cpu4:4360)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x41027fa02040) to NMP device "naa.600508b1001030394130373539301700" failed on physical path "vmhba2:C0:T0:L0" H:0x1 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

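At first I assumed this message was logged only when accessing a SAN LUN, but I later found similar errors for local disk access on other servers that are not connected to a SAN switch.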

Looking up the error code in the VMware KB turned up the following description of NMP errors/conditions:

For ESX 4.0 and later versions:

H:0xA D:0xB P:0xC Possible sense data: 0xD 0xE 0xF.

A = Host status (Initiator)
B = Device Status (Target)
C = Plugin (VMware Specific)
D = Sense Key
E = Additional Sense Code
F = Additional Sense Code Qualifier
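
The field layout above can be pulled out of a log line mechanically. The following is a minimal sketch, assuming the ESX 4.x message format shown here; the names STATUS_RE, FIELDS and parse_status are made up for illustration and are not part of any VMware tool.

```python
import re

# Status portion of an ESX 4.x vmkernel line (format as described above):
#   H:0xA D:0xB P:0xC Possible sense data: 0xD 0xE 0xF
STATUS_RE = re.compile(
    r"H:(0x[0-9a-fA-F]+) D:(0x[0-9a-fA-F]+) P:(0x[0-9a-fA-F]+) "
    r"Possible sense data: (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+)"
)

# Hypothetical field names, in the order A..F from the legend above.
FIELDS = ("host", "device", "plugin", "sense_key", "asc", "ascq")


def parse_status(line):
    """Return the six status fields as integers, or None if the line has none."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    return {name: int(value, 16) for name, value in zip(FIELDS, match.groups())}


# The log line quoted at the top of this post.
sample = (
    'NMP: nmp_CompleteCommandForPath: Command 0x2a (0x41027fa02040) '
    'to NMP device "naa.600508b1001030394130373539301700" failed on '
    'physical path "vmhba2:C0:T0:L0" H:0x1 D:0x0 P:0x0 '
    'Possible sense data: 0x0 0x0 0x0.'
)
print(parse_status(sample))
```

For the line from this post, only the host field is non-zero, which matches the observation that the failure is reported on the initiator side rather than by the device.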

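According to the definitions in "SCSI host-side NMP errors/conditions in ESX 4.x" and "Interpreting SCSI sense codes", a code of H:0x0 D:0x0 P:0x0 means that access to the disk or LUN is normal. If the host code is not H:0x0, you can find the corresponding error description in those documents.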

For machines without a Battery Backed Write Cache (BBWC) module, you need to purchase a BBWC module to resolve this problem:

http://communities.vmware.com/message/1724815
http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c01832427&lang=en&cc=us&taskId=135&prodSeriesId=420496&prodTypeId=18964

For errors in a SAN environment, the following thread gives a detailed analysis:

http://communities.vmware.com/thread/222692

I tried to investigate the issue, had a conversation with our SAN vendor, and I think that I do, in fact, have some answers.

(1) nmp_CompleteCommandForPath ... Command 0x2a to NMP device failed on physical path ... Possible sense data 0x0 0x0 0x0: 

(1a) Analysis:

Jul 28 08:39:54 vmware05 vmkernel: 0:20:44:22.115 cpu1:4259)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x4100020c7f40) to NMP device "naa.60050cc000205a840000000000000023" failed on physical path "vmhba2:C0:T1:L0" H:0x0 D:0x28 P:0x0 Possible sense data: 0x0 0x0 0x0.

Jul 28 08:39:54 vmware05 vmkernel: 0:20:44:22.115 cpu1:4259)ScsiDeviceIO: 747: Command 0x2a to device "naa.60050cc000205a840000000000000023" failed H:0x0 D:0x28 P:0x0 Possible sense data: 0x0 0x0 0x0.

The device status logged by VMware (D:0x28) stands for "TASK SET FULL". Our SAN vendor told us that, at least for them, this is a known "issue". In fact, it is not even a real issue. The explanation is: The SAN's controller has a write cache (for each array). When a single host, for example, writes a lot of data to a single array, the write cache might be full, and other hosts might not be able to write to the write cache. Our SAN offers a setting for "overload management". When overload management is enabled, the hosts that have to wait until the write cache is free will be sent the message "TASK SET FULL" by the SAN's controller. I.e., these hosts cannot write to the SAN at the moment and will have to wait. VMware waits and logs this event with the device status for "TASK SET FULL" to /var/log/vmkernel.

(1b) Additional information:

There is a VMware Knowledge Base article on SCSI sense codes: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=289902

The log message above contains the following codes:

- H:0x0 D:0x28 P:0x0 Possible sense data: 0x0 0x0 0x0

The interesting section here is the code starting with "D" (D stands for "device status"). Device status 0x28 means "TASK SET FULL".
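
As an illustration of that lookup, here is a small sketch that maps a device status byte to its name. The table lists the standard SCSI status codes as I understand them; the names SCSI_STATUS and device_status_name are hypothetical, and the table should be checked against the KB article above before relying on it.

```python
# Standard SCSI device status codes (the "D:" field); verify against the KB.
SCSI_STATUS = {
    0x00: "GOOD",
    0x02: "CHECK CONDITION",
    0x04: "CONDITION MET",
    0x08: "BUSY",
    0x18: "RESERVATION CONFLICT",
    0x28: "TASK SET FULL",
    0x30: "ACA ACTIVE",
    0x40: "TASK ABORTED",
}


def device_status_name(code):
    """Translate a device status byte into a readable name."""
    return SCSI_STATUS.get(code, "UNKNOWN (0x%02x)" % code)


# D:0x28 from the log message above.
print(device_status_name(int("0x28", 16)))
```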

(1c) Solution:

I re-configured our SAN. The write cache setting for each array was set to "maximum", and I reduced it to a fixed amount. Hence, the arrays really act independently from each other. (Otherwise a write cache congestion on one array can have an impact on other arrays). Moreover, I changed the "overload management setting" from:

- Enabled: Commands that can not be accepted before the response timeout will fail with the status TASK SET FULL (0x28).

to:

- Disabled: No target queue full timeout will be enforced. Commands will wait until they can be processed or are timed out in the transport layer.

Furthermore, I activated the option "Enable cache Writethrough operation when write cache is full." (I prefer slow write operations to the SAN to no write operations.) 

(1d) Note:

The log messages do not appear any longer. (At least at the moment.) However, the log messages did not appear in ESX 3.5-U2 anyway -- they only started appearing in ESX 4.0. So either ESX 4.0 handles SCSI write commands in a different way (rather unlikely) or ESX 4.0 simply logs more or increasingly detailed messages.

(2) nmp_RegisterDevice: Registration of NMP device failed:

(2a) Analysis:

Jul 26 01:34:43 vmware05 vmkernel: 3:15:35:02.465 cpu2:4418)WARNING: NMP: nmp_RegisterDevice: Registration of NMP device with primary uid 'mpx.vmhba1:C0:T1:L6' failed. Already exists

Jul 26 01:34:43 vmware05 vmkernel: 3:15:35:02.466 cpu2:4418)WARNING: NMP: nmp_RegisterDevice: Registration of NMP device with primary uid 'mpx.vmhba2:C0:T0:L6' failed. Already exists

Jul 26 01:34:43 vmware05 vmkernel: 3:15:35:02.466 cpu2:4418)WARNING: NMP: nmp_RegisterDevice: Registration of NMP device with primary uid 'mpx.vmhba2:C0:T1:L6' failed. Already exists

I have six LUNs on the SAN (LUN 0 through LUN 5). LUN 6 is the SAN's controller. So these error messages correspond to the SAN's controller and not to any of the datastores.

Unfortunately, I do not have an answer for this issue yet ... and /var/log/vmkernel is filling up rapidly -- at 26,000 lines or 4.5 MB per hour.

What I'd like to see is an ESX setting that lets me disable these messages for a given LUN.
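
Until such a setting exists, it can at least help to see which device uids produce the bulk of the warnings. The following is a rough sketch, assuming the message format quoted in (2a); UID_RE and count_registration_failures are names invented for this example, and in practice the lines would come from /var/log/vmkernel rather than a hard-coded list.

```python
import re
from collections import Counter

# Matches the nmp_RegisterDevice warning quoted in (2a) and captures the uid.
UID_RE = re.compile(
    r"nmp_RegisterDevice: Registration of NMP device "
    r"with primary uid '([^']+)' failed"
)


def count_registration_failures(lines):
    """Count failed NMP device registrations per primary uid."""
    counts = Counter()
    for line in lines:
        match = UID_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The three warnings from (2a), shortened to the relevant part.
sample_lines = [
    "WARNING: NMP: nmp_RegisterDevice: Registration of NMP device with primary uid 'mpx.vmhba1:C0:T1:L6' failed. Already exists",
    "WARNING: NMP: nmp_RegisterDevice: Registration of NMP device with primary uid 'mpx.vmhba2:C0:T0:L6' failed. Already exists",
    "WARNING: NMP: nmp_RegisterDevice: Registration of NMP device with primary uid 'mpx.vmhba2:C0:T1:L6' failed. Already exists",
]
for uid, count in sorted(count_registration_failures(sample_lines).items()):
    print(uid, count)
```

Every uid in the sample ends in L6, which is consistent with the warnings belonging to the SAN's controller LUN rather than to a datastore.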
