A 4-node GlusterFS cluster began throwing “atomic test and set of disk block returned false for equality” errors after a power outage. Metadata operations hung, and thick provisioning failed.
Resolving this error involves isolating whether the issue stems from the storage fabric, the host configuration, or the storage array itself.

Step 1: Identify the Affected Datastore and Hosts
Some storage devices (especially network-attached ones) do not support true atomic test-and-set across power failures or multiple initiators. If your device cannot guarantee sector-level write atomicity, you may see spurious equality failures. Switch to a device that supports the SCSI COMPARE AND WRITE command or NVMe's fused Compare and Write operation.
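To make the semantics concrete, here is a minimal sketch of what an atomic compare-and-write is expected to do. It is a toy in-memory model, not a real driver interface: the `BlockDevice` class, the `owner_record` helper, and the node IDs are all hypothetical names chosen for illustration, and a Python lock stands in for the atomicity the hardware would have to provide.

```python
import threading

BLOCK_SIZE = 512  # assumed sector size for this toy model


class BlockDevice:
    """Toy in-memory device (hypothetical, for illustration only)."""

    def __init__(self, num_blocks):
        self._blocks = [bytes(BLOCK_SIZE) for _ in range(num_blocks)]
        # The lock stands in for the device's own atomicity guarantee.
        self._lock = threading.Lock()

    def compare_and_write(self, lba, expected, new):
        """Write `new` to block `lba` only if it currently equals `expected`."""
        with self._lock:
            if self._blocks[lba] != expected:
                return False  # the "returned false for equality" case
            self._blocks[lba] = new
            return True

    def read(self, lba):
        with self._lock:
            return self._blocks[lba]


def owner_record(node_id):
    """Encode a node ID as a zero-padded lock block (assumed layout)."""
    return node_id.to_bytes(4, "big").ljust(BLOCK_SIZE, b"\x00")


dev = BlockDevice(num_blocks=8)
free = bytes(BLOCK_SIZE)  # an all-zero block means "unlocked" in this sketch

# Two nodes both believe the lock block is free and try to claim it.
first = dev.compare_and_write(3, free, owner_record(1001))
second = dev.compare_and_write(3, free, owner_record(1002))
print(first, second)
```

The first caller wins and the second is rejected because the block no longer matches what it expected. A device that cannot perform the compare and the write as one indivisible step could let both callers succeed, or report a mismatch that never really happened.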
Resolving this issue requires looking at the application layer, the file system, and the storage hardware.

1. Increase Lock Timeouts
dlm: atomic test and set of disk block 1048576 returned false for equality (expected=0, got=1002)
dlm: lock acquisition failed. Node 1002 already owns the lock.
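When triaging many of these messages, it can help to pull out the block number and the expected and observed values programmatically. The sketch below assumes the log format matches the sample line shown above exactly; the regular expression and the `parse_ats_failure` function name are illustrative, not part of any real tool.

```python
import re

# Pattern derived from the sample log line; other systems may format it differently.
ATS_LINE = re.compile(
    r"atomic test and set of disk block (?P<block>\d+) "
    r"returned false for equality "
    r"\(expected=(?P<expected>\d+), got=(?P<got>\d+)\)"
)


def parse_ats_failure(line):
    """Return block/expected/got as integers, or None if the line does not match."""
    match = ATS_LINE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


sample = (
    "dlm: atomic test and set of disk block 1048576 "
    "returned false for equality (expected=0, got=1002)"
)
info = parse_ats_failure(sample)
print(info)
```

In the sample, the `got` value matches the node named in the follow-up message as the current lock owner, so grouping failures by `block` and `got` is one way to see which node is holding contested locks.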
If multiple processes are competing to update the same disk block, one will succeed, and the others will fail the test-and-set. This is normal behavior, but if it happens persistently, it causes performance degradation.

B. Misconfigured Distributed Locking
(Note: Consult VMware and your storage vendor documentation before changing advanced system parameters.)

Step 5: Space Out Heavy I/O Workloads
If the errors occur during specific times (e.g., midnight backups or automated patch schedules), the storage array is likely experiencing transient saturation. Stagger backup schedules. Limit the number of concurrent VM migrations (vMotion).
Check for physical link errors (e.g., CRC errors on Fibre Channel switches) that could cause ATS command timeouts.

Step 4: Fall Back to ATS+SCSI Locking (If Necessary)
The host may issue a full reset on the LUN to "clear the air," which aborts all active I/O for every VM on that datastore.

Degraded Path Redundancy