DRBD State change failure

I manage a number of multi-node clusters running on VMware, Xen, etc. Recently some of the nodes have started refusing to fail over to another node. The error message is always the same:
State change failed: (-12) Device is held open by someone.
The puzzling part is that the application running on that volume has already stopped. The volume also unmounts and remounts with no issues. lsof returns no matches, and looking through /dev and /proc turned up no clues.
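For reference, the userspace checks described above can be sketched roughly as follows. The device path /dev/drbd0 is a placeholder rather than the actual device from these clusters, and the helper names run and check_holders are made up for illustration. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run wrapper: prints each command instead of executing it.
# Set DRY_RUN=0 to really run the commands (needs root).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# check_holders DEVICE
# Looks for anything in userspace that still has the device open.
# /dev/drbd0 in the usage below is a placeholder device path.
check_holders() {
    dev=$1
    run grep "$dev" /proc/mounts   # is it still mounted anywhere?
    run lsof "$dev"                # processes with the device node open
    run fuser -vm "$dev"           # processes using the device or a filesystem on it
}

check_holders /dev/drbd0
```

In the case described here all of these came back empty, which is what made the error so confusing.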
Searching the support forums always turned up the same explanations: the volume was not unmounted, or it's an application that failed. The issue is that some of these hosts are running 16 other VM hosts, all of which are important, so a reboot is not really an option, particularly if the problem is recurring.
The solution turned out to be simple. After feeling defeated by the problem, I started thinking that whatever was holding the resource open was very low level, a kernel module for example. The fix was to stop the backup agent: /etc/init.d/VRTSralus.init stop. That was it. It seems that the Symantec Veritas Backup agent can somehow get into a state where it convinces DRBD that all of its resources are still in use even when they appear to be free. Once that happens, drbdadm secondary {resource} will always return the state change failure message.
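Put together, the fix above amounts to something like the sketch below. The resource name r0 is a placeholder, and the helper names run and demote_resource are made up for illustration; only the init script path and the drbdadm secondary command come from the steps above. It prints the commands by default and only executes them when DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run wrapper: prints each command instead of executing it.
# Set DRY_RUN=0 to really run the commands (needs root).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# demote_resource RESOURCE [AGENT_INIT_SCRIPT]
# Stops the backup agent, then retries the demotion that was failing.
# RESOURCE is a placeholder name; the agent path defaults to the one
# mentioned in the post.
demote_resource() {
    resource=$1
    agent=${2:-/etc/init.d/VRTSralus.init}
    run "$agent" stop || return 1
    run drbdadm secondary "$resource"
}

demote_resource r0
```

Whether and when to start the backup agent again afterwards is left out here, since that depends on the backup schedule.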
Simple fix for a simple problem.
