Tuesday, March 6, 2012

The cluster service has determined that this node does not have the latest copy of cluster configuration data

A few days ago I got a big surprise on my Exchange 2010 servers. While doing a routine daily check in the Exchange Management Console, I noticed that one of my DAG servers was reporting its status as Failed. This looked really weird, since the server itself was online and I could ping it.

I RDP'd to the problematic server and checked the event logs. To my dismay, I found that the following was being logged in the System log:

Event ID 1564 - File share witness resource 'File Share Witness (\\Server1.domain.com\DAG.domain.com)' failed to arbitrate for the file share '\\Server1.domain.com\DAG.domain.com'. Please ensure that file share '\\Server1.domain.com\DAG.domain.com' exists and is accessible by the cluster.

I opened the Failover Cluster Manager and found that the above server had been marked as down. This explained why the server was being reported with a status of Failed in the Exchange Management Console.

I checked the permissions on the DAG share named in the event. The NTFS permissions looked alright. However, the share permissions had an unresolved SID. I took this to be the culprit, and after a bit of googling I found that the cluster computer account should be listed in the share permissions with Full Control. Since it was not listed, I took the liberty of adding DAG$ to the share permissions with Full Control (my DAG is called DAG .. yea yea very original).

After doing this, I noticed that Event ID 1564 was no longer being logged, but the following error appeared instead:
"Event ID 1561 The cluster service has determined that this node does not have the latest copy of cluster configuration data."

And to rub salt in the wound, the server was still marked as down in Failover Cluster Manager. I googled the error and managed to find some articles on it. One of the Microsoft support articles said to start all the other nodes; once they were running, the affected node would read the configuration from them and start. However, since I already had a node running (the other DAG server), this didn't quite apply.

I tried restarting the Cluster service on the affected server, but this did not resolve the issue. Restarting the server was no help either.

Finally, with a stroke of genius (and luck), I decided to restart the whole cluster. In Failover Cluster Manager, I right-clicked the cluster name (which, quite originally, was called dag.domain.com) and under More Actions selected Shut down Cluster. A prompt came up asking if I really wanted to shut down the cluster. I chose Yes.

After a few minutes (well, actually 2 minutes), with fingers crossed, I started the cluster again (in Failover Cluster Manager, right-click the DAG name and under More Actions select Start Cluster). Voilà, both DAG servers came back online!
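
For anyone who prefers a script to the GUI, here is a rough sketch of the same stop, wait, start sequence. To be clear, I did all of this by hand in Failover Cluster Manager; the script is only an assumed equivalent that builds the matching FailoverClusters PowerShell commands (Stop-Cluster, Start-Cluster, Get-ClusterNode) from Python. The function names are made up for illustration, and by default it only returns the commands without running anything, so you can review them first.

```python
import subprocess


def build_cluster_restart_commands(cluster_name, wait_seconds=120):
    """Return the PowerShell commands, in order, for a full cluster stop/start.

    Hypothetical helper: mirrors the manual "Shut down Cluster" /
    "Start Cluster" steps, with a pause in between.
    """
    return [
        # The cluster cmdlets live in the FailoverClusters module.
        "Import-Module FailoverClusters",
        # -Force suppresses the "are you sure?" confirmation prompt.
        f"Stop-Cluster -Cluster '{cluster_name}' -Force",
        # Give the nodes some time to go down (2 minutes in my case).
        f"Start-Sleep -Seconds {wait_seconds}",
        f"Start-Cluster -Name '{cluster_name}'",
        # List the nodes so their state can be checked afterwards.
        f"Get-ClusterNode -Cluster '{cluster_name}'",
    ]


def run_commands(commands, dry_run=True):
    """Join the commands into one script; execute it only if dry_run is False."""
    script = "; ".join(commands)
    if not dry_run:
        # Assumes powershell.exe is on PATH and the session has cluster admin rights.
        subprocess.run(
            ["powershell.exe", "-NoProfile", "-Command", script],
            check=True,
        )
    return script


if __name__ == "__main__":
    # Dry run: just print what would be executed.
    print(run_commands(build_cluster_restart_commands("dag.domain.com")))
```

Shutting down the cluster takes every node offline, so keep the dry run as the default and only pass dry_run=False once you are happy with the printed commands.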

I quickly checked the Exchange Management Console and saw that both servers were now being reported as online. The problematic server was now being updated from the other server (you might see a huge CPU spike on the problematic server while the updates are copied to it).

Take care, and until next time. And remember, with Windows, you restart :)
