Monday, December 3, 2012

Windows Failover Cluster Live Migration Failures with Hyper-V 2012

After moving from one datacenter to another, we started having trouble live migrating virtual machines between the hosts in our two-node failover cluster.  The migration would fail instantly, with no error other than:
Live migration of 'Virtual Machine VMNAME' failed.

A quick migration worked, but a live migration did not.  I started looking at the security logs on the hosts and noticed some intermittent errors:

An account failed to log on.

Subject:
Security ID: SYSTEM
Account Name: HYPERVHOSTCOMPUTER$
Account Domain: OURDOMAIN
Logon ID: 0x3E7

Logon Type: 8

Account For Which Logon Failed:
Security ID: NULL SID
Account Name: HYPERVHOSTCOMPUTER
Account Domain: OURDOMAIN.com

Failure Information:
Failure Reason: Unknown user name or bad password.
Status: 0xC000006D
Sub Status: 0xC000006A

Process Information:
Caller Process ID: 0xd30
Caller Process Name: C:\Windows\Cluster\rhs.exe

Network Information:
Workstation Name: HYPERVHOSTCOMPUTER
Source Network Address: -
Source Port: -

Detailed Authentication Information:
Logon Process: Advapi  
Authentication Package: Negotiate
Transited Services: -
Package Name (NTLM only): -
Key Length: 0

This event is generated when a logon request fails. It is generated on the computer where access was attempted.

The Subject fields indicate the account on the local system which requested the logon. This is most commonly a service such as the Server service, or a local process such as Winlogon.exe or Services.exe.

The Logon Type field indicates the kind of logon that was requested. The most common types are 2 (interactive) and 3 (network).

The Process Information fields indicate which account and process on the system requested the logon.

The Network Information fields indicate where a remote logon request originated. Workstation name is not always available and may be left blank in some cases.

The authentication information fields provide detailed information about this specific logon request.
- Transited services indicate which intermediate services have participated in this logon request.
- Package name indicates which sub-protocol was used among the NTLM protocols.
- Key length indicates the length of the generated session key. This will be 0 if no session key was requested.
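
As a reading aid, the numeric fields in the event above can be decoded with a small lookup. The sketch below is illustrative only: the function name `describe_logon_failure` and the tables are assumptions of this example, and they cover just a few well-known logon types and NTSTATUS codes rather than the full lists.

```python
# Illustrative lookup tables; only a handful of well-known values are included.
LOGON_TYPES = {
    2: "Interactive",
    3: "Network",
    4: "Batch",
    5: "Service",
    7: "Unlock",
    8: "NetworkCleartext",
    10: "RemoteInteractive",
}

NTSTATUS = {
    0xC000006D: "STATUS_LOGON_FAILURE (bad user name or authentication information)",
    0xC000006A: "STATUS_WRONG_PASSWORD (user name is correct, password is wrong)",
    0xC0000064: "STATUS_NO_SUCH_USER (account does not exist)",
}


def describe_logon_failure(logon_type, status, sub_status):
    """Translate the numeric fields of a failed-logon event into text.

    status and sub_status are hex strings as shown in the event,
    for example "0xC000006D". Unknown values are reported as "unknown".
    """
    def lookup(hex_text):
        return NTSTATUS.get(int(hex_text, 16), "unknown")

    return {
        "logon_type": LOGON_TYPES.get(logon_type, "unknown"),
        "status": lookup(status),
        "sub_status": lookup(sub_status),
    }


if __name__ == "__main__":
    # Values copied from the event in this post.
    for field, meaning in describe_logon_failure(8, "0xC000006D", "0xC000006A").items():
        print(f"{field}: {meaning}")
```

For the event in this post, the sub status is the interesting part: 0xC000006A means the account exists but the password supplied for it is wrong.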

Then I noticed errors in the cluster itself, at the same times:

Cluster network name resource 'Cluster Name' failed registration of one or more associated DNS name(s) for the following reason:
The handle is invalid.
.

Ensure that the network adapters associated with dependent IP address resources are configured with at least one accessible DNS server.

I looked at a domain controller and noticed a lot of Audit Failures for that computer object.  I opened the computer object in ADSI Edit and noticed that the last login was 11/23 (the day we moved) and the last password reset was 11/24, which is incredibly odd.  The last bad login attempt was only a few minutes old.  I'm not sure how, but I think a password reset may have been attempted while the domain controllers were unavailable, which would explain the bad-password failures above.
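
If you read these timestamps as raw values (for example from an LDAP export) instead of through ADSI Edit, attributes such as `pwdLastSet`, `lastLogon`, and `badPasswordTime` are stored as Windows FILETIME integers: a count of 100-nanosecond intervals since January 1, 1601 UTC. Here is a minimal sketch of the conversion. The helper name `filetime_to_datetime` is made up for this example, and which attributes correspond to the dates mentioned above is an assumption on my part.

```python
from datetime import datetime, timedelta, timezone

# FILETIME counts 100-nanosecond intervals from this instant (UTC).
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def filetime_to_datetime(filetime):
    """Convert a Windows FILETIME integer to a timezone-aware UTC datetime.

    Integer division by 10 turns 100-ns ticks into whole microseconds,
    which avoids floating-point rounding on these very large numbers.
    """
    return FILETIME_EPOCH + timedelta(microseconds=filetime // 10)


if __name__ == "__main__":
    # 116444736000000000 is the FILETIME value of the Unix epoch.
    print(filetime_to_datetime(116444736000000000).isoformat())
```

Comparing the converted password-set time with the last successful logon makes it easier to see whether the password changed after the account last authenticated.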

How I fixed it:

  1. Open Failover Cluster Manager
  2. Navigate to Cluster Core Resources
  3. Right-click the cluster network name and take it offline
  4. Right-click the cluster name and choose More Actions -> Repair


A few seconds later the cluster name was repaired. I brought the cluster name back online, and live migrations worked again.



Mystery solved.

HTH!

9 comments:

  1. This is pure magic! I spent 2 days troubleshooting this issue. I even destroyed & re-created the cluster.

    Your fix is brilliant, and just "works"!

    Replies
    1. Hah, I'm happy I was able to help somebody! I fought this problem off and on (mostly off) for about a week before I decided it was a priority to fix.

  2. Broke my head for 24 hours over this.

    big thumbs up to you for posting.

  3. Great post. Fixed in 5 min after reading this.

  4. Hi Andrew,
    Were you able to do this while the VMs in the cluster were running?

  5. What happens to the VMs running on the cluster? Will they be inaccessible until the cluster name is online again?

    Replies
    1. They are available, but we were unable to perform live migrations from one node to another.
