Friday, December 17, 2010

vSphere cluster: max 4 ESX hosts per “location” because of HA limitations?


 Duncan Epping has a couple of extremely interesting posts on Yellow-Bricks.com about HA, including how to select or promote the HA role of ESX nodes (a must read!), but I want more …
Let’s start with what I assume to be true about HA:
- HA works with primary and secondary HA nodes
- The primary nodes are aware of the states and configs of all nodes in an HA cluster
- The secondary nodes depend on the primary nodes
- There is a supported limit of 5 primary HA nodes per cluster
- The first 5 ESX hosts that are added to an HA cluster are initially defined as primary HA nodes
- All other hosts that are added to the HA cluster are configured as secondary HA nodes
- There’s a way to configure an HA node as primary or secondary; however, it’s not possible to pin an ESX host as a “fixed” primary HA node:

  • /opt/vmware/aam/bin/Cli
AAM> promotenode   (configure the host as a primary HA node)
AAM> demotenode    (configure the host as a secondary HA node)

- One primary HA node is the Active Primary HA node; this node coordinates the restarts of the VMs that went down with the crashed host.
- When the Active Primary HA node goes down, another primary is (s)elected as the Active Primary HA node and takes over the coordinating role.
- A new primary is chosen when an existing primary is disconnected from the cluster in one of these situations:
  •  (Re)configuring HA on a host
  • Disconnecting a host from the cluster (manually or by failure)
  • Removing a host from the cluster
  • In case of a HA failure
  • Putting a host into maintenance mode
Especially when you read that last bullet, it becomes clear that HA roles are really dynamic in a VI/vSphere environment. This means that you have no control over the physical location of the primary and secondary roles.

And this is what my post is about:
This situation freaks me out, because when you have a larger environment with a couple of “possible failure domains”, as I’d like to call them (any physically separated group of hosts within an HA cluster, such as different blade chassis or different server rooms), you want to have control over the placement of these HA roles.
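To get a feel for your current exposure you could audit the role placement yourself. The snippet below is only a minimal sketch in plain Python: the host names, the host-to-chassis mapping and the list of current primaries are hypothetical placeholders that you would have to fill in from your own inventory (for example after checking the node roles on each host with the AAM Cli). It simply warns when every primary HA node sits in the same failure domain.

# Minimal sketch: warn when all primary HA nodes share one failure domain.
# All names below are hypothetical; fill them in from your own environment.
host_to_domain = {
    "esx01": "chassis-A", "esx02": "chassis-A", "esx03": "chassis-A",
    "esx04": "chassis-A", "esx05": "chassis-A",
    "esx06": "chassis-B", "esx07": "chassis-B", "esx08": "chassis-B",
    "esx09": "chassis-B", "esx10": "chassis-B",
}
primary_nodes = ["esx01", "esx02", "esx03", "esx04", "esx05"]  # current primary HA nodes

domains_with_primaries = set(host_to_domain[h] for h in primary_nodes)
if len(domains_with_primaries) == 1:
    print("WARNING: all primaries are in %s - losing it disables HA restarts!"
          % domains_with_primaries.pop())
else:
    print("Primaries are spread over: %s" % ", ".join(sorted(domains_with_primaries)))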

As I stated earlier, Duncan Epping has some interesting articles, like the HA deep dive and “Primary and Secondary nodes, pick one!”, which describe how to select a role for a host. But this selection is not static: whenever a primary host is disconnected (maintenance mode, Reconfigure HA and so on) there is a re-election and you lose control over the role placement.

So what if all 5 primary HA nodes are in the same “possible failure domain” (say a blade chassis) and that chassis goes down? Well, you just lost all the HA nodes that know what to do in case of a host failure, so HA won’t work!
We’ll have to nuance the drama a bit: if 5 hosts of a 10-host ESX cluster go down, you have a major issue anyway, whether HA works or not, because you have lost half of your capacity.

But do realize that if HA is configured correctly, the 5 remaining hosts have some resources available, your primaries are separated over the 2 locations and you have defined the start-up rules for the most important VMs, then these important VMs will be booted up.

If you have the same situation as above, but with all 5 primary HA nodes down because they were physically grouped, HA won’t work and none of the crashed VMs will be booted up automatically!

During VMworld 2009, Marc Sevigny from VMware explained that they were looking into an option which would enable you to pick your primary hosts. This would solve the problem, but until then the only solution is to keep your clusters limited to a total of 8 ESX hosts: 4 ESX hosts per “possible failure domain”. With 5 primaries and at most 4 hosts per failure domain, at least one primary always ends up in the other domain, so a single-chassis failure can never take them all out.
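For anyone who likes to see that pigeonhole argument spelled out, here is a tiny, purely illustrative sketch in plain Python (the host names are made up) that brute-forces every possible placement of 5 primaries over an 8-host cluster split 4/4 over two chassis, and confirms that no single chassis failure can ever take out all of them:

# Tiny sketch: 8 hosts split 4/4 over chassis A and B, 5 primary HA nodes.
# Check every possible primary placement against the loss of either chassis.
from itertools import combinations

hosts = ["A1", "A2", "A3", "A4", "B1", "B2", "B3", "B4"]  # 4 hosts per chassis

for primaries in combinations(hosts, 5):       # every way HA could pick 5 primaries
    for chassis in ("A", "B"):                 # simulate losing one chassis
        survivors = [h for h in primaries if not h.startswith(chassis)]
        assert survivors, "a single chassis failure took out all primaries"

print("OK: at least one primary HA node always survives a single chassis failure")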
I’m curious if I’m the only one running into this challenge; please let me know!

P.S. Special kudos go to Remon Lam from vminfo.nl, who discovered this “feature” and reviewed the article.
