Friday, December 17, 2010

vSphere cluster: max 4 ESX hosts per “location” because of HA limitations?


 Duncan Epping has a couple of extremely interesting posts on Yellow-Bricks.com about HA, right down to selecting or promoting the HA role of ESX nodes (a must-read!), but I want more …
Let’s start with what I believe to be true about HA:
- HA works with primary and secondary HA nodes
- The primary nodes are aware of the states and configs of all nodes in an HA cluster
- The secondary nodes depend on the primary nodes
- There is a supported limit of 5 primary HA nodes per cluster
- The first 5 ESX hosts that are added to an HA cluster are initially defined as primary HA nodes
- All the other hosts that are added to the HA cluster are configured as secondary HA nodes
- There’s a way to configure an HA node as primary or secondary; however, it’s not possible to configure an ESX host as a “fixed” primary HA node (see the sketch further below):

  • /opt/vmware/aam/bin/Cli
    AAM> promotenode   (configure the host as a primary HA node)
  • /opt/vmware/aam/bin/Cli
    AAM> demotenode    (configure the host as a secondary HA node)

- One primary HA node is the Active Primary HA node; this node coordinates the restarts of the VMs that went down with the crashed host.
- When the Active Primary HA node goes down, another primary is (s)elected as Active Primary HA node and takes over the coordinating role.
- A new primary is chosen when an existing primary is disconnected from the cluster in one of these situations:
  •  (Re)configuring HA on a host
  • Disconnecting a host from the cluster (manually or by failure)
  • Removing a host from the cluster
  • In case of an HA failure
  • Putting a host into maintenance mode
Especially the last bullet makes clear that HA roles are really dynamic in a VI/vSphere environment. This means that you have no control over the physical location of the primary and secondary roles.
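For completeness, this is roughly what such a session looks like (a sketch based on Duncan’s posts, not a verbatim transcript: if I recall correctly, ln lists the nodes with their current roles, and “esx05” is just a made-up hostname):

/opt/vmware/aam/bin/Cli
AAM> ln                    (list all nodes and their primary/secondary role)
AAM> promotenode esx05     (promote the made-up host esx05 to primary)
AAM> ln                    (verify that the role has changed)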

And this is what my post is about:
This situation freaks me out, because in a larger environment with a couple of what I’d like to call “possible failure domains” (any physically separated group of hosts within an HA cluster, such as different blade chassis or different server rooms), you want to have control over the placement of these HA roles.

As I stated earlier, Duncan Epping has some interesting articles, like the HA Deep Dive and “Primary and Secondary nodes, pick one!”, which describe how to select a role for a host. But this selection is not static: whenever a primary host is disconnected (maintenance mode, Reconfigure HA and so on) there is a re-election, and you lose control over the role placement.

So what if all 5 primary HA nodes are in the same “possible failure domain” (say, one blade chassis) and that chassis goes down? Well, you just lost all the HA nodes that know what to do in case of a host failure, so HA won’t work!
We’ll have to nuance the drama a bit: if 5 hosts of a 10-ESX-host cluster go down, you have a major issue anyway, whether HA works or not, because you lost half of your capacity.

But do realize that if HA is configured correctly (the 5 remaining hosts have some resources available, your primaries are spread over the 2 locations, and you have defined the start-up rules for the most important VMs), these important VMs will be booted up.

If you have the same situation as above, but with all 5 primary HA nodes down because they were physically grouped together, HA won’t work and none of the crashed VMs will be booted up automatically!

During VMworld 2009, Marc Sevigny from VMware explained that they were looking into an option that would enable you to pick your primary hosts. This would solve the problem, but until then the only solution is to keep your clusters limited to a total of 8 ESX hosts: 4 ESX hosts per “possible failure domain”. With at most 5 primaries spread over two groups of 4 hosts, at least one primary is guaranteed to end up in each group, so HA survives the loss of either group.
I’m curious if I’m the only one running into this challenge; please let me know!

P.S. Special kudos go to Remon Lam from vminfo.nl, who discovered this “feature” and reviewed the article.

Thursday, August 26, 2010

DSR Direct Routing Loopback Adapters in Windows Server 2008

Recently I ran across an interesting networking issue with a DSR (Direct Server Return) load-balancing setup while converting from Windows Server 2003 to Windows Server 2008. Microsoft changed the way the TCP/IP stack functions: Windows Server 2008 uses the strong host model by default, where Server 2003 behaved as a weak host. So when we went to configure the loopback adapters and set up IIS to run our websites, the server was unreachable. Worse, it had a negative effect on routing to our production servers, since the server in question runs as an auxiliary server for the production sites.


Anyhow, we finally found a great article that chronicles the issue and how to correct it. It all boils down to a couple of commands on the command line, where "net" is the name of the production NIC and "loopback" is the name of the loopback adapter as they appear in Network Connections:

netsh interface ipv4 set interface "net" weakhostreceive=enabled
netsh interface ipv4 set interface "loopback" weakhostreceive=enabled
netsh interface ipv4 set interface "loopback" weakhostsend=enabled
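To verify that the settings took effect (a quick sanity check; substitute your own interface names), you can dump the per-interface parameters, which should list "Weak Host Sends" and "Weak Host Receives" as enabled:

netsh interface ipv4 show interface "loopback"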

Monday, May 24, 2010

esxcfg-vswitch - Virtual Switch Configuration tool

NAME
esxcfg-vswitch - VMware ESX Server Virtual Switch Configuration tool 
COPYRIGHT
VMware ESX Server is Copyright 2006 VMware, Inc. All rights reserved. 
SYNOPSIS
esxcfg-vswitch OPTIONS [VSWITCH] 
DESCRIPTION
esxcfg-vswitch provides an interface for adding, removing, and modifying virtual switches and their settings. By default, there is a single virtual switch called vSwitch0. 

OPTIONS
-a -add 
Add a new virtual switch to the system. It requires a virtual switch name to be provided. 

-d -delete 
Delete a virtual switch. This will fail if any ports on the virtual switch are still in use by VMkernel networks, vswifs, or VMs. 

-l -list 
List all virtual switches and their port groups. 

-L -link 
Add an uplink to a virtual switch. This will attach a new unused physical NIC to a virtual switch. 

-U -unlink 
Remove an uplink from a virtual switch. This will remove a NIC from the uplink list of a virtual switch. If it is the last uplink, physical network connectivity for that switch will be lost. 

-p -pg 
Provide the name of the portgroup for the '--vlan' option. "ALL" can be specified to operate on all portgroups of a virtual switch. 

-v -vlan 
Set the VLAN ID for a specific portgroup of a virtual switch. Using the value "0" will disable VLAN tagging for this portgroup. Requires that the --pg option is also specified. 

-c -check 
Check to see if a virtual switch exists. The program prints a "1" if it exists; otherwise it prints "0". 

-A -add-pg 
Add a new portgroup to a virtual switch with the given name. 

-D -del-pg 
Delete a portgroup. This operation will fail if the portgroup is in use. 

-C -check-pg 
Check to see if the name given is in use for a portgroup. The program prints a "1" if it exists; otherwise prints "0". 

-r -restore 
Used at system startup to restore configuration. This should not be run by users. 

-h -help 
Print a simple help message. 
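
Since -c and -C simply print "1" or "0", they lend themselves to idempotent setup scripts. A minimal sketch (vSwitch1 and Production are placeholder names):

#!/bin/sh
# Create vSwitch1 only if it does not already exist
if [ "$(esxcfg-vswitch -c vSwitch1)" = "0" ]; then
    esxcfg-vswitch -a vSwitch1
fi
# Add the Production portgroup only if the name is not already in use
if [ "$(esxcfg-vswitch -C Production)" = "0" ]; then
    esxcfg-vswitch -A Production vSwitch1
fi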

EXAMPLES
Add a virtual switch called vSwitch1: 
esxcfg-vswitch -a vSwitch1

Add a Portgroup called 'Production' to vSwitch0:
esxcfg-vswitch -A Production vSwitch0

Add a physical network card, vmnic3, to vSwitch0:
esxcfg-vswitch -L vmnic3 vSwitch0

To remove the VLAN ID completely, just set it to 0 (in case you have set it by accident on an access port):
esxcfg-vswitch -v 0 -p "Service Console" vSwitch0

To set a VLAN ID on the Service Console portgroup (in case you forgot to define it during the installation):
esxcfg-vswitch -v X -p "Service Console" vSwitch0 (enter the VLAN number where X is)

Of course, make sure to check which vSwitch the Service Console is on (and the exact name of the Service Console portgroup) with esxcfg-vswitch -l.
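
Putting the examples together, a typical sequence for building a new switch from scratch might look like this (a sketch; vmnic3 and VLAN ID 100 are placeholders for your own environment):

esxcfg-vswitch -a vSwitch1                    (create the switch)
esxcfg-vswitch -L vmnic3 vSwitch1             (attach an unused physical NIC as uplink)
esxcfg-vswitch -A Production vSwitch1         (add a portgroup)
esxcfg-vswitch -v 100 -p Production vSwitch1  (tag the portgroup with VLAN 100)
esxcfg-vswitch -l                             (verify the result)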