Measuring and preventing vSphere resource over-commitment

Recently I was asked to perform a VMware environment review for a financial services customer of ours. This involved reviewing a number of settings across the estate, one of which was resource utilisation. The biggest challenge was to ensure that none of the customer’s clusters were over-subscribed, which would mean that in the event of an HA failover, virtual machines would fail to power on as expected.

Thankfully, the review revealed no major issues.

Later in the week a question came up in an internal training course… how exactly do we measure and prevent resource over-commitment?

Note: if you’re an experienced vSphere Administrator or Designer then nothing here will be news to you. Move along, and follow the Yellow Brick road.

A properly configured cluster should not only be enabled for High Availability, but should have Admission Control enabled too. The idea is that you tell Admission Control what spare capacity you would like to reserve, and it will prevent you from powering on virtual machines that violate that constraint.

We typically configure our clusters with “host failures the cluster tolerates = 1”, as shown in the following example:

20151104 - 1

Here we are effectively saying “keep one host’s worth of compute as standby”. But what exactly is one “host’s worth”? In the London cluster that might be 20GHz worth of CPU and 192GB of RAM. In the Amsterdam cluster it might mean half of that.

Clearly each cluster can be different, but the method of calculating that capacity remains the same, and it uses “slots”.

Slots are just an arbitrary name for a unit of measurement. For all we care, they might as well be called caramels.

A slot should be considered a worst-case scenario: it reflects the size of the largest CPU reservation and the largest memory reservation you have defined. Do you have twenty VMs with 1GHz CPU reserved, and one with 4GHz? Then the slot is 4GHz (of CPU). Do you have ten VMs with a 4GB memory reservation and one with 2GB? Then the slot is 4GB.

If no reservations are set, then the defaults are used. In vSphere 5.5, that means a slot size of 32MHz and 0MB (plus VM overhead). These can be configured manually, but it’s advisable to only do this if you know what you’re doing.
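To make the slot size rule concrete, here is a minimal sketch in plain PowerShell (no vCenter connection needed). Get-SlotSize is a hypothetical helper of mine, not a PowerCLI cmdlet, and it deliberately ignores the per-VM memory overhead mentioned above.

```powershell
# Hypothetical helper: derive an HA slot size from a list of per-VM reservations.
function Get-SlotSize {
    param(
        [int[]]$CpuReservationMHz,
        [int[]]$MemReservationMB,
        [int]$DefaultCpuMHz = 32,   # vSphere 5.5 default, as described above
        [int]$DefaultMemMB = 0      # per-VM memory overhead is ignored in this sketch
    )

    # The slot is the largest reservation of each kind (the worst case)
    $cpu = ($CpuReservationMHz | Measure-Object -Maximum).Maximum
    $mem = ($MemReservationMB | Measure-Object -Maximum).Maximum

    # Fall back to the defaults when no reservations are set
    if (-not $cpu) { $cpu = $DefaultCpuMHz }
    if (-not $mem) { $mem = $DefaultMemMB }

    [PSCustomObject]@{ SlotCpuMHz = [int]$cpu; SlotMemoryMB = [int]$mem }
}

# Twenty VMs reserving 1GHz plus one reserving 4GHz;
# ten VMs reserving 4GB plus one reserving 2GB
$cpuReservations = @(1000) * 20 + 4000
$memReservations = @(4096) * 10 + 2048
Get-SlotSize -CpuReservationMHz $cpuReservations -MemReservationMB $memReservations
```

With the figures from the example this gives a slot of 4000MHz and 4096MB. Calling Get-SlotSize with no reservations at all falls back to 32MHz and 0MB.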

Once we know the slot size, and the aggregate computing power of the cluster, we can calculate how many slots are in use, how many are available, and how many HA keeps in reserve.

Problems arise when Admission Control is disabled. That effectively hobbles HA, and you begin to eat into your reserve. So what happens if an HA failover occurs?

The simple answer is that not everything will start up. So unless you have configured your VMs with a restart priority, you might find the VDI desktop down in the post room comes back, but your Exchange server doesn’t.
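The slot arithmetic can be sketched in plain PowerShell too. This is a simplified model, not HA’s actual implementation: it assumes each host holds the smaller of its CPU and memory slot counts (rounded down), and that the hosts which fail are the largest ones. Get-SlotCapacity and the three-host London cluster are made up for illustration.

```powershell
# Hypothetical helper: how many slots a cluster has, and how many are held back for failover.
function Get-SlotCapacity {
    param(
        [Parameter(Mandatory)] [object[]]$Hosts,   # objects with CpuMHz and MemoryMB properties
        [Parameter(Mandatory)] [int]$SlotCpuMHz,
        [Parameter(Mandatory)] [int]$SlotMemoryMB,
        [int]$HostFailuresTolerated = 1
    )

    # Slots per host: the smaller of the CPU and memory results, each rounded down
    $perHost = foreach ($h in $Hosts) {
        $cpuSlots = [math]::Floor($h.CpuMHz / $SlotCpuMHz)
        # A 0MB memory slot places no memory limit on the host in this sketch
        $memSlots = if ($SlotMemoryMB -gt 0) { [math]::Floor($h.MemoryMB / $SlotMemoryMB) } else { $cpuSlots }
        [int][math]::Min($cpuSlots, $memSlots)
    }

    # Worst case: assume the largest hosts are the ones that fail
    $sorted   = @($perHost | Sort-Object -Descending)
    $total    = ($sorted | Measure-Object -Sum).Sum
    $reserved = ($sorted | Select-Object -First $HostFailuresTolerated | Measure-Object -Sum).Sum

    [PSCustomObject]@{
        TotalSlots    = [int]$total
        ReservedSlots = [int]$reserved
        UsableSlots   = [int]($total - $reserved)
    }
}

# Three hosts of 20GHz and 192GB each, with a 4GHz / 4GB slot
$londonHosts = 1..3 | ForEach-Object { [PSCustomObject]@{ CpuMHz = 20000; MemoryMB = 196608 } }
Get-SlotCapacity -Hosts $londonHosts -SlotCpuMHz 4000 -SlotMemoryMB 4096
```

Each host here is limited by CPU to 5 slots, so the cluster has 15 slots in total, 5 of which are kept in reserve, leaving 10 for running VMs. Note how a single large CPU reservation drags the whole cluster’s slot count down.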

To view cluster slot sizes, select the cluster summary tab and click Advanced Runtime Info:

20151104 - 2

Alternatively you can use the following code (substitute accordingly):

# Variables

$vc = "vcsa.lab.mdb-lab.com"
$credential = Get-Credential

# Connect to vCenter
Connect-VIServer $vc -Credential $credential

# Collect the HA slot details for every cluster
$SlotInfo = foreach ($Cluster in (Get-Cluster | Get-View)) {
	# Same figures as shown under Advanced Runtime Info in the client
	$SlotDetails = $Cluster.RetrieveDasAdvancedRuntimeInfo()
	[PSCustomObject]@{
		Cluster        = $Cluster.Name
		TotalSlots     = $SlotDetails.TotalSlots
		UsedSlots      = $SlotDetails.UsedSlots
		AvailableSlots = $SlotDetails.UnreservedSlots
		SlotNumvCPUs   = $SlotDetails.SlotInfo.NumvCpus
		SlotCPUMHz     = $SlotDetails.SlotInfo.CpuMHz
		SlotMemoryMB   = $SlotDetails.SlotInfo.MemoryMB
	}
}
$SlotInfo

# Disconnect from vCenter
Disconnect-VIServer $vc -Confirm:$false

Over-commitment

Unless you like surprises and being shouted at by customers, it’s best to get a handle on any clusters which may be over-committed. If you find one, flag it as a risk and add more resources quickly, before an outage comes along and prevents you from powering on your critical VMs.

The best place to start is to look for clusters with Admission Control disabled. If you have a large estate, use the following PowerCLI script to help find them:

# Variables
$infile = "C:\vcenters.csv"
$outfile = "C:\outfile.csv"

# Import file containing vCenter details (columns: vcenter, user, pass)
Import-Csv $infile | ForEach-Object {

	# Connect to vCenter
	$server = Connect-VIServer $_.vcenter -User $_.user -Password $_.pass

	# Retrieve Admission Control details, tagging each row with its vCenter
	Get-Cluster -Server $server | Select-Object @{N="vCenter";E={$server.Name}},Name,HAAdmissionControlEnabled

	# Disconnect from vCenter
	Disconnect-VIServer $server -Confirm:$false

# Write the results from all vCenters to a single file
} | Export-Csv $outfile -NoTypeInformation

You’ll need to create a CSV file containing your vCenters and their connection details. The script expects three columns named vcenter, user and pass.
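The export lists every cluster, so you will probably want to filter it down to the ones with Admission Control disabled. A minimal sketch, assuming the output file has the Name and HAAdmissionControlEnabled columns selected in the script; the sample rows below are made up and stand in for the real file.

```powershell
# Hypothetical sample standing in for the exported file; the rows are made up
$sample = @'
"Name","HAAdmissionControlEnabled"
"London","True"
"Amsterdam","False"
'@ | ConvertFrom-Csv

# CSV values come back as strings, so compare against the text 'False'
$disabled = $sample | Where-Object { $_.HAAdmissionControlEnabled -eq 'False' }
$disabled | Select-Object Name
```

Against a real export, swap the sample for Import-Csv $outfile and keep the rest the same.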

Remember that disabling Admission Control is sometimes unavoidable, especially when you (or a customer) wish to provision more virtual machines than the resources will allow. This is far from ideal, so ensure your HA restart priority is configured correctly to avoid the guy in the post room getting his desktop at the expense of everyone else’s email.
