Objective 9 – Configure and Administer vSphere Availability Solution

Objective 9.1: Configure Advanced vSphere HA Features

  • Modify vSphere HA advanced cluster settings
  • Configure a network for use with HA heartbeats
    • VMware best practice is to have at least two separate management networks available for HA heartbeats; one is the minimum requirement.
    • When Virtual SAN and vSphere HA are enabled for the same cluster, the HA interagent traffic flows over this storage network rather than the management network. The management network is used by vSphere HA only when Virtual SAN is disabled. vCenter Server chooses the appropriate network when vSphere HA is configured on a host.
      • Virtual SAN can only be enabled when vSphere HA is disabled.
    • The host isolation address check uses the management vmkernel interface by default. A host that stops receiving HA heartbeats pings the host isolation address (the default gateway by default) to check its isolation status; if it cannot ping the isolation address, the host is assumed to be isolated.
    • das.isolationaddress
      • Add additional isolation addresses using das.isolationaddress0-9
    • If you change the value of any of the following advanced options, you must disable and then re-enable vSphere HA before your changes take effect. (A hedged pyVmomi sketch for setting these options follows this list.)
      • das.isolationaddress[…]
      • das.usedefaultisolationaddress
      • das.isolationshutdowntimeout
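The following is a minimal pyVmomi sketch for setting these advanced options programmatically. The vCenter address, credentials, and the cluster name 'Prod-Cluster' are hypothetical placeholders, and the option values are examples only; this is an illustration, not the only way to apply the settings.

```python
import ssl
from pyVim.connect import SmartConnect
from pyVmomi import vim

# Hypothetical connection details; adjust for your environment.
si = SmartConnect(host='vcenter.example.com', user='administrator@vsphere.local',
                  pwd='changeme', sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

# Find the cluster by name ('Prod-Cluster' is a placeholder).
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == 'Prod-Cluster')

# Build a cluster reconfiguration spec that sets HA advanced options.
das = vim.cluster.DasConfigInfo()
das.option = [
    vim.OptionValue(key='das.isolationaddress0', value='192.168.10.254'),
    vim.OptionValue(key='das.usedefaultisolationaddress', value='false'),
]
cluster.ReconfigureComputeResource_Task(
    vim.cluster.ConfigSpecEx(dasConfig=das), modify=True)

# Remember: for these options to take effect, toggle HA off and on again
# (reconfigure with dasConfig.enabled = False, then True).
```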
  • Apply an admission control policy for HA
    • Three types of admission control are available
      • Host – ensures the host has enough resources to satisfy the reservations of all virtual machines running on it
      • Resource pool – ensures the resource pool has enough resources to satisfy the reservations, shares, and limits of all VMs in it
      • vSphere HA – ensures enough resources are reserved in the cluster for VM recovery in the event of a host failure
        • The only type of admission control that can be disabled.
    • Admission control imposes constraints on resource usage and prohibits actions that would violate those constraints, for example:
      • Powering on a VM
      • Migrating a VM to a host, cluster, or resource pool
      • Increasing the CPU or memory reservation of a VM
    • Define failover by number of hosts
      • New in 6.0 is the ability to have the slot size set automatically (largest VM reservation plus overhead) or to specify a fixed slot size
        • If your cluster contains any virtual machines that have much larger reservations than the others, they will distort slot size calculation. To avoid this, you can specify an upper bound for the CPU or memory component of the slot size by using the das.slotcpuinmhz or das.slotmeminmb advanced options, respectively. See “vSphere HA Advanced Options” in the vSphere Availability guide.
      • Size cluster hosts equally; otherwise, reserve excess capacity to account for the unbalanced cluster
      • With the Host Failures Cluster Tolerates policy, vSphere HA performs admission control in the following way:
        • 1 Calculates the slot size. A slot is a logical representation of memory and CPU resources. By default, it is sized to satisfy the requirements for any powered-on virtual machine in the cluster.
        • 2 Determines how many slots each host in the cluster can hold.
        • 3 Determines the Current Failover Capacity of the cluster. This is the number of hosts that can fail and still leave enough slots to satisfy all of the powered-on virtual machines.
        • 4 Determines whether the Current Failover Capacity is less than the Configured Failover Capacity (provided by the user). If it is, admission control disallows the operation.
    • Define failover capacity by reserving a percentage
      • VMware best practice is to use this policy. (A hedged pyVmomi sketch for applying it appears after this admission control list.)
      • With the Cluster Resources Reserved policy, vSphere HA enforces admission control as follows:
        • 1 Calculates the total resource requirements for all powered-on virtual machines in the cluster.
        • 2 Calculates the total host resources available for virtual machines.
        • 3 Calculates the Current CPU Failover Capacity and Current Memory Failover Capacity for the cluster.
        • 4 Determines if either the Current CPU Failover Capacity or Current Memory Failover Capacity is less than the corresponding Configured Failover Capacity (provided by the user). If so, admission control disallows the operation.
      • Less affected by cluster heterogeneity than the Host Failures Cluster Tolerates policy.
    • Define specific failover hosts
      • To ensure that spare capacity is available on a failover host, you are prevented from powering on virtual machines or using vMotion to migrate virtual machines to a failover host. Also, DRS does not use a failover host for load balancing.
      • Avoids resource fragmentation, because an entire host and its resources are held in standby for failover
    • Disable admission control only temporarily, for example when you need to violate the failover constraints because there are not enough resources to support them, such as when placing hosts in standby mode to test them for use with Distributed Power Management (DPM).
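As an illustration of applying the percentage-based admission control policy, here is a hedged pyVmomi sketch. The connection details, cluster name, and the 25% reservation values are placeholder assumptions; a commented-out line shows the slot-based Host Failures Cluster Tolerates alternative.

```python
import ssl
from pyVim.connect import SmartConnect
from pyVmomi import vim

# Hypothetical connection and cluster name; adjust for your environment.
si = SmartConnect(host='vcenter.example.com', user='administrator@vsphere.local',
                  pwd='changeme', sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == 'Prod-Cluster')

das = vim.cluster.DasConfigInfo()
das.admissionControlEnabled = True
# Reserve a percentage of cluster resources for failover (values are placeholders).
das.admissionControlPolicy = vim.cluster.FailoverResourcesAdmissionControlPolicy(
    cpuFailoverResourcesPercent=25, memoryFailoverResourcesPercent=25)
# Alternative: tolerate one host failure using slot-based admission control.
# das.admissionControlPolicy = vim.cluster.FailoverLevelAdmissionControlPolicy(failoverLevel=1)

cluster.ReconfigureComputeResource_Task(
    vim.cluster.ConfigSpecEx(dasConfig=das), modify=True)
```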
  • Enable/disable advanced vSphere HA settings
    • If host monitoring status is disabled, host isolation responses are also suspended
  • Configure different heartbeat datastores for an HA cluster
    • When the master host in a vSphere HA cluster cannot communicate with a slave host over the management network, the master host uses datastore heartbeating to determine whether the slave host has failed, is in a network partition, or is network isolated. If the slave host has stopped datastore heartbeating, it is considered to have failed and its virtual machines are restarted elsewhere.
    • vSphere HA creates a directory at the root of each datastore that is used for both datastore heartbeating and for persisting the set of protected virtual machines. The name of the directory is .vSphere-HA. Do not delete or modify the files stored in this directory, because this can have an impact on operations.
    • Virtual SAN datastore cannot be used for datastore heartbeating. Therefore, if no other shared storage is accessible to all hosts in the cluster, there can be no heartbeat datastores in use. However, if you have storage that can be reached by an alternate network path that is independent of the Virtual SAN network, you can use it to set up a heartbeat datastore.
    • Cluster -> Manage -> Settings -> vSphere HA -> Datastore for Heartbeating
      • Automatically choose (default)
      • Use datastores only from the list
      • Use datastores from the list and complement automatically if needed.
    • You can use the advanced option das.heartbeatdsperhost to change the number of heartbeat datastores selected by vCenter Server for each host. The default is two and the maximum valid value is five. (A hedged pyVmomi sketch for these settings follows this list.)
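A hedged pyVmomi sketch for the heartbeat datastore settings follows. The connection details, the cluster name, the datastore names, and the value 3 for das.heartbeatdsperhost are placeholder assumptions.

```python
import ssl
from pyVim.connect import SmartConnect
from pyVmomi import vim

# Hypothetical connection, cluster, and datastore names; adjust as needed.
si = SmartConnect(host='vcenter.example.com', user='administrator@vsphere.local',
                  pwd='changeme', sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == 'Prod-Cluster')
preferred = [ds for ds in cluster.datastore if ds.name in ('ds-shared-01', 'ds-shared-02')]

das = vim.cluster.DasConfigInfo()
# "Use datastores from the specified list and complement automatically if needed"
das.hBDatastoreCandidatePolicy = 'allFeasibleDsWithUserPreference'
das.heartbeatDatastore = preferred
# Raise the number of heartbeat datastores per host from the default of 2.
das.option = [vim.OptionValue(key='das.heartbeatdsperhost', value='3')]

cluster.ReconfigureComputeResource_Task(
    vim.cluster.ConfigSpecEx(dasConfig=das), modify=True)
```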
  • Apply virtual machine monitoring for a cluster
    • Cluster -> Manage -> Settings -> vSphere HA -> Virtual Machine Monitoring
      • VM monitoring restarts individual VMs if their VMware Tools heartbeats are not received within a set time. Application monitoring restarts individual VMs if their in-guest application heartbeats are not received within a set time.
      • You can also specify custom values for both monitoring sensitivity and the I/O stats interval by selecting the Custom checkbox. (A hedged pyVmomi sketch for these settings follows this list.)
      • Table 2-1. VM Monitoring Settings
      • Setting; Failure Interval (seconds); Reset Period
        • High; 30; 1 hour
        • Medium; 60; 24 hours
        • Low; 120; 7 days
    • During an isolation response, virtual machines that have not completed a guest shutdown within 300 seconds (or the time specified in the advanced option das.isolationshutdowntimeout) are powered off.
    • Occasionally, virtual machines or applications that are still functioning properly stop sending heartbeats. To avoid unnecessary resets, the VM Monitoring service also monitors a virtual machine’s I/O activity. If no heartbeats are received within the failure interval, the I/O stats interval (a cluster-level attribute) is checked. The I/O stats interval determines if any disk or network activity has occurred for the virtual machine during the previous two minutes (120 seconds). If not, the virtual machine is reset. This default value (120 seconds) can be changed using the advanced option das.iostatsinterval.
    • Application monitoring requires an application instrumented with the appropriate VMware SDK (or an application that already supports application monitoring), with customized heartbeats set up for the application being monitored
    • If a virtual machine has a datastore accessibility failure (either All Paths Down or Permanent Device Loss), the VM Monitoring service suspends resetting it until the failure has been addressed.
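The sketch below shows how the VM monitoring defaults described above might be set with pyVmomi. Connection details and the cluster name are placeholders, and the numeric values simply mirror the "High" sensitivity defaults listed in the table.

```python
import ssl
from pyVim.connect import SmartConnect
from pyVmomi import vim

# Hypothetical connection and cluster name; adjust for your environment.
si = SmartConnect(host='vcenter.example.com', user='administrator@vsphere.local',
                  pwd='changeme', sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == 'Prod-Cluster')

das = vim.cluster.DasConfigInfo()
das.vmMonitoring = 'vmMonitoringOnly'  # or 'vmAndAppMonitoring' for application monitoring
das.defaultVmSettings = vim.cluster.DasVmSettings(
    vmToolsMonitoringSettings=vim.cluster.VmToolsMonitoringSettings(
        vmMonitoring='vmMonitoringOnly',
        failureInterval=30,      # "High" sensitivity failure interval (seconds)
        minUpTime=120,           # minimum uptime before heartbeats are evaluated
        maxFailures=3,           # maximum resets per VM
        maxFailureWindow=3600))  # resets time window (seconds); 3600 = 1 hour
# Optional: explicitly set the I/O stats interval (default is 120 seconds).
das.option = [vim.OptionValue(key='das.iostatsinterval', value='120')]

cluster.ReconfigureComputeResource_Task(
    vim.cluster.ConfigSpecEx(dasConfig=das), modify=True)
```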
  • Configure Virtual Machine Component Protection (VMCP) settings
    • To use VMCP, the cluster must contain ESXi 6.0 or later hosts
    • From page 14 of the vSphere Availability guide:
      • “A virtual machine “split-brain” condition can occur when a host becomes isolated or partitioned from a master host and the master host cannot communicate with it using heartbeat datastores. In this situation, the master host cannot determine that the host is alive and so declares it dead. The master host then attempts to restart the virtual machines that are running on the isolated or partitioned host. This attempt succeeds if the virtual machines remain running on the isolated/partitioned host and that host lost access to the virtual machines’ datastores when it became isolated or partitioned. A split-brain condition then exists because there are two instances of the virtual machine. However, only one instance is able to read or write the virtual machine’s virtual disks. VM Component Protection can be used to prevent this split-brain condition. When you enable VMCP with the aggressive setting, it monitors the datastore accessibility of powered-on virtual machines, and shuts down those that lose access to their datastores.”
    • PDL failures
      • A virtual machine is automatically failed over to a new host unless you have configured VMCP only to Issue events.
      • Three possible setting options for “Response for Datastore with Permanent Device Loss (PDL)”
        • Disabled
        • Issue events
        • Power off and restart VMs
    • APD events
      • The response to APD events is more complex and accordingly the configuration is more fine-grained. After the user-configured Delay for VM failover for APD period has elapsed, the action taken depends on the policy you selected. An event will be issued and the virtual machine is restarted conservatively or aggressively. The conservative approach does not terminate the virtual machine if the success of the failover is unknown, for example in a network partition. The aggressive approach does terminate the virtual machine under these conditions. Neither approach terminates the virtual machine if there are insufficient resources in the cluster for the failover to succeed. If APD recovers before the user-configured Delay for VM failover for APD period has elapsed, you can choose to reset the affected virtual machines, which recovers the guest applications that were impacted by the IO failures.
      • Four possible settings for “Response for Datastore with All Paths Down (APD)”
        • Disabled
        • Issue Events
        • Power off and restart VMs (conservative)
        • Power off and restart VMs (aggressive)
    • VMCP does not protect virtual machines that use vSphere Fault Tolerance.
    • VMCP does not detect or respond to accessibility issues for Virtual SAN or Virtual Volumes (VVols) datastores.
    • VMCP does not protect against inaccessible raw device mappings (RDMs).
    • Cluster -> Manage -> Settings -> vSphere HA -> enable “Protect against storage connectivity loss”, then set the appropriate options under Failure conditions and VM response. (A hedged pyVmomi sketch for these settings follows this list.)
      • PDL
        • Disabled (VMCP does nothing)
        • Issue events
        • Power off and restart VMs
      • APD
        • Disabled (VMCP does nothing)
        • Issue events
        • Power off and restart VMs (conservative)
        • Power off and restart VMs (aggressive)
      • Delay for VM failover for APD
        • Number of minutes to wait after the APD timeout before the affected VMs are failed over
      • Response to APD after APD timeout elapses
        • What to do if APD recovers after the APD timeout but before the VMs are restarted; it provides a way to reset VMs that may be in an inconsistent state due to the I/O interruption.
        • Disabled (VMCP does nothing)
        • Reset VMs
    • VM monitoring sensitivity – defaults to High, but custom values can be set for:
      • Failure interval (default 30 seconds)
      • Minimum uptime (default 120 seconds)
      • Maximum per-VM resets (default 3)
      • Maximum resets time window (default within 1 hour)
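A hedged pyVmomi sketch for the VMCP settings above follows. Connection details and the cluster name are placeholders; the chosen responses (aggressive for PDL, conservative for APD, a 180-second failover delay, reset on APD recovery) are example selections, not recommendations.

```python
import ssl
from pyVim.connect import SmartConnect
from pyVmomi import vim

# Hypothetical connection and cluster name; adjust for your environment.
si = SmartConnect(host='vcenter.example.com', user='administrator@vsphere.local',
                  pwd='changeme', sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == 'Prod-Cluster')

das = vim.cluster.DasConfigInfo()
das.vmComponentProtecting = 'enabled'  # "Protect against storage connectivity loss"
vmcp = vim.cluster.VmComponentProtectionSettings(
    vmStorageProtectionForPDL='restartAggressive',    # PDL: power off and restart VMs
    vmStorageProtectionForAPD='restartConservative',  # APD: power off and restart VMs (conservative)
    vmTerminateDelayForAPDSec=180,                    # delay for VM failover for APD (seconds)
    vmReactionOnAPDCleared='reset')                   # response if APD clears after the timeout
das.defaultVmSettings = vim.cluster.DasVmSettings(vmComponentProtectionSettings=vmcp)

cluster.ReconfigureComputeResource_Task(
    vim.cluster.ConfigSpecEx(dasConfig=das), modify=True)
```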
  • Explain how vSphere HA communicates with Distributed Resource Scheduler and Distributed Power Management
    • If you are using the vSphere Distributed Power Management (DPM) feature, in addition to migration recommendations, DRS provides host power state recommendations
    • vSphere HA can use DRS to try to adjust the cluster (for example, by bringing hosts out of standby mode or migrating virtual machines to defragment the cluster resources) so that HA can perform the failovers. If DPM is in manual mode, you might need to confirm host power-on recommendations. Similarly, if DRS is in manual mode, you might need to confirm migration recommendations.
    • When you edit a DRS affinity rule, select the checkbox or checkboxes that enforce the desired failover behavior for vSphere HA.
      • HA must respect VM anti-affinity rules during failover: if VMs with this rule would be placed together, the failover is aborted.
      • HA should respect VM to Host affinity rules during failover: vSphere HA attempts to place VMs with this rule on the specified hosts if at all possible.

Tools

Objective 9.2: Configure Advanced vSphere DRS Features

  • Configure VM-Host affinity/anti-affinity rules
    • When you add or edit an affinity rule, and the cluster’s current state is in violation of the rule, the system continues to operate and tries to correct the violation.
    • Create VM/Host groups first, then create the rules. (A hedged pyVmomi sketch illustrating this appears after this list.)
    • There are ‘required’ rules (designated by “must”) and ‘preferential’ rules (designated by “should”.)
    • If a virtual machine is removed from the cluster, it loses its DRS group affiliation, even if it is later returned to the cluster.
    • If you create more than one VM-Host affinity rule, the rules are not ranked, but are applied equally. Be aware that this has implications for how the rules interact. For example, a virtual machine that belongs to two DRS groups, each of which belongs to a different required rule, can run only on hosts that belong to both of the host DRS groups represented in the rules.
    • When you create a VM-Host affinity rule, its ability to function in relation to other rules is not checked. So it is possible for you to create a rule that conflicts with the other rules you are using. When two VM-Host affinity rules conflict, the older one takes precedence and the newer rule is disabled. DRS only tries to satisfy enabled rules and disabled rules are ignored.
    • DRS, vSphere HA, and vSphere DPM never take any action that results in the violation of required affinity rules (those where the virtual machine DRS group ‘must run on’ or ‘must not run on’ the host DRS group). Accordingly, you should exercise caution when using this type of rule because of its potential to adversely affect the functioning of the cluster. If improperly used, required VM-Host affinity rules can fragment the cluster and inhibit the proper functioning of DRS, vSphere HA, and vSphere DPM.
    • A number of cluster functions are not performed if doing so would violate a ‘required’ affinity rule.
      • DRS does not evacuate virtual machines to place a host in maintenance mode.
      • DRS does not place virtual machines for power-on or load balance virtual machines.
      • vSphere HA does not perform failovers.
      • vSphere DPM does not optimize power management by placing hosts into standby mode.
    • To avoid these situations, exercise caution when creating more than one required affinity rule or consider using VM-Host affinity rules that are preferential only (those where the virtual machine DRS group ‘should run on’ or ‘should not run on’ the host DRS group). Ensure that the number of hosts in the cluster with which each virtual machine is affined is large enough that losing a host does not result in a lack of hosts on which the virtual machine can run. Preferential rules can be violated to allow the proper functioning of DRS, vSphere HA, and vSphere DPM.
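To illustrate the "groups first, then rules" workflow, here is a hedged pyVmomi sketch that creates a VM group, a host group, and a preferential ("should run on") VM-Host rule. The connection details, cluster, VM, and host names, and the group/rule names are all hypothetical.

```python
import ssl
from pyVim.connect import SmartConnect
from pyVmomi import vim

# Hypothetical connection, cluster, VM, and host names; adjust for your environment.
si = SmartConnect(host='vcenter.example.com', user='administrator@vsphere.local',
                  pwd='changeme', sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()
cl_view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in cl_view.view if c.name == 'Prod-Cluster')
vm_view = content.viewManager.CreateContainerView(cluster, [vim.VirtualMachine], True)
vms = [vm for vm in vm_view.view if vm.name in ('web01', 'web02')]
hosts = [h for h in cluster.host if h.name in ('esx01.example.com', 'esx02.example.com')]

spec = vim.cluster.ConfigSpecEx()
# Create the VM and host DRS groups first...
spec.groupSpec = [
    vim.cluster.GroupSpec(operation='add', info=vim.cluster.VmGroup(name='web-vms', vm=vms)),
    vim.cluster.GroupSpec(operation='add', info=vim.cluster.HostGroup(name='rack1-hosts', host=hosts)),
]
# ...then the rule. mandatory=False is a preferential "should run on" rule;
# mandatory=True would make it a required "must run on" rule.
rule = vim.cluster.VmHostRuleInfo(name='web-should-run-rack1', enabled=True, mandatory=False,
                                  vmGroupName='web-vms', affineHostGroupName='rack1-hosts')
spec.rulesSpec = [vim.cluster.RuleSpec(operation='add', info=rule)]

cluster.ReconfigureComputeResource_Task(spec, modify=True)
```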
  • Configure VM-VM affinity/anti-affinity rules
    • When you add or edit an affinity rule, and the cluster’s current state is in violation of the rule, the system continues to operate and tries to correct the violation.
    • Specifies affinity or anti-affinity between individual virtual machines
    • Web Client -> Cluster -> Manage -> Settings -> Configuration -> VM/Host Rules. (A hedged pyVmomi sketch for creating an anti-affinity rule follows this list.)
    • You can create and use multiple VM-VM affinity rules, however, this might lead to situations where the rules conflict with one another.
    • If two VM-VM affinity rules are in conflict, you cannot enable both. For example, if one rule keeps two virtual machines together and another rule keeps the same two virtual machines apart, you cannot enable both rules. Select one of the rules to apply and disable or remove the conflicting rule.
    • When two VM-VM affinity rules conflict, the older one takes precedence and the newer rule is disabled. DRS only tries to satisfy enabled rules and disabled rules are ignored. DRS gives higher precedence to preventing violations of anti-affinity rules than violations of affinity rules.
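A hedged pyVmomi sketch for creating a VM-VM anti-affinity rule follows. The connection details, cluster name, VM names, and rule name are hypothetical placeholders.

```python
import ssl
from pyVim.connect import SmartConnect
from pyVmomi import vim

# Hypothetical connection, cluster, and VM names; adjust for your environment.
si = SmartConnect(host='vcenter.example.com', user='administrator@vsphere.local',
                  pwd='changeme', sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()
cl_view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in cl_view.view if c.name == 'Prod-Cluster')
vm_view = content.viewManager.CreateContainerView(cluster, [vim.VirtualMachine], True)
vms = [vm for vm in vm_view.view if vm.name in ('db01', 'db02')]

# Keep the two database VMs on different hosts (anti-affinity).
rule = vim.cluster.AntiAffinityRuleSpec(name='separate-db-nodes', enabled=True, vm=vms)
spec = vim.cluster.ConfigSpecEx(
    rulesSpec=[vim.cluster.RuleSpec(operation='add', info=rule)])
cluster.ReconfigureComputeResource_Task(spec, modify=True)
```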
  • Enable/disable Distributed Resource Scheduler (DRS) affinity rules
    • Web Client -> Cluster -> Manage -> Settings -> Configuration -> VM/Host Rules -> edit the rule. Clear the check box to disable it.
  • Configure the proper Distributed Resource Scheduler (DRS) automation level based on a set of business requirements
    • If the cluster or any of the virtual machines involved are manual or partially automated, vCenter Server does not take automatic actions to balance resources. Instead, the Summary page indicates that migration recommendations are available and the DRS Recommendations page displays recommendations for changes that make the most efficient use of resources across the cluster.
    • If the cluster and virtual machines involved are all fully automated, vCenter Server migrates running virtual machines between hosts as needed to ensure efficient use of cluster resources. (A hedged pyVmomi sketch for setting automation levels follows below.)
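The sketch below illustrates setting the cluster-wide DRS automation level with a per-VM override, using pyVmomi. The connection details, cluster name, and the VM name 'erp01' are hypothetical; fully automated with a manual override for one business-critical VM is just one example mapping of business requirements to automation levels.

```python
import ssl
from pyVim.connect import SmartConnect
from pyVmomi import vim

# Hypothetical connection, cluster, and VM name; adjust for your environment.
si = SmartConnect(host='vcenter.example.com', user='administrator@vsphere.local',
                  pwd='changeme', sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()
cl_view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in cl_view.view if c.name == 'Prod-Cluster')
vm_view = content.viewManager.CreateContainerView(cluster, [vim.VirtualMachine], True)
critical_vm = next(vm for vm in vm_view.view if vm.name == 'erp01')

# Cluster default: fully automated DRS with the default migration threshold.
drs = vim.cluster.DrsConfigInfo(enabled=True, defaultVmBehavior='fullyAutomated',
                                vmotionRate=3, enableVmBehaviorOverrides=True)
# Per-VM override: keep one business-critical VM in manual mode.
override = vim.cluster.DrsVmConfigSpec(
    operation='add',
    info=vim.cluster.DrsVmConfigInfo(key=critical_vm, behavior='manual'))

spec = vim.cluster.ConfigSpecEx(drsConfig=drs, drsVmConfigSpec=[override])
cluster.ReconfigureComputeResource_Task(spec, modify=True)
```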
  • Explain how DRS affinity rules affect virtual machine placement
    • During a group power-on, the initial placement recommendations are provided on a per-cluster basis. If all of the placement-related actions for a group power-on attempt are in automatic mode, the virtual machines are powered on with no initial placement recommendation given. If placement-related actions for any of the virtual machines are in manual mode, the powering on of all of the virtual machines (including those that are in automatic mode) is manual and is included in an initial placement recommendation.
    • When you first power on a virtual machine in the cluster, DRS attempts to maintain proper load balancing by either placing the virtual machine on an appropriate host or making a recommendation.

Tools