VMworld 2016 Europe: INF8430 – vSphere 6.x Host Resource Deep Dive

Customers continue to demand more from their existing resources.

In this session, Niels Hagoort from HIC and Frank Denneman from VMware discuss advanced compute, storage and network design with a view to achieving just that.

Getting started

Frank begins the session by stating that whilst at Pernix he had access to 90,000 servers, 85% of which were dual-socket. The memory configuration of these servers was often 256, 384 or 512GB of RAM.

Modern dual-socket servers are NUMA (Non-Uniform Memory Access) systems. Each CPU has an integrated memory controller with four channels attached; memory reached through these channels is called “local memory”.

Between the two CPUs sits the QPI (QuickPath Interconnect), which allows a CPU to access memory attached to the other CPU (remote access). This introduces latency and bandwidth penalties, which is why the architecture is referred to as “non-uniform”.
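As an aside, the physical layout ESXi detects can be checked from the host with esxcli; a minimal sketch (output fields vary slightly between ESXi builds):

  # Physical CPU packages (sockets), cores and threads as seen by ESXi.
  # Each socket normally corresponds to one NUMA node with its own
  # local memory behind the integrated memory controller.
  esxcli hardware cpu global get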

NUMA Focus Points

  • Cache snoop modes
  • DIMM configuration
  • VM sizing to match the CPU technology

Each CPU core has a private level 1 and level 2 cache that only that core can access. There is another (last-level) cache that is shared amongst all cores. When a core needs to access data held by another core, it needs to verify that the data is recent, so it “snoops” around.

To allow traffic to move across all cores, there is a ring topology in place (shown in red below). It is feasible that communication could flow from core 1 to core 23, which is also non-uniform:

Image courtesy of Frank Denneman

To increase efficiency, Intel introduced snoop modes. An example is the “Home Snoop” mode, which uses a kind of directory to keep track of where data is held. This prevents snoop broadcast storms.

Cluster-on-Die

There are multiple core count configurations:

  • Low
  • Medium
  • High

With the medium and high core count variants you can enable “Cluster-on-Die”. This separates the two rings on the die and creates two NUMA domains within a single socket.
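To confirm the effect on an ESXi host, check the NUMA node count it reports; with Cluster-on-Die enabled a dual-socket server should show four NUMA nodes rather than two (a quick sketch, assuming the BIOS option has been applied):

  # "NUMA Node Count" should read 4 on a dual-socket host with
  # Cluster-on-Die enabled, rather than the usual 2
  esxcli hardware memory get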

DIMM configuration

The following diagram shows two NUMA nodes. Each node has a memory controller with four channels:

Image courtesy of Frank Denneman

Image courtesy of Frank Denneman

With a memory module on each of the four channels, the DIMMs are interleaved into a region, so the bandwidth of all four channels can be consumed. If you want to use 16GB DIMMs (currently the most popular), but also wish to have 384GB, then you get this:

Image courtesy of Frank Denneman
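To see why this configuration works out to three DIMMs per channel, the arithmetic is straightforward (assuming a dual-socket host with four channels per socket, as pictured):

  384 GB / 16 GB per DIMM = 24 DIMMs
  24 DIMMs / 2 sockets    = 12 DIMMs per socket
  12 DIMMs / 4 channels   = 3 DIMMs per channel (3 DPC)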

Despite using 2400MHz DIMMs, you will not get the full bandwidth. This is because the additional ranks increase the electrical load on the memory bus, forcing the memory to clock down. In short, try to avoid a 3 DIMMs-per-channel configuration.
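To check whether the DIMMs in an existing host have actually clocked down, the SMBIOS data can be inspected from the ESXi shell; a rough sketch (output formatting differs per hardware vendor):

  # Dump the host's SMBIOS tables and look at the memory device
  # entries for the rated versus configured DIMM speed
  smbiosDump | grep -i speed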

However, if your design dictates 384GB of RAM, it is best to reduce the number of ranks. To achieve 384GB you would use the following design:

Image courtesy of Frank Denneman

However, whilst performance is increased, the differing hardware could lead to other complications (uniformity is better). It would therefore be better to scale up to 512GB of RAM:

Image courtesy of Frank Denneman

Right-size your VM

In ESXi, the CPU scheduler assigns CPU time to virtual machines, but the NUMA scheduler performs the initial placement. Therefore vCPU design impacts your initial placement and load-balancing.
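To see where the NUMA scheduler has placed a VM, and how local its memory is, esxtop is the quickest tool; a minimal sketch (the column names come from the memory view's NUMA statistics fields):

  esxtop   # press 'm' for the memory view, then enable the NUMA statistics fields
  # Useful columns:
  #   NHN  - the NUMA home node(s) assigned to the VM
  #   NMIG - number of NUMA migrations since power-on
  #   N%L  - percentage of the VM's memory that is local
  # A right-sized VM should show N%L at or close to 100.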

Storage

Data access in terms of latency:

  • CPU register
  • L1/L3 cache
  • Local memory
  • Disk

Every effort should be made to minimise this latency. The industry is moving to NVMe, although unfortunately NVMe performance currently exceeds controller bandwidth.

Not all storage drivers are created equal. In one test with an Intel P3700 NVMe device, Frank used the “inbox” (default) driver shipped with ESXi, and then ran the same test with the async Intel driver. The results were radically different, in favour of the Intel driver.
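As an illustration of how to check which driver a storage adapter is actually using, and whether an async vendor driver is installed, something like the following can be run from the ESXi shell (the module name "nvme" is only an example):

  # List HBAs with the driver each one is bound to
  esxcli storage core adapter list
  # Show version details for a given driver module (name is an example)
  esxcli system module get -m nvme
  # Check which driver VIBs are installed (e.g. an async vendor driver)
  esxcli software vib list | grep -i nvme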

Networking

VMware’s Software-defined Networking product, NSX, uses VXLAN. This adds an additional layer of packet processing and therefore consumes more CPU cycles. To prevent the CPU from being taxed, VXLAN offloading can be used.

Traditional offloading is TCP-based, which will not work here because the TCP traffic is encapsulated inside VXLAN.

When purchasing new hardware, look for network virtualisation features such as SR-IOV and VXLAN offloading. Also check that the driver stack supports the features you need, not just the hardware. To do this, use esxcli to list the driver your NIC uses, ethtool to view the driver version, and the VMware HCL to verify support. Finally, verify that the features you require are enabled by default in the driver.
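A sketch of that workflow from the ESXi shell (vmnic0 is just an example; on newer ESXi builds esxcli network nic get has replaced ethtool):

  # List physical NICs and the driver each one uses
  esxcli network nic list
  # Show driver and firmware version details for a NIC
  esxcli network nic get -n vmnic0
  # On older builds, ethtool reports the same driver information
  ethtool -i vmnic0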

Receive Side Scaling (RSS) and NetQueue both require support in the physical NIC. RSS distributes incoming traffic across multiple receive queues using a hash (IP/TCP/MAC address).

Image courtesy of Niels Hagoort / VMware
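Whether and how RSS can be enabled is driver-specific; a hedged sketch using module parameters (the "ixgbe" driver and the "RSS" parameter are examples only and differ per vendor):

  # List the parameters a NIC driver module exposes and look for
  # an RSS-related setting (parameter names vary per driver)
  esxcli system module parameters list -m ixgbe
  # Example only: setting an RSS parameter on a driver module
  # (consult the vendor/VMware documentation for the exact name and value)
  esxcli system module parameters set -m ixgbe -p "RSS=4"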

Wrap up

That was an extremely technical session focusing on advanced topics. Anyone preparing their VCDX design should bookmark the VMworld playback of this session for future reference.
