Customers continue to demand more from their existing resources.
In this session, Niels Hagoort from HIC and Frank Denneman from VMware discuss advanced compute, storage and network design with a view to achieving just that.
Getting started
Frank begins the session by stating that whilst at PernixData he had access to 90,000 servers, 85% of which were dual-socket. The memory configuration of these servers was often 256, 384 or 512GB of RAM.
Modern dual-socket servers are NUMA systems. Each CPU has an integrated memory controller with four memory channels attached; memory reached through a CPU's own controller is called "local memory".
Between the two CPUs sits the QPI (QuickPath Interconnect), which allows one CPU to access memory attached to the other CPU (remote access). Remote access adds latency and reduces bandwidth, which is why the architecture is called "non-uniform".
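As a quick sanity check, the number of NUMA nodes ESXi detects on a host can be inspected from the ESXi shell. A minimal sketch, assuming shell access to the host (output fields can differ between builds):

```
# Report physical memory details, including the NUMA node count ESXi sees
esxcli hardware memory get
```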
NUMA Focus Points
- Cache snoop modes
- DIMM configuration
- Sizing VMs to match the CPU technology
Each CPU core has private level 1 and level 2 caches that only that core can access. There is another (last-level) cache that is shared amongst all cores. When a core needs data held by another core, it must verify that the data is current, so it "snoops" around.
To allow this traffic to move between all cores, a ring topology is in place (shown in red below). Communication could feasibly have to travel from core 1 all the way to core 23, which is also non-uniform:

Image courtesy of Frank Denneman
To increase efficiency, Intel introduced snoop modes. An example of this is the "Home Snoop" mode, which uses a kind of directory to keep track of which cache holds what data, avoiding the need to broadcast snoop requests to every core.
Cluster-on-Die
There are multiple core count configurations:
- Low
- Medium
- High
With medium and high core counts you can enable "Cluster-on-Die". This splits the ring into two and creates two NUMA domains within a single socket.
DIMM configuration
The following diagram shows two NUMA nodes. Each node has a memory controller with four channels:

Image courtesy of Frank Denneman
Populating one memory module in each of the four channels creates a region, and the bandwidth of all four channels can be consumed. If you want to use 16GB DIMMs (currently the most popular), but also wish to have 384GB, then you end up with three DIMMs per channel:

Image courtesy of Frank Denneman
Despite using 2400MHz DIMMs, you will not get their full bandwidth. This is due to the increased number of ranks and the electrical load on the channel, which force the memory to run at a lower speed. In short, try to avoid a 3 DIMM-per-channel configuration.
However, if your design dictates 384GB of RAM, then it is best to reduce the number of ranks per channel. With two DIMMs per channel (16 slots in total), 16GB modules give 256GB and 32GB modules give 512GB, so reaching 384GB means mixing DIMM sizes, as in the following design:

Image courtesy of Frank Denneman
However, whilst performance is increased, the differing hardware could lead to other complications (uniformity is better). Therefore it would be better to scale up to 512GB of RAM:

Image courtesy of Frank Denneman
Right-size your VM
In ESXi, the CPU scheduler assigns CPU time to virtual machines, but the NUMA scheduler performs the initial placement. Therefore vCPU design impacts your initial placement and load-balancing.
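One way to observe the result of that placement is the sched-stats diagnostic tool on the ESXi host. A hedged sketch, assuming ESXi shell access; sched-stats is an unsupported tool and its output fields vary between builds:

```
# List the NUMA clients per VM: home node, local vs remote memory usage and migrations
sched-stats -t numa-clients
```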
Storage
Data access in terms of latency:
- CPU register
- L1/L2/L3 cache
- Local memory
- Disk
Every effort should be made to minimise latency at each of these levels. The industry is moving to NVMe; unfortunately the bandwidth these devices offer currently exceeds what the controller can deliver.
Not all storage drivers are created equal. In one test with an Intel P3700 disk, Frank used the "inbox" (default) driver shipped with ESXi, and then ran the same test with the asynchronous Intel driver. The results were radically different, in favour of the Intel driver.
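To check which driver has claimed a storage adapter, and whether a vendor async driver package is installed, something like the following can be used from the ESXi shell (the grep filter is only an example):

```
# Show each storage adapter and the driver it is currently using
esxcli storage core adapter list

# List installed driver VIBs; the filter pattern is illustrative
esxcli software vib list | grep -i nvme
```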
Networking
VMware's software-defined networking product, NSX, uses VXLAN. This adds an extra layer of packet processing, which consumes more CPU cycles. To avoid taxing the CPU, VXLAN offloading can be used.
Traditional offloads (such as TCP segmentation offload) operate on TCP, which will not work here because the original traffic is encapsulated inside VXLAN; the NIC must support VXLAN-aware offloads.
When purchasing new hardware, look for network virtualisation features such as SR-IOV and VXLAN offloading. Also check that the driver stack supports the feature you need, not just the hardware. To do this, use esxcli network nic list to identify your driver, ethtool to view its version, and then check the VMware HCL to verify support. Finally, verify that the features you require are turned on by default in the driver; example commands are shown below.
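A sketch of that verification workflow from the ESXi shell, where vmnic0 is just a placeholder for your uplink:

```
# List physical NICs together with the driver each one uses
esxcli network nic list

# Show driver details for a specific uplink
esxcli network nic get -n vmnic0

# ethtool also reports the driver name and version
ethtool -i vmnic0
```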
Receive Side Scaling (RSS) and NetQueue both require support in the physical NIC. RSS distributes incoming traffic across multiple receive queues by hashing packet headers (MAC/IP addresses and TCP ports).

Image courtesy of Niels Hagoort / VMware
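RSS is typically enabled through a driver module parameter. The parameter name and values are driver-specific, so the following is only a sketch using the ixgbe driver as an example; check the driver documentation and the HCL first:

```
# Inspect the parameters the driver module accepts and their current values
esxcli system module parameters list -m ixgbe

# Example only: enable RSS (parameter syntax varies per driver); a reboot or
# module reload is required for the change to take effect
esxcli system module parameters set -m ixgbe -p "RSS=4"
```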
Wrap up
That was an extremely technical session focusing on advanced topics. Anyone preparing their VCDX design should bookmark the VMworld session playback for future reference.