My (self-inflicted) VSAN nightmare

A few weeks back I completely refreshed my homelab to include three new hosts and VMware VSAN. Despite being initially pleased with the setup, there was one job I forgot to do before configuring VSAN.

And it has been pain all the way since.

Whenever I buy any new hardware for either a lab environment or production I always update the firmware to the latest revision. This ensures that any nasty gremlins in the code have been ironed out, and that you’re more likely to be supported if you run into any issues.

After coming to terms with the fact that the Dell PERC H700 cards I bought were not going to work, I bought three IBM M1015 cards off eBay and installed them. At this point all my lab VMs were living on one 4TB SATA drive plugged into one of the servers, and I desperately wanted to move them back to a more stable storage environment.

As proponents of VMware VSAN will often tell you, it’s trivially easy to set up. Three clicks later I had a VSAN 6.2 datastore configured and began copying across all my VMs.

I always knew I would have to go back and update the firmware, so after the copying finished I created a USB drive using Rufus and downloaded the necessary firmware.

And that’s when it all started to go wrong.

Mistake # 1

The IBM ServeRAID M1015 SAS/SATA controllers are actually MegaRAID 9240-8i cards by Avago Technologies (formerly LSI). Whilst described by IBM as an “entry-level 6 Gbps RAID controller”, they do the job and, more importantly, are on the VSAN Hardware Compatibility List (HCL).

However, in their default RAID mode they have an Adapter Queue Length (AQLEN) of 25, which is woefully inadequate for VSAN (VMware recommend a minimum of 256).

Knowing the M1015 is capable of an AQLEN of 600, I began flashing the firmware of each card into IT mode. This also enables passthrough, so ESXi no longer sees each individual disk as a RAID0 volume but as the disk it really is. To enable faster booting, I also removed the ROM BIOS from each card.
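As a rough illustration of why the flash was worth attempting, here is a minimal sketch that compares the two queue depths quoted above against the 256 minimum. The function name `aqlen_ok` and the dictionary of modes are made up for this example; only the numbers come from this post.

```python
# Minimum adapter queue depth for VSAN, as quoted in this post.
VSAN_MIN_AQLEN = 256


def aqlen_ok(aqlen: int, minimum: int = VSAN_MIN_AQLEN) -> bool:
    """Return True if a controller's queue depth meets the minimum."""
    return aqlen >= minimum


# Queue depths quoted in this post for the M1015 in each firmware mode.
modes = {"RAID firmware (default)": 25, "IT firmware": 600}

for mode, depth in modes.items():
    verdict = "OK" if aqlen_ok(depth) else "too shallow"
    print(f"{mode}: AQLEN {depth} -> {verdict}")
```

On a real host you would read the AQLEN value from the controller itself rather than hard-coding it.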

The first host rebooted and appeared fine. The vsanDatastore was still there, but I didn’t consult the VSAN health dashboard to confirm it was healthy, and just began the process of bringing down the cluster to flash the remaining two cards.

Mistake # 2

After about an hour the servers were all back up and each ESXi host saw a caching disk, a capacity disk, and a vsanDatastore. I didn’t have time to fire up any VMs as I had to dash to Schiphol for the Scottish VMUG, but would do it later whilst waiting for my flight.

Or so I thought.

When I connected to my three hosts using the Embedded Host Client, each one showed a different number of VMs, and all invalid. After investigating the storage I spotted the issue… the vsanDatastore was empty! It appears VSAN doesn’t like having the volumes changed around, and despite appearing to work, the reality is very much the opposite.

Nothing to see here... move along

Convinced a network issue might be the cause (I also bounced the 10GbE switch at the same time), I used the following command on each host to confirm multicast traffic was flowing as expected:

tcpdump-uw -i vmk4 -n -s0 -t -c 20 udp port 12345 or udp port 23451
  • -i: interface (I remember using vmk4 on my dvSwitch for VSAN. Check using esxcfg-vmknic -l)
  • -n: do not resolve IP information
  • -s0: collect entire packet
  • -t: no timestamp
  • -c: number of frames to capture
  • udp/12345: master group multicast port
  • udp/23451: agent group multicast port
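If you save the output of that capture, a few lines of Python can tally the packets seen on each port. This is only a sketch: it assumes the usual one-line-per-packet format that tcpdump prints with -n and -t, and the sample lines below (addresses included) are made up for illustration rather than taken from my hosts.

```python
import re
from collections import Counter

# Matches the destination port in lines such as:
#   IP 192.168.10.11.12345 > 224.1.2.3.12345: UDP, length 200
DST_PORT = re.compile(r"^IP \S+ > \d+\.\d+\.\d+\.\d+\.(\d+): UDP")


def count_by_dst_port(lines):
    """Count captured UDP packets per destination port."""
    counts = Counter()
    for line in lines:
        match = DST_PORT.match(line.strip())
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Hypothetical capture lines, for illustration only.
sample = [
    "IP 192.168.10.11.12345 > 224.1.2.3.12345: UDP, length 200",
    "IP 192.168.10.12.23451 > 224.2.3.4.23451: UDP, length 472",
    "IP 192.168.10.13.12345 > 224.1.2.3.12345: UDP, length 200",
]

counts = count_by_dst_port(sample)
for port in (12345, 23451):
    print(f"udp/{port}: {counts[port]} packet(s)")
```

A zero against either port on any host would point back at the network rather than the storage.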

Networking, it would appear, was not the issue.

It must be storage-related. Having realised my mistake, the fear set in and I dumped the logs and rapidly backed out my firmware upgrade. I brought each host down and re-flashed the cards back to their IBM defaults. Unfortunately the damage may have already been done.

At this point I began to cry to anyone who’d listen on the vExpert Slack channel. Fortunately John Nicholson and Jase McCarty from VMware heard my cries and have been helping, along with Dave Edwards, ever since.

I’m a long way from dry land with regards to all my VMs coming back, but I remain optimistic. Update to follow.

However I am reminded of the old adage… “Jesus saves, God backs up”.

Mistake # 3

Whilst talking to John I was reminded of why adhering to the HCL is imperative. It appears the Samsung SSDs I’d previously bought for my VSAN are not supported, and for a very good reason.

When Samsung refer to their drives as “Pro” they actually mean “Pro Gamer”, and whilst wicked fast, they lack the critical functionality of power-loss protection. So if your hosts lose power at any point, you’re likely to encounter data corruption where you least want it.

With this in mind I decided it was best to replace the Samsung 850 Pro 1TB drives with something supported. I considered Intel drives as these appear to be the de facto standard, but they’re not cheap… even for refurbished ones.

Whilst slightly more expensive, I purchased three Samsung PM863 480GB drives for caching and six SM863 960GB drives for capacity. When they arrive I shall use the Samsung 950 Pro 512GB M.2s that currently serve as the cache tier as the ESXi boot volume. If nothing else, ESXi will boot like greased lightning.

At this point I’m considering asking HPE to tender a bid for running my homelab, as it’d surely be cheaper…

4 thoughts on “My (self-inflicted) VSAN nightmare”

  1. I know this must all have been a pain. If it is only for a homelab, I would recommend Intel NUCs to make things easier and less expensive. Anyway, we often need to use our existing hardware and should not have issues like the above. Thanks for sharing (the power-loss feature is something I was unaware of).

    • Hi,

      I know a lot of people are fans of the NUCs, but for my workloads they just don’t have enough horsepower, even for a homelab.

      These hosts have 128GB each… just enough!

      -Mark

  2. Pingback: Newsletter: May 7, 2016 | Notes from MWhite

  3. Old entry, but I experienced much the same with Storage Spaces in Windows Server 2012 R2 some time ago. I had to rearrange the disks and could not, in any way, repair my Storage Space. At the time it was just an experiment where I’d planned to move all my Enterprise ISOs, but I had only copied them rather than moved them *phew*

    Now I’m running two QNAP NASes, one with 16TB and 10GbE iSCSI, and a secondary as a backup NAS with ~10TB of storage over 2xGigE. I have also attached a Fiber Channel NAS (4×4 Gig) to my two hosts, where I plan to run the SSD disks that I currently have as local storage.

    Currently having 2 hosts with 64 Gig ram each and E5-2630 CPUs.

    //Anders – http://www.direktorn.com
