This week I had an issue with my NSX installation across multiple clusters. First it appeared in my management cluster, then my resource and edge clusters. Little did I know my own stupidity was to blame…
I decided this week it was time to do a little housekeeping in the lab. Hosts had fallen out of compliance for one reason or another and I was too busy to do anything about it (despite alarms going off).
I ran a check on each cluster, noted the bits that needed fixing, and then set about doing it. Where necessary I updated each profile from the required reference host.
Then the gremlins started to appear.
Shortly after this period of righteous profile application, NSX components started failing with gusto. Lab environments using on-demand NAT networks provisioned through vRealize Automation fell off the map. Routed subnets could no longer get out into the big wide world. And in the vSphere Web Client, things looked a lot less rosy:
Anyone who’s been working with NSX for a while will know that the first two are usually related to the netcpad and vShield-Stateful-Firewall services being down or broken. Alternatively, this can be due to communication errors between the NSX controllers and the ESXi hosts. However, it is unusual for the Control Plane Agent to Controller status to be in an unknown state.
This gremlin just had to die.
Despite being told by everyone from VMware staffers to Santa Claus to stop using the C# Client, I regrettably still do. I know, I know, I just can’t help it. What can I tell you, I’m a junkie.
After updating the profile from the reference host, reapplying said profile to the cluster and then attempting to remediate it, I got the following errors on the remaining hosts:
Failures Against Host Profile
Option UserVars.RmqClientId doesn’t match the specified criteria
Option UserVars.RmqClientRequestQueue doesn’t match the specified criteria
Option UserVars.RmqClientResponseQueue doesn’t match the specified criteria
Option UserVars.RmqClientToken doesn’t match the specified criteria
Option UserVars.RmqHostId doesn’t match the specified criteria
Option UserVars.RmqPassword doesn’t match the specified criteria
Option UserVars.RmqUsername doesn’t match the specified criteria
Rather than take the time to investigate what “Rmq” stood for, I put each host into maintenance mode and re-applied the profile. Bad move.
NSX communicates with each ESXi host using the Advanced Message Queuing Protocol (AMQP), an open standard application layer protocol for message-oriented middleware. In NSX, the AMQP service is provided by RabbitMQ.
The vShield-Stateful-Firewall service sits on each host and talks to the RabbitMQ server on the NSX Manager over tcp/5671. This channel is the message bus.
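If you just want a quick sanity check that tcp/5671 on the NSX Manager answers at all, a rough sketch like the one below would do. It is only a plain TCP connect test written for illustration, not anything NSX ships. It says nothing about whether AMQP itself is healthy or whether a host's credentials are valid, and the `can_reach` helper name is my own.

```python
import socket


def can_reach(host, port=5671, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    This only proves the port accepts connections. It does not speak AMQP
    and does not validate any RabbitMQ credentials.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Calling `can_reach("nsx-manager.lab.local")` (a made-up hostname, substitute your own NSX Manager) returns `True` or `False`. A `False` points at a network or firewall problem, while a `True` with NSX still broken points at something higher up the stack.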
Each host has its own RabbitMQ (“Rmq”) settings (ClientId, HostId, username and so on), and the values are unique to that host. By enforcing the settings from the reference host on every host in the cluster, I had effectively crippled NSX communication for all but one host.
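Had I bothered to triage the compliance failures before remediating, the pattern would have been obvious. Here is a minimal sketch of that triage, assuming you have pasted the failure messages into a list of strings. The `split_failures` helper and the idea of treating everything under the `UserVars.Rmq` prefix as host-specific are my own illustration based on the errors above, not a VMware tool or API.

```python
import re

# Prefix shared by the per-host RabbitMQ options in the compliance failures.
HOST_SPECIFIC_PREFIX = "UserVars.Rmq"

# Matches lines such as:
#   Option UserVars.RmqHostId doesn't match the specified criteria
# The "." in "doesn.t" accepts both straight and curly apostrophes.
OPTION_RE = re.compile(r"^Option\s+(\S+)\s+doesn.t match the specified criteria$")


def split_failures(lines):
    """Split failure lines into (host_specific, other) option names.

    Lines that do not look like an option failure (headings, blanks)
    are ignored.
    """
    host_specific, other = [], []
    for line in lines:
        match = OPTION_RE.match(line.strip())
        if not match:
            continue
        name = match.group(1)
        if name.startswith(HOST_SPECIFIC_PREFIX):
            host_specific.append(name)
        else:
            other.append(name)
    return host_specific, other
```

Anything that lands in the first list is a value you should not be pushing from a reference host. Anything in the second list is probably genuine drift worth remediating.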
Whilst I can’t actually prove it, I’m assuming the Web Client is smart enough not to apply these values to a host that has NSX software installed. As the C# Client is unaware of NSX, it happily applies them… and then laughs as your infrastructure burns down.
If you really, really, really must resolve this issue using the C# Client, you can simply disable the appropriate options in the profile:
My advice however would be to resolve it in the vSphere Web Client.
TL;DR – Stop using the C# Client for anything that can affect NSX. Also, remember that certain things in the host profile are unique to each host, even if the profile doesn’t always know it.
Happy Friday 😉