I was working at a customer site doing
some VMware and Microsoft work. I came in around 1pm and couldn’t get to
anything. We couldn’t ping any VM’s. I could however ping the ESXi hosts. I
used the vSphere client to connect to each host individually and saw that all
VMs were turned off. At the time I had no clue why they were off. The only
thing I saw on one of the hosts was that we had lost network redundancy on one
of the hosts.
I asked one of the guys if they did
anything on the hosts and they said no. I talked to another guy and he had
someone else reboot the switch. A little background, all of the ESXi hosts are
connected to the same switch (bad) and didn’t separate NICs. When the switch
was unplugged all of the hosts determined that they were in Host isolation mode
and the way HA was set up was that they were supposed to shut down the VMs.
The purpose of this is so that if
one host does get isolated then the VMs will shut down and get restarted on a
different host. However since all hosts believed they were isolated they all
shut down their respective VMs and sat there doing nothing with all VMs shut
off. What would have prevented this would have been to have 2 switches and one
NIC connected to each switch. I didn’t design this network, and know it is not
best practice to do this, but it was interesting to see this happen. We were
doing maintenance on the weekend, so it was no big deal, but imagine the
repurcussions is you only have one switch and it fails during the day and all
VMs go down. Not good.