I was working at a customer site doing some VMware and Microsoft work. I came in around 1pm and couldn’t get to anything. We couldn’t ping any VM’s. I could however ping the ESXi hosts. I used the vSphere client to connect to each host individually and saw that all VMs were turned off. At the time I had no clue why they were off. The only thing I saw on one of the hosts was that we had lost network redundancy on one of the hosts.
I asked one of the guys if they did anything on the hosts and they said no. I talked to another guy and he had someone else reboot the switch. A little background, all of the ESXi hosts are connected to the same switch (bad) and didn’t separate NICs. When the switch was unplugged all of the hosts determined that they were in Host isolation mode and the way HA was set up was that they were supposed to shut down the VMs.
The purpose of this is so that if one host does get isolated then the VMs will shut down and get restarted on a different host. However since all hosts believed they were isolated they all shut down their respective VMs and sat there doing nothing with all VMs shut off. What would have prevented this would have been to have 2 switches and one NIC connected to each switch. I didn’t design this network, and know it is not best practice to do this, but it was interesting to see this happen. We were doing maintenance on the weekend, so it was no big deal, but imagine the repurcussions is you only have one switch and it fails during the day and all VMs go down. Not good.