ESXTOP to the rescue!!

A request came in to investigate the performance of a 16 GB, 4 vCPU database VM, specifically its CPU performance. The usual starting place of the vCenter performance charts showed that actual vCPU utilisation wasn’t unusually high for sustained periods, so it was off to ESXTOP for a bit more of a deep dive.

ESXTOP

It had been a while since I’d used ESXTOP to identify CPU issues, so it was time for a quick read of yet another article from Duncan Epping over at www.yellow-bricks.com/esxtop, and of VMware KB Article 1017926.

The two main metrics I was interested in were:-

Ready, %RDY:

This value represents the percentage of time that the virtual machine is ready to execute commands, but has not yet been scheduled for CPU time due to contention with other virtual machines.
Compare this against the Max Limited (%MLMTD) value. This represents the amount of time that the virtual machine was ready to execute but was not scheduled for CPU time because the VMkernel deliberately constrained it. For more information, see the Managing Resource Pools section of the vSphere Monitoring and Performance Guide or the Resource Management Guide.
If the virtual machine is unresponsive or very slow and %MLMTD is low, it may indicate that the ESX host has limited CPU time to schedule for this virtual machine.

Co-stop, %CSTP:

This value represents the percentage of time that the virtual machine is ready to execute commands, but is waiting for the availability of multiple CPUs, as the virtual machine is configured to use multiple vCPUs.
If the virtual machine is unresponsive and %CSTP is proportionally high compared to %RUN, it may indicate that the ESX host has limited CPU resources to simultaneously co-schedule all vCPUs in this virtual machine.
Review the usage of virtual machines running with multiple vCPUs on this host. For example, a virtual machine with four vCPUs may need to schedule 4 pCPUs to do an operation. If there are multiple virtual machines configured in this way, it may lead to CPU contention and resource starvation.
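
One thing worth bearing in mind when reading these figures: at the VM group level, esxtop sums %RDY and %CSTP across all the worlds in the group, so for a multi-vCPU VM the group figure wants dividing by the vCPU count (or expand the group with ‘e’ to see per-vCPU values) before comparing it against the commonly quoted rule of thumb of roughly 10% ready time per vCPU, which is a guideline rather than a hard limit. A trivial worked example:

echo "scale=2; 40 / 4" | bc    # a 4 vCPU VM reporting a group-level %RDY of 40 works out at ~10.00 per vCPU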

ESXTOP can provide an absolute plethora of information, and drilling down to the required details can be very daunting. I managed to find a very informative article from Joshua Townsend, ‘The Skinny on ESXTOP’, about stripping ESXTOP output down to a more manageable size.

I saved the resultant config as esxtop-cpu-only.
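
For anyone wondering where the config file comes from in the first place, esxtop will build and save one for you interactively. A rough outline of the keystrokes (exactly which field groups you toggle off is down to which counters you want to keep):

esxtop    # start esxtop interactively on the host
# c - switch to the CPU resource screen
# f - toggle the field groups on/off, leaving just the CPU counters of interest
# W - write the current configuration out to a file, in this case esxtop-cpu-only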

With the config in place, I ran ESXTOP in batch mode:

esxtop -b -d 10 -n 90 -c esxtop-cpu-only | gzip -9c > /tmp/esxtop-cpu-only.csv.gz

Where:-

-b = batch mode

-d = delay in secs (10 secs)

-n = number of samples (90 samples)

-c = configuration file (esxtop-cpu-only)

Taking samples every 10 seconds, with 90 samples in total, gave me an overall capture window of 15 minutes (10 × 90 = 900 seconds).

Once the capture was complete, I used the excellent tool ESXPLOT to graph the results.
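
The batch capture above was piped through gzip, so it needs decompressing (and, if ESXPLOT is running on a workstation rather than on the host, copying off first) before it can be imported. A minimal sketch, with the hostname and paths as placeholders:

scp root@esxhost:/tmp/esxtop-cpu-only.csv.gz .   # pull the capture off the host (esxhost is a placeholder)
gunzip esxtop-cpu-only.csv.gz                    # leaves esxtop-cpu-only.csv ready for File -> Import -> Dataset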

  1. Run: esxplot
  2. Click File -> Import -> Dataset
  3. Select file and click ‘Open’
  4. Double click host name and click on metric

 

The first metric I looked at was %CSTP; as you will see from the graph, the value was pretty high in places.

[Graph: bar-dun-db1 %CSTP]

While the %RDY value was relatively normal:

[Graph: bar-dun-db1 %RDY]

So %CSTP appeared to be indicating an issue with excessive usage of vSMP. A quick Google search for ‘High %CSTP’ flagged up the following article

http://yuridejager.wordpress.com/2012/05/01/high-cstp-in-your-vm-delete-snapshot-failed-how-to-consolidate-orphaned-snapshots-into-your-vm/

and KB Article 2000058

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2000058

both of which referred to an issue with snapshots.

I checked the VM for snapshots, and there was indeed a snapshot still active on the VM, and it had been for the past 4-5 days. Deleting the snapshot took some considerable time, but once it completed I ran the same ESXTOP batch command and sampled the metrics over another 15-minute period.
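
For anyone wanting to do the snapshot check (and clean-up) from the ESXi shell rather than the vSphere Client, something along these lines should do it; the VM ID is host-specific, and PowerCLI’s Get-Snapshot / Remove-Snapshot would be an equally valid route:

vim-cmd vmsvc/getallvms                   # list the VMs on the host and note the Vmid of the VM in question
vim-cmd vmsvc/snapshot.get <vmid>         # show any snapshots hanging off that VM
vim-cmd vmsvc/snapshot.removeall <vmid>   # delete (consolidate) all snapshots for the VM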

From the following graph you will see that the %CSTP value has dropped away to almost zero.

[Graph: bar-dun-db1 %CSTP after deleting the snapshot]

ESXTOP to the rescue!!