Thursday, March 28, 2013

Improving VM to VM network throughput on an ESXi platform


Recently I virtualized most of the servers I had at home onto an ESXi 5.1 platform. This post follows my journey to achieve better network performance between the VMs.

I am quite happy with the setup, as it allowed me to eliminate 5-6 physical boxes in favor of one (very strong) machine. I was also able to achieve some performance improvements, but not to the degree I hoped to see.

I have a variety of machines running in a virtualized form:
1. Windows 8 as my primary desktop, passing a dedicated GPU and USB card from the host to the VM using VMDirectPath
2. Multiple Linux servers
3. Solaris 11.1 as NAS, running the great napp-it software (http://www.napp-it.org/) 

All the machines have the latest VMware Tools installed and run paravirtualized drivers where possible.

VM to VM network performance has been great between the Windows/Linux boxes once I enabled jumbo frames.
Throughout this post I'll use iperf to measure network performance. It's a great, easy-to-use tool, and you can find precompiled versions for almost any operating system: http://iperf.fr/
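To see why jumbo frames help so much, here is a quick back-of-the-envelope calculation (plain Python; header sizes are the minimal 20-byte IPv4 + 20-byte TCP, so the numbers are illustrative):

```python
# Packets (and thus per-packet interrupts/copies) needed to move 1 GiB
# of TCP payload at the standard 1500-byte MTU vs. a 9000-byte jumbo MTU.
IP_TCP_HEADERS = 40  # minimal IPv4 + TCP headers, no options

def packets_needed(payload_bytes, mtu):
    """How many MTU-sized packets it takes to carry payload_bytes."""
    mss = mtu - IP_TCP_HEADERS       # TCP payload per packet
    return -(-payload_bytes // mss)  # ceiling division

gib = 1 << 30
standard = packets_needed(gib, 1500)
jumbo = packets_needed(gib, 9000)
print(standard, jumbo, round(standard / jumbo, 1))  # ~6x fewer packets
```

Roughly six times fewer packets means roughly six times less per-packet CPU work, which is why jumbo frames matter for VM-to-VM traffic where the "wire" is really just memory copies.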

Let's start with an example of network throughput from the Windows 8 machine to Linux:

[iperf screenshot: Windows 8 -> Linux]

11.3 Gbps, not bad. CPU utilization was around 25% on the Windows box throughout the test.

Network performance between the Solaris VM and any other machine on the host was relatively bad. 
I started with the E1000G virtual adapter, as recommended by VMware for Solaris 11 (http://kb.vmware.com/kb/2032669). We'll use one of my Linux VMs (at 192.168.1.202) as the iperf server for these tests:

[iperf screenshot: Solaris (E1000G) -> Linux]

1.36 Gbps. That would be respectable between physical servers, but it is unacceptable between VMs on the same host. Also notice the very high CPU utilization during the test: around 80% system time.

My immediate instinct was to enable jumbo frames. Although the adapter driver is supposed to support them, I was unable to enable jumbo frames no matter how hard I fought:


root@solaris-lab:/kernel/drv# dladm set-linkprop -p mtu=9000 net0
dladm: warning: cannot set link property 'mtu' on 'net0': link busy

I gave up on getting better performance from the E1000G adapter and switched to VMXNET3. I immediately saw an improvement:

[iperf screenshot: Solaris (VMXNET3) -> Linux]

2.31 Gbps, but more importantly, CPU utilization was much lower.

Now let's try to enable jumbo frames for the VMXNET3 adapter. I followed the steps in http://kb.vmware.com/kb/2012445 and http://kb.vmware.com/kb/2032669 without success: the commands complete, but jumbo frames are not actually enabled. We can verify with a 9000-byte ping:

root@solaris-lab:~# ping -s 192.168.1.202 9000 4
PING 192.168.1.202: 9000 data bytes
----192.168.1.202 PING Statistics----
4 packets transmitted, 0 packets received, 100% packet loss


As my next step I was planning to run some dtrace commands, and accidentally noticed that the drivers I had installed were the Solaris 10 versions, not the Solaris 11 versions:


root@solaris-lab:~/vmware-tools-distrib# find /kernel/drv/ -ls |grep vmxnet3
78669    2 -rw-r--r--   1 root     root         1071 Mar 27 01:42 /kernel/drv/vmxnet3s.conf
78671   34 -rw-r--r--   1 root     root        34104 Mar 27 01:42 /kernel/drv/amd64/vmxnet3s
78670   25 -rw-r--r--   1 root     root        24440 Mar 27 01:42 /kernel/drv/vmxnet3s
root@solaris-lab:~/vmware-tools-distrib# find . -ls |grep vmxnet3
  231   25 -rw-r--r--   1 root     root        24528 Nov 17 07:55 ./lib/modules/binary/2009.06/vmxnet3s
  234    2 -rw-r--r--   1 root     root         1071 Nov 17 07:55 ./lib/modules/binary/2009.06/vmxnet3s.conf
  250    2 -rw-r--r--   1 root     root         1071 Nov 17 07:55 ./lib/modules/binary/10/vmxnet3s.conf
  244   25 -rw-r--r--   1 root     root        24440 Nov 17 07:55 ./lib/modules/binary/10/vmxnet3s
  262   34 -rw-r--r--   1 root     root        34104 Nov 17 07:55 ./lib/modules/binary/10_64/vmxnet3s
  237   35 -rw-r--r--   1 root     root        35240 Nov 17 07:55 ./lib/modules/binary/11_64/vmxnet3s
  227   34 -rw-r--r--   1 root     root        34256 Nov 17 07:55 ./lib/modules/binary/2009.06_64/vmxnet3s
  253   25 -rw-r--r--   1 root     root        24672 Nov 17 07:55 ./lib/modules/binary/11/vmxnet3s
  259    2 -rw-r--r--   1 root     root         1071 Nov 17 07:55 ./lib/modules/binary/11/vmxnet3s.conf


This is very strange, as installing the Tools is a straightforward procedure with little room for user error.

So I opened the Tools installation script (Perl) and found an interesting bug:


...
sub configure_module_solaris {
  my $module = shift;
  my %patch;
  my $dir = db_get_answer('LIBDIR') . '/modules/binary/';
  my ($major, $minor) = solaris_os_version();
  my $os_name = solaris_os_name();
  my $osDir;
  my $osFlavorDir;
  my $currentMinor = 10;   # The most recent version we build the drivers for

  if (solaris_10_or_greater() ne "yes") {
    print "VMware Tools for Solaris is only available for Solaris 10 and later.\n";
    return 'no';
  }

  if ($minor < $currentMinor) {
    $osDir = $minor;
  } else {
    $osDir = $currentMinor;
  }
For Solaris 11.1, $minor is 11, so the else branch forces $osDir to the Solaris 10 directory. A bug?
Either way it's very easy to fix - just change "<" to ">":

if ($minor > $currentMinor) {
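
The selection logic is easier to see transcribed into Python (an illustration only, not the installer's actual code; CURRENT_MINOR and the return value stand in for $currentMinor and $osDir):

```python
CURRENT_MINOR = 10  # per the script's comment, despite 11/ and 11_64/ existing

def os_dir_buggy(minor):
    # Original logic: only versions *older* than 10 get their own directory,
    # so Solaris 11 (minor == 11) falls through to the Solaris 10 drivers.
    return minor if minor < CURRENT_MINOR else CURRENT_MINOR

def os_dir_fixed(minor):
    # With "<" flipped to ">", newer minors get their own directory.
    return minor if minor > CURRENT_MINOR else CURRENT_MINOR

print(os_dir_buggy(11), os_dir_fixed(11))  # 10 vs. 11
```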

Re-install Tools using the modified script and reboot. 
Let's check the installed driver now:



root@solaris-lab:~/vmware-tools-distrib# find /kernel/drv/ -ls |grep vmxnet3
79085    2 -rw-r--r--   1 root     root         1071 Mar 27 02:00 /kernel/drv/vmxnet3s.conf
79087   35 -rw-r--r--   1 root     root        35240 Mar 27 02:00 /kernel/drv/amd64/vmxnet3s
79086   25 -rw-r--r--   1 root     root        24672 Mar 27 02:00 /kernel/drv/vmxnet3s
root@solaris-lab:~/vmware-tools-distrib# find . -ls |grep vmxnet3
  231   25 -rw-r--r--   1 root     root        24528 Nov 17 07:55 ./lib/modules/binary/2009.06/vmxnet3s
  234    2 -rw-r--r--   1 root     root         1071 Nov 17 07:55 ./lib/modules/binary/2009.06/vmxnet3s.conf
  250    2 -rw-r--r--   1 root     root         1071 Nov 17 07:55 ./lib/modules/binary/10/vmxnet3s.conf
  244   25 -rw-r--r--   1 root     root        24440 Nov 17 07:55 ./lib/modules/binary/10/vmxnet3s
  262   34 -rw-r--r--   1 root     root        34104 Nov 17 07:55 ./lib/modules/binary/10_64/vmxnet3s
  237   35 -rw-r--r--   1 root     root        35240 Nov 17 07:55 ./lib/modules/binary/11_64/vmxnet3s
  227   34 -rw-r--r--   1 root     root        34256 Nov 17 07:55 ./lib/modules/binary/2009.06_64/vmxnet3s
  253   25 -rw-r--r--   1 root     root        24672 Nov 17 07:55 ./lib/modules/binary/11/vmxnet3s
  259    2 -rw-r--r--   1 root     root         1071 Nov 17 07:55 ./lib/modules/binary/11/vmxnet3s.conf

Now we have the correct version installed. 

Let's enable jumbo frames as before and check whether it makes any difference:

root@solaris-lab:~# ping -s 192.168.1.202 9000 4
PING 192.168.1.202: 9000 data bytes
9008 bytes from 192.168.1.202: icmp_seq=0. time=0.338 ms
9008 bytes from 192.168.1.202: icmp_seq=1. time=0.230 ms
9008 bytes from 192.168.1.202: icmp_seq=2. time=0.289 ms
9008 bytes from 192.168.1.202: icmp_seq=3. time=0.294 ms
----192.168.1.202 PING Statistics----
4 packets transmitted, 4 packets received, 0% packet loss
round-trip (ms)  min/avg/max/stddev = 0.230/0.288/0.338/0.044

Success! Jumbo frames are working.


Let's test throughput with iperf:

[iperf screenshot: Solaris (VMXNET3, jumbo frames) -> Linux]

Less than 1 Mbit/s - not what we expected at all!
We need to take a deeper look at the packets being sent. Let's use tcpdump to create a trace file:

root@solaris-lab:~# tcpdump -w pkts.pcap -s 100 -inet1 & PID=$! ; sleep 1s ; ./iperf -t1 -c192.168.1.202; kill $PID
[1] 1726
tcpdump: listening on net1, link-type EN10MB (Ethernet), capture size 100 bytes
------------------------------------------------------------
Client connecting to 192.168.1.202, TCP port 5001
TCP window size: 48.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.206 port 35084 connected with 192.168.1.202 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.3 sec    168 KBytes  1.02 Mbits/sec
70 packets captured
70 packets received by filter
0 packets dropped by kernel

and open it in Wireshark for easier analysis:

[Wireshark screenshot of the capture]

The problem is clear at packet 7: the driver is trying to send a 16KB packet, well above our 9K jumbo-frame MTU. The packet is never received outside of the VM, and after a timeout it is fragmented and retransmitted. This happens for every large packet, generating massive delays and very low throughput.

Reviewing the vmxnet3 driver source (open source at http://sourceforge.net/projects/open-vm-tools/), it seems the only way a packet larger than the MTU can be sent is if the LSO feature is enabled.
To learn more about LSO (Large Segment Offload), read http://en.wikipedia.org/wiki/Large_segment_offload.
Essentially, the kernel sends large packets (16K in the capture) and the NIC (or virtual NIC) is supposed to segment them and transmit valid-size packets. On a real hardware NIC, at high speeds, this saves a considerable amount of CPU; in a virtualized environment I don't see the benefit. And here it seems to be badly broken.
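
What a working LSO implementation is supposed to do can be sketched in a few lines (illustrative Python; the real segmentation happens in the NIC or vNIC and also rewrites TCP sequence numbers and checksums):

```python
def lso_segment(payload, mtu, headers=40):
    """Split one oversized send from the kernel into MTU-legal
    TCP payload chunks, as the (v)NIC is expected to do."""
    mss = mtu - headers  # TCP payload that fits in one MTU-sized packet
    return [payload[i:i + mss] for i in range(0, len(payload), mss)]

big_send = bytes(16 * 1024)             # the 16 KB packet seen in the capture
segments = lso_segment(big_send, 9000)  # jumbo-frame MTU
print(len(segments), max(len(s) for s in segments))  # 2 segments, each <= 8960 bytes
```

When this step is broken, the full 16 KB frame goes onto the wire, gets dropped, and the stack falls back to timeout-driven retransmission - exactly the behavior in the capture.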

Let's disable LSO:

ndd -set /dev/ip ip_lso_outbound 0

And try to run iperf again:

[iperf screenshot: Solaris -> Linux, LSO disabled]

12.1 Gbps, SUCCESS!

Now that we are able to transmit out of Solaris at decent rates, let's check the performance of connections into the Solaris VM:

[iperf screenshot: Linux -> Solaris]

3.74 Gbps - not bad, but we can do better. Let's at least get to 10 Gbps.

The next step is to tune the TCP parameters to accommodate the higher speed: the default buffers are simply too small for the amount of data in flight.


root@solaris-lab:~# ipadm set-prop -p max_buf=4194304 tcp
root@solaris-lab:~# ipadm set-prop -p recv_buf=1048576 tcp
root@solaris-lab:~# ipadm set-prop -p send_buf=1048576 tcp
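
These sizes aren't arbitrary. The bandwidth-delay product says how much data must be in flight to keep a fast, low-latency link full; a rough sketch using the ~0.3 ms RTT from the ping test above and a 10 Gbit/s target (illustrative numbers):

```python
def bdp_bytes(bandwidth_bits_per_s, rtt_s):
    """Bandwidth-delay product: bytes that must be in flight to fill the pipe."""
    return round(bandwidth_bits_per_s * rtt_s / 8)

target_bps = 10e9   # 10 Gbit/s goal
rtt = 0.0003        # ~0.3 ms VM-to-VM round trip
print(bdp_bytes(target_bps, rtt))  # 375000 bytes, ~366 KB in flight

# Conversely, a small window caps throughput. iperf reported
# "TCP window size: 48.0 KByte (default)" earlier:
ceiling_bps = 48 * 1024 * 8 / rtt
print(round(ceiling_bps / 1e9, 2))  # ~1.31 Gbit/s at 0.3 ms RTT
```

A 48 KB window at this RTT tops out around 1.3 Gbit/s, which roughly matches the early E1000G numbers; the 1 MB send/receive buffers set above leave comfortable headroom past 10 Gbit/s.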

And run iperf again:

[iperf screenshot: Linux -> Solaris, tuned buffers]

18.3 Gbps. Not bad!