Recently I virtualized most of the servers I had at home onto an ESXi 5.1 platform. This post follows my journey toward better network performance between the VMs.
I am quite happy with the setup, as it allowed me to eliminate 5-6 physical boxes in favor of one (very strong) machine. I was also able to achieve some performance improvements, but not to the degree I had hoped.
I have a variety of machines running in a virtualized form:
1. Windows 8 as my primary desktop, with a dedicated GPU and USB card passed from the host to the VM using VMDirectPath
2. Multiple Linux servers
3. Solaris 11.1 as NAS, running the great napp-it software (http://www.napp-it.org/)
All the machines have the latest VMware Tools installed and running paravirtualized drivers where possible.
VM to VM network performance has been great between the Windows/Linux boxes once I enabled Jumbo Frames.
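Enabling jumbo frames end to end means raising the MTU both on the vSwitch and inside each guest. A minimal sketch, assuming a standard vSwitch named vSwitch0 and a Linux guest NIC called eth0 (on Windows it's the "Jumbo Packet" entry in the vmxnet3 adapter's advanced properties):
esxcli network vswitch standard set --vswitch-name=vSwitch0 --mtu=9000   # on the ESXi host
ip link set dev eth0 mtu 9000                                            # inside the Linux guest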
Throughout this post I'll use iperf to measure network performance. It's a great, easy-to-use tool, and you can find precompiled versions for almost any operating system: http://iperf.fr/
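A typical test uses one VM as the iperf server and another as the client; the commands below are only illustrative (the server address and duration are placeholders):
iperf -s                       # on the server VM
iperf -c <server-ip> -t 10     # on the client VM: a 10-second TCP throughput test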
Let's start with an example of network throughput from the Windows 8 machine to Linux:
11.3 Gbps, not bad. CPU utilization was around 25% on the Windows box throughout the test.
Network performance between the Solaris VM and any other machine on the host, however, was far worse.
I started with the E1000G virtual adapter, as recommended by VMware for Solaris 11 (http://kb.vmware.com/kb/2032669). We'll use one of my Linux VMs (at 192.168.1.202) as the iperf server for these tests. Testing with iperf:
1.36 Gbps. Not bad between physical servers, but unacceptable between VMs on the same host. Also notice the very high CPU utilization during the test, around 80% system time.
My immediate instinct was to enable jumbo frames. Although the adapter driver is supposed to support them, I was unable to enable jumbo frames no matter how hard I fought it:
root@solaris-lab:/kernel/drv# dladm set-linkprop -p mtu=9000 net0
dladm: warning: cannot set link property 'mtu' on 'net0': link busy
I gave up on getting better performance from the E1000G adapter and switched to VMXNET3. I immediately saw an improvement:
2.31 Gbps, but more importantly, the CPU utilization was much lower.
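For completeness, the adapter swap itself happens on the VM side rather than in the guest: remove the E1000 NIC and add a new VMXNET3 one in the vSphere Client, or set the type directly in the .vmx file (ethernet0 below is just an assumed device name):
ethernet0.virtualDev = "vmxnet3"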
Now let's try to enable jumbo frames for the VMXNET3 adapter. I followed the steps in http://kb.vmware.com/kb/2012445 and http://kb.vmware.com/kb/2032669 without success: the commands worked, but jumbo frames were not really enabled. We can test with a 9000-byte ping:
root@solaris-lab:~# ping -s 192.168.1.202 9000 4
PING 192.168.1.202: 9000 data bytes
----192.168.1.202 PING Statistics----
4 packets transmitted, 0 packets received, 100% packet loss
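For reference, on the Solaris side the MTU change amounts to roughly the following sequence on the VMXNET3 link; net1, the address, and the /24 netmask are from my setup, and this same unplumb-first pattern is also the usual answer to the "link busy" error seen earlier on net0:
ipadm delete-ip net1                                      # unplumb the IP interface holding the link
dladm set-linkprop -p mtu=9000 net1                       # raise the link MTU
ipadm create-ip net1                                      # replumb
ipadm create-addr -T static -a 192.168.1.206/24 net1/v4   # restore the address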
As my next step I was planning on running some dtrace commands, when I accidentally noticed that the drivers I had installed were the Solaris 10 version and not the Solaris 11 version:
root@solaris-lab:~/vmware-tools-distrib# find /kernel/drv/ -ls |grep vmxnet3
78669 2 -rw-r--r-- 1 root root 1071 Mar 27 01:42 /kernel/drv/vmxnet3s.conf
78671 34 -rw-r--r-- 1 root root 34104 Mar 27 01:42 /kernel/drv/amd64/vmxnet3s
78670 25 -rw-r--r-- 1 root root 24440 Mar 27 01:42 /kernel/drv/vmxnet3s
root@solaris-lab:~/vmware-tools-distrib# find . -ls |grep vmxnet3
231 25 -rw-r--r-- 1 root root 24528 Nov 17 07:55 ./lib/modules/binary/2009.06/vmxnet3s
234 2 -rw-r--r-- 1 root root 1071 Nov 17 07:55 ./lib/modules/binary/2009.06/vmxnet3s.conf
250 2 -rw-r--r-- 1 root root 1071 Nov 17 07:55 ./lib/modules/binary/10/vmxnet3s.conf
244 25 -rw-r--r-- 1 root root 24440 Nov 17 07:55 ./lib/modules/binary/10/vmxnet3s
262 34 -rw-r--r-- 1 root root 34104 Nov 17 07:55 ./lib/modules/binary/10_64/vmxnet3s
237 35 -rw-r--r-- 1 root root 35240 Nov 17 07:55 ./lib/modules/binary/11_64/vmxnet3s
227 34 -rw-r--r-- 1 root root 34256 Nov 17 07:55 ./lib/modules/binary/2009.06_64/vmxnet3s
253 25 -rw-r--r-- 1 root root 24672 Nov 17 07:55 ./lib/modules/binary/11/vmxnet3s
259 2 -rw-r--r-- 1 root root 1071 Nov 17 07:55 ./lib/modules/binary/11/vmxnet3s.conf
This is very strange, as installing the Tools is a straightforward procedure with no room for user error.
So I opened the Tools installation script (Perl) and found an interesting bug:
...
sub configure_module_solaris {
  my $module = shift;
  my %patch;
  my $dir = db_get_answer('LIBDIR') . '/modules/binary/';
  my ($major, $minor) = solaris_os_version();
  my $os_name = solaris_os_name();
  my $osDir;
  my $osFlavorDir;
  my $currentMinor = 10; # The most recent version we build the drivers for

  if (solaris_10_or_greater() ne "yes") {
    print "VMware Tools for Solaris is only available for Solaris 10 and later.\n";
    return 'no';
  }

  if ($minor < $currentMinor) {
    $osDir = $minor;
  } else {
    $osDir = $currentMinor;
  }
For Solaris 11.1, $minor is 11, which forces $osDir down to the Solaris 10 drivers. A bug? Either way, it's very easy to fix: just change "<" to ">":
if ($minor > $currentMinor) {
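Re-installing the Tools with the patched script is otherwise the standard procedure; on my setup it came down to roughly this (treat it as a sketch, init 6 being the usual Solaris reboot):
./vmware-install.pl   # re-run the installer from the vmware-tools-distrib directory, accepting the defaults
init 6                # reboot so the new vmxnet3s driver gets loaded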
With the Tools re-installed from the modified script and the VM rebooted, let's check the installed driver now:
root@solaris-lab:~/vmware-tools-distrib# find /kernel/drv/ -ls |grep vmxnet3
79085 2 -rw-r--r-- 1 root root 1071 Mar 27 02:00 /kernel/drv/vmxnet3s.conf
79087 35 -rw-r--r-- 1 root root 35240 Mar 27 02:00 /kernel/drv/amd64/vmxnet3s
79086 25 -rw-r--r-- 1 root root 24672 Mar 27 02:00 /kernel/drv/vmxnet3s
root@solaris-lab:~/vmware-tools-distrib# find . -ls |grep vmxnet3
231 25 -rw-r--r-- 1 root root 24528 Nov 17 07:55 ./lib/modules/binary/2009.06/vmxnet3s
234 2 -rw-r--r-- 1 root root 1071 Nov 17 07:55 ./lib/modules/binary/2009.06/vmxnet3s.conf
250 2 -rw-r--r-- 1 root root 1071 Nov 17 07:55 ./lib/modules/binary/10/vmxnet3s.conf
244 25 -rw-r--r-- 1 root root 24440 Nov 17 07:55 ./lib/modules/binary/10/vmxnet3s
262 34 -rw-r--r-- 1 root root 34104 Nov 17 07:55 ./lib/modules/binary/10_64/vmxnet3s
237 35 -rw-r--r-- 1 root root 35240 Nov 17 07:55 ./lib/modules/binary/11_64/vmxnet3s
227 34 -rw-r--r-- 1 root root 34256 Nov 17 07:55 ./lib/modules/binary/2009.06_64/vmxnet3s
253 25 -rw-r--r-- 1 root root 24672 Nov 17 07:55 ./lib/modules/binary/11/vmxnet3s
259 2 -rw-r--r-- 1 root root 1071 Nov 17 07:55 ./lib/modules/binary/11/vmxnet3s.conf
Now we have the correct version installed.
Let's enable jumbo frames as before and check if it made any difference:
root@solaris-lab:~# ping -s 192.168.1.202 9000 4
PING 192.168.1.202: 9000 data bytes
9008 bytes from 192.168.1.202: icmp_seq=0. time=0.338 ms
9008 bytes from 192.168.1.202: icmp_seq=1. time=0.230 ms
9008 bytes from 192.168.1.202: icmp_seq=2. time=0.289 ms
9008 bytes from 192.168.1.202: icmp_seq=3. time=0.294 ms
----192.168.1.202 PING Statistics----
4 packets transmitted, 4 packets received, 0% packet loss
round-trip (ms) min/avg/max/stddev = 0.230/0.288/0.338/0.044
Success! Jumbo frames are working.
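The new MTU can also be confirmed directly on the datalink:
dladm show-linkprop -p mtu net1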
Let's test throughput with iperf:
Barely 1 Mbit/s, not what we expected at all!
We need to take a deeper look at the packets being sent. Let's use tcpdump to create a trace file:
root@solaris-lab:~# tcpdump -w pkts.pcap -s 100 -inet1 & PID=$! ; sleep 1s ; ./iperf -t1 -c192.168.1.202; kill $PID
[1] 1726
tcpdump: listening on net1, link-type EN10MB (Ethernet), capture size 100 bytes
------------------------------------------------------------
Client connecting to 192.168.1.202, TCP port 5001
TCP window size: 48.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.206 port 35084 connected with 192.168.1.202 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.3 sec 168 KBytes 1.02 Mbits/sec
70 packets captured
70 packets received by filter
0 packets dropped by kernel
and open it in Wireshark for easier analysis:
The problem is clear with packet 7: the driver is trying to send a 16 KB packet, well above our 9000-byte jumbo-frame MTU. This packet never arrives outside the VM, and after a timeout it is fragmented and retransmitted. The same thing happens for every packet, generating massive delays and very low throughput.
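Wireshark is convenient but not required here; reading the capture back with tcpdump shows the oversized segments as well:
tcpdump -n -r pkts.pcap | head -20   # look for TCP segments with a length far above the 9000-byte MTU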
Reviewing the vmxnet3 driver source (open source, at http://sourceforge.net/projects/open-vm-tools/), it seems the only way a packet larger than the MTU can be sent is if the LSO feature is enabled.
To learn more about LSO (Large Segment Offload) read http://en.wikipedia.org/wiki/Large_segment_offload.
Essentially, the kernel sends large packets (16 KB in the capture) and the NIC (or virtual NIC) is supposed to segment them and transmit valid-size packets on the wire. On real hardware, at high speeds, this saves a considerable amount of CPU; in a virtualized environment I don't see the benefit, and here it seems to be badly broken.
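Before turning it off, the current value of the tunable can be checked with ndd's query form (my assumption is that 1 means enabled, 0 disabled):
ndd /dev/ip ip_lso_outbound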
Let's disable LSO:
ndd -set /dev/ip ip_lso_outbound 0
And try to run iperf again:
12.1 Gbps, SUCCESS!
Now that we are able to transmit from Solaris at decent rates, let's check the performance of connections into the Solaris VM:
3.74 Gbps, not bad, but we can do better - let's at least get to 10 Gbps.
The next step is to tune the TCP parameters to accommodate the higher speeds - the default buffers are simply too small for the amount of data in flight:
root@solaris-lab:~# ipadm set-prop -p max_buf=4194304 tcp
root@solaris-lab:~# ipadm set-prop -p recv_buf=1048576 tcp
root@solaris-lab:~# ipadm set-prop -p send_buf=1048576 tcp
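The new values can be verified with the matching show-prop call:
ipadm show-prop -p max_buf,recv_buf,send_buf tcp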
And run iperf again:
18.3 Gbps. Not bad!