Thursday, March 28, 2013

Improving VM to VM network throughput on an ESXi platform


Recently I virtualized most of the servers I had at home onto an ESXi 5.1 platform. This post follows my journey to achieve better network performance between the VMs.

I am quite happy with the setup, as it allowed me to eliminate 5-6 physical boxes in favor of one (very strong) machine. I was also able to achieve some performance improvements, but not to the degree I had hoped to see.

I have a variety of machines running in a virtualized form:
1. Windows 8 as my primary desktop, passing a dedicated GPU and USB card from the host to the VM using VMDirectPath
2. Multiple Linux servers
3. Solaris 11.1 as NAS, running the great napp-it software (http://www.napp-it.org/) 

All the machines have the latest VMware Tools installed and run paravirtualized drivers where possible.

VM to VM network performance has been great between the Windows/Linux boxes once I enabled jumbo frames.
Throughout this post I'll use iperf to measure network performance. It's a great, easy-to-use tool, and you can find precompiled versions for almost any operating system at http://iperf.fr/

Let's start with an example of network throughput from the Windows 8 machine to Linux:










11.3 Gbps, not bad. CPU utilization was around 25% on the Windows box throughout the test.

Network performance between the Solaris VM and any other machine on the host was relatively bad.
I started by using the E1000G virtual adapter, as recommended by VMware for Solaris 11 (http://kb.vmware.com/kb/2032669). We'll use one of my Linux VMs (at 192.168.1.202) as the server for these tests. Using iperf to test:















1.36 Gbps. Not bad between physical servers, but unacceptable between VMs on the same host. Also notice the very high CPU utilization during the test: around 80% system time.

My immediate instinct was to enable jumbo frames. Although the adapter driver is supposed to support jumbo frames, I was unable to enable them no matter how hard I fought with it.


root@solaris-lab:/kernel/drv# dladm set-linkprop -p mtu=9000 net0
dladm: warning: cannot set link property 'mtu' on 'net0': link busy

I gave up on getting better performance from the E1000G adapter and switched to VMXNET3. I immediately saw an improvement:















2.31 Gbps, but more importantly, the CPU utilization was much lower.

Now let's try to enable jumbo frames for the vmxnet3 adapter. I followed the steps in http://kb.vmware.com/kb/2012445 and http://kb.vmware.com/kb/2032669 without success. The commands ran without errors, but jumbo frames were not actually enabled. We can test with a 9000-byte ping:

root@solaris-lab:~# ping -s 192.168.1.202 9000 4
PING 192.168.1.202: 9000 data bytes
----192.168.1.202 PING Statistics----
4 packets transmitted, 0 packets received, 100% packet loss
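
A side note on sizing this kind of test, as a minimal sketch: an ICMP echo adds an 8-byte ICMP header and, assuming IPv4 without options, a 20-byte IP header. With a 9000-byte MTU, the largest payload that fits in a single unfragmented packet is therefore 8972 bytes, and a 9000-byte payload has to be fragmented at the IP layer. The helper name below is made up for illustration.

```python
IPV4_HEADER_BYTES = 20  # assumption: IPv4 header with no options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one IP packet of size `mtu`."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(f"MTU {mtu}: max unfragmented ping payload = {max_ping_payload(mtu)} bytes")
```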


As my next step I was planning on running some dtrace commands, when I accidentally noticed that the installed drivers were the Solaris 10 version and not the Solaris 11 version.


root@solaris-lab:~/vmware-tools-distrib# find /kernel/drv/ -ls |grep vmxnet3
78669    2 -rw-r--r--   1 root     root         1071 Mar 27 01:42 /kernel/drv/vmxnet3s.conf
78671   34 -rw-r--r--   1 root     root        34104 Mar 27 01:42 /kernel/drv/amd64/vmxnet3s
78670   25 -rw-r--r--   1 root     root        24440 Mar 27 01:42 /kernel/drv/vmxnet3s
root@solaris-lab:~/vmware-tools-distrib# find . -ls |grep vmxnet3
  231   25 -rw-r--r--   1 root     root        24528 Nov 17 07:55 ./lib/modules/binary/2009.06/vmxnet3s
  234    2 -rw-r--r--   1 root     root         1071 Nov 17 07:55 ./lib/modules/binary/2009.06/vmxnet3s.conf
  250    2 -rw-r--r--   1 root     root         1071 Nov 17 07:55 ./lib/modules/binary/10/vmxnet3s.conf
  244   25 -rw-r--r--   1 root     root        24440 Nov 17 07:55 ./lib/modules/binary/10/vmxnet3s
  262   34 -rw-r--r--   1 root     root        34104 Nov 17 07:55 ./lib/modules/binary/10_64/vmxnet3s
  237   35 -rw-r--r--   1 root     root        35240 Nov 17 07:55 ./lib/modules/binary/11_64/vmxnet3s
  227   34 -rw-r--r--   1 root     root        34256 Nov 17 07:55 ./lib/modules/binary/2009.06_64/vmxnet3s
  253   25 -rw-r--r--   1 root     root        24672 Nov 17 07:55 ./lib/modules/binary/11/vmxnet3s
  259    2 -rw-r--r--   1 root     root         1071 Nov 17 07:55 ./lib/modules/binary/11/vmxnet3s.conf


This is very strange, as installing the Tools is a straightforward procedure with no room for user error.

So I decided to open the Tools installation script (Perl) and found an interesting bug:


...
sub configure_module_solaris {
  my $module = shift;
  my %patch;
  my $dir = db_get_answer('LIBDIR') . '/modules/binary/';
  my ($major, $minor) = solaris_os_version();
  my $os_name = solaris_os_name();
  my $osDir;
  my $osFlavorDir;
  my $currentMinor = 10;   # The most recent version we build the drivers for

  if (solaris_10_or_greater() ne "yes") {
    print "VMware Tools for Solaris is only available for Solaris 10 and later.\n";
    return 'no';
  }

  if ($minor < $currentMinor) {
    $osDir = $minor;
  } else {
    $osDir = $currentMinor;
  }
For Solaris 11.1, $minor is 11, so the else branch runs and $osDir is forced to 10, the Solaris 10 driver directory. Bug?
Either way it's very easy to fix: just change "<" to ">" in the comparison:

if ($minor > $currentMinor) {
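
To make the effect of that comparison concrete, here is a small Python model of the selection logic. It is a sketch of the excerpt above, not the installer's actual Perl, and the function names are made up for illustration.

```python
CURRENT_MINOR = 10  # mirrors $currentMinor in the installer excerpt


def os_dir_original(minor: int) -> int:
    """Selection as written in the excerpt: effectively min(minor, 10)."""
    return minor if minor < CURRENT_MINOR else CURRENT_MINOR


def os_dir_patched(minor: int) -> int:
    """Selection after changing '<' to '>': effectively max(minor, 10)."""
    return minor if minor > CURRENT_MINOR else CURRENT_MINOR


if __name__ == "__main__":
    for minor in (10, 11):
        print(f"minor={minor}: original -> {os_dir_original(minor)}, "
              f"patched -> {os_dir_patched(minor)}")
```

With minor set to 11, the original comparison picks directory 10 while the patched one picks 11, which matches the driver sizes seen before and after the reinstall.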

Re-install Tools using the modified script and reboot. 
Let's check the installed driver now:



root@solaris-lab:~/vmware-tools-distrib# find /kernel/drv/ -ls |grep vmxnet3
79085    2 -rw-r--r--   1 root     root         1071 Mar 27 02:00 /kernel/drv/vmxnet3s.conf
79087   35 -rw-r--r--   1 root     root        35240 Mar 27 02:00 /kernel/drv/amd64/vmxnet3s
79086   25 -rw-r--r--   1 root     root        24672 Mar 27 02:00 /kernel/drv/vmxnet3s
root@solaris-lab:~/vmware-tools-distrib# find . -ls |grep vmxnet3
  231   25 -rw-r--r--   1 root     root        24528 Nov 17 07:55 ./lib/modules/binary/2009.06/vmxnet3s
  234    2 -rw-r--r--   1 root     root         1071 Nov 17 07:55 ./lib/modules/binary/2009.06/vmxnet3s.conf
  250    2 -rw-r--r--   1 root     root         1071 Nov 17 07:55 ./lib/modules/binary/10/vmxnet3s.conf
  244   25 -rw-r--r--   1 root     root        24440 Nov 17 07:55 ./lib/modules/binary/10/vmxnet3s
  262   34 -rw-r--r--   1 root     root        34104 Nov 17 07:55 ./lib/modules/binary/10_64/vmxnet3s
  237   35 -rw-r--r--   1 root     root        35240 Nov 17 07:55 ./lib/modules/binary/11_64/vmxnet3s
  227   34 -rw-r--r--   1 root     root        34256 Nov 17 07:55 ./lib/modules/binary/2009.06_64/vmxnet3s
  253   25 -rw-r--r--   1 root     root        24672 Nov 17 07:55 ./lib/modules/binary/11/vmxnet3s
  259    2 -rw-r--r--   1 root     root         1071 Nov 17 07:55 ./lib/modules/binary/11/vmxnet3s.conf

Now we have the correct version installed. 

Let's enable jumbo-frames as before and check if it made any difference:

root@solaris-lab:~# ping -s 192.168.1.202 9000 4
PING 192.168.1.202: 9000 data bytes
9008 bytes from 192.168.1.202: icmp_seq=0. time=0.338 ms
9008 bytes from 192.168.1.202: icmp_seq=1. time=0.230 ms
9008 bytes from 192.168.1.202: icmp_seq=2. time=0.289 ms
9008 bytes from 192.168.1.202: icmp_seq=3. time=0.294 ms
----192.168.1.202 PING Statistics----
4 packets transmitted, 4 packets received, 0% packet loss
round-trip (ms)  min/avg/max/stddev = 0.230/0.288/0.338/0.044

Success! Jumbo frames are working.


Let's test throughput with iperf:















Less than 1 Mb/s, not what we expected at all!
We need to take a deeper look at the packets being sent. Let's use tcpdump to create a trace file:

root@solaris-lab:~# tcpdump -w pkts.pcap -s 100 -inet1 & PID=$! ; sleep 1s ; ./iperf -t1 -c192.168.1.202; kill $PID
[1] 1726
tcpdump: listening on net1, link-type EN10MB (Ethernet), capture size 100 bytes
------------------------------------------------------------
Client connecting to 192.168.1.202, TCP port 5001
TCP window size: 48.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.206 port 35084 connected with 192.168.1.202 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.3 sec    168 KBytes  1.02 Mbits/sec
70 packets captured
70 packets received by filter
0 packets dropped by kernel

Then open it in Wireshark for easier analysis:












The problem is clear at packet 7: the driver is trying to send a 16 KB packet, well above our 9000-byte jumbo-frame MTU. This packet never makes it out of the VM, and only after a timeout is it split into smaller packets and retransmitted. This repeats for every large packet, which adds a massive delay and drives throughput very low.

Reviewing the vmxnet3 driver source (open source at http://sourceforge.net/projects/open-vm-tools/), it seems the only way a packet larger than the MTU can be sent is if the LSO feature is enabled.
To learn more about LSO (Large Segment Offload), read http://en.wikipedia.org/wiki/Large_segment_offload.
Essentially, the kernel sends large packets (16K in the capture) and the NIC (or virtual NIC) is supposed to split each one and transmit valid-size packets. On a real hardware NIC, at high speeds, this saves considerable amounts of CPU. In a virtualized environment I don't see the benefit, and here it seems to be badly broken.
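
As a rough illustration of what the offload is supposed to do, the sketch below splits one large send into segments no bigger than the MSS. It assumes an MSS of 8960 bytes (a 9000-byte MTU minus 20-byte IPv4 and 20-byte TCP headers, no options), and the function name is hypothetical. It models only the size arithmetic, not the driver's behavior.

```python
def segment(payload_len: int, mss: int) -> list[int]:
    """Split one large send of payload_len bytes into MSS-sized TCP segments."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    full, rest = divmod(payload_len, mss)
    return [mss] * full + ([rest] if rest else [])


if __name__ == "__main__":
    # A 16 KB send, like the oversized packet in the capture,
    # with an assumed MSS of 8960 bytes.
    print(segment(16 * 1024, 8960))  # prints [8960, 7424]
```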

Let's disable LSO:

ndd -set /dev/ip ip_lso_outbound 0

And try to run iperf again:














12.1 Gbps, SUCCESS!

Now that we are able to transmit out of Solaris at decent rates, let's check the performance of connections into the Solaris VM:









3.74 Gbps, not bad, but we can do better - let's at least get to 10Gbps.

The next step is to tune the TCP parameters to accommodate the higher speed. The buffers are simply too small for the amount of data in flight:


root@solaris-lab:~# ipadm set-prop -p max_buf=4194304 tcp
root@solaris-lab:~# ipadm set-prop -p recv_buf=1048576 tcp
root@solaris-lab:~# ipadm set-prop -p send_buf=1048576 tcp
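
A rough way to see why the buffers matter is the bandwidth-delay product: a single TCP stream can have at most one window of data in flight per round trip, so throughput is capped at roughly window / RTT. The sketch below plugs in the 48 KB default window that iperf reported and the ~0.29 ms average round-trip time from the earlier ping output. Treating the idle ping RTT as the RTT under load is an assumption, so the numbers are illustrative only.

```python
def max_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on single-stream TCP throughput for a given window and RTT."""
    return window_bytes * 8 / rtt_seconds


def required_window_bytes(target_bps: float, rtt_seconds: float) -> float:
    """Window (bandwidth-delay product) needed to sustain target_bps."""
    return target_bps * rtt_seconds / 8


if __name__ == "__main__":
    rtt = 0.288e-3  # seconds; assumed from the ping average, not measured under load
    cap = max_throughput_bps(48 * 1024, rtt)
    need = required_window_bytes(10e9, rtt)
    print(f"48 KB window caps one stream at about {cap / 1e9:.2f} Gbps")
    print(f"10 Gbps needs a window of about {need / 1024:.0f} KB")
```

Under these assumptions a 48 KB window tops out well below 10 Gbps, while the 1 MB send and receive buffers set above leave comfortable headroom.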

And run iperf again:













18.3 Gbps, not bad!

58 comments:

  1. Hey, cool write up! :-)

    I think there is an error in the code that detects the solaris version.

    This line:

    if ($minor < $currentMinor) {

    should be:
    if ($minor > $currentMinor) {

    /Jannich

    Replies
    1. Thank you for the comment, I've fixed the typo.

  2. I always wanted to put my Solaris box on ESXi, but the abysmal network performance killed it.

    This is super helpful!

  3. FYI, when testing with jumbo-frame pings, you need to allow 28 bytes for the IP and ICMP headers. Try 8972 instead of 9000 in your ping command ;-)

  4. Great writeup, would love to see a post with more specifics on your hardware setup.

  5. TCP performance is related to delay (RTT) and TCP window size (if we don't have any loss). You should use -w in iperf to set the window size and repeat the same test on the Windows and Linux PCs.

  6. Great write up and shows amazing results.. Thanks for spending your time to provide such information greatly appreciated..

  7. You could enable jumbo frame on e1000g vnic by changing MaxFrameSize in /kernel/drv/e1000g.conf

    Default:
    MaxFrameSize=0,0,0,0 ...
    Change to:
    MaxFrameSize=3,3,3,3 ...

    Reboot.

    -----

    Anyway, thanks for ipadm set-prop tips.
    Regards.

  8. The Solaris vmxnet3 driver has so many problems:

    - the LSO problem (there's a source patch that claims to fix the problem, but I haven't tried it: http://www.mail-archive.com/open-vm-tools-devel@lists.sourceforge.net/msg00812.html)

    - the garbage debug output printed to console

    - requiring the ndd 'accept-jumbo' flag to be set before MTU can be changed (why?!?!)

    I'm seriously thinking about taking the source from open-vm-tools, throwing it on github, and fixing these problems

  9. In the latest update, 5.1 U1, the garbage output in the driver is fixed after updating to the latest VMware Tools. I was not able to test ndd accept-jumbo or LSO yet, but will check and report back!

  10. Can you pls help understand the latter part of the sentence "On a real hardware NIC, at high speeds, this saves considerable amounts of CPU. in a virtualized environment I don't see the benefit"
    i.e. why do you say there is no benefit?

    thanks

  11. Sure - on a real hardware NIC the segmentation is done in the NIC itself, meaning the host needs to build far fewer packets, calculate CRC, etc. The work of building the packet headers takes CPU on the host, and offloading it to the hardware can save considerable load. Now, in a virtualized environment, there is no physical hardware to build headers; it is simulated by the ESXi host, still taking CPU. You are shifting load from the VM to the host, but in total you don't save any computation done on the main CPU.

  12. How were you able to hit speeds OVER 10Gbits/sec when the driver itself is a 10Gb driver??
    Please let me know, I've been testing with the vmxnet3 drivers in windows and linux.

    Replies
    1. Nothing is limiting the driver to 10Gb. It can go much faster.

  13. Nice post. But I don't understand why, in my ESXi 5.1 U1 lab, the iperf speed between my Windows Server 2003 and a Linux VM is about 300Mbit/s, very poor. I didn't enable jumbo frames, but both have the vmxnet3 driver installed.
    Can somebody shed some light on troubleshooting this?

    Replies
    1. I'd be happy to help.
      The first step is to isolate the direction - are you getting 300Mb/s in both directions (linux->win and win->linux)?
      Did you try to take a tcpdump and look at the results? I'd be happy to examine it for you.

  14. Hi Cyber Explorer,
    Can you share some tips for tuning Linux and Windows network performance if there is any?

    My machine is not strong ( AMD 1.6Ghz x 2 ), but currently I get only 500Mbit/s transfer at max, no matter what OS it is, Linux, Windows, FreeBSD or Solaris. Is it normal?

    Replies
    1. My hunch is that you can do better. I don't have experience with these specific CPUs, but I have a standalone server (not virtualized) based on an Intel Atom D510 and I am able to saturate its 1 GbE port after network tuning. I believe the D510 is weaker than your AMD CPUs.

    2. Today I tried to locate the problem. I have Windows Server 2003 R2 vm as a domain controller. And first I tried the iperf with the loop address to exclude other impacts. And the result is:
      C:\iperf-2.0.5-2-win32>iperf.exe -c127.0.0.1
      ------------------------------------------------------------
      Client connecting to 127.0.0.1, TCP port 5001
      TCP window size: 64.0 KByte (default)
      ------------------------------------------------------------
      [ 3] local 127.0.0.1 port 3004 connected with 127.0.0.1 port 5001
      [ ID] Interval Transfer Bandwidth
      [ 3] 0.0-10.0 sec 761 MBytes 637 Mbits/sec

      C:\iperf-2.0.5-2-win32>iperf.exe -w 125K -c127.0.0.1
      ------------------------------------------------------------
      Client connecting to 127.0.0.1, TCP port 5001
      TCP window size: 125 KByte
      ------------------------------------------------------------
      [ 3] local 127.0.0.1 port 3005 connected with 127.0.0.1 port 5001
      [ ID] Interval Transfer Bandwidth
      [ 3] 0.0-10.0 sec 814 MBytes 682 Mbits/sec

      It would eat up 80% of CPU usage, so maybe it's a CPU problem, isn't it?

    3. do you have jumbo frames enabled?

    4. Nope. Later I found that if I change window size to 1M, I can get 900Mbit/s throughput for loop address on Windows 2003. But I think it's Windows problem, because when testing iperf on my OpenIndiana vm through loop address, I can get 7Gbit/s, which is satisfying.

      Now what can I do next? Since I still cannot get a satisfying speed between VMs under the same port group.

    5. so that's the problem - enable jumbo frames on the windows box and you should get a much better throughput with lower cpu.

    6. Since the physical switch which connects to the esxi server doesn't support jumbo frames, can I enable it on esxi? Will it influence the physical clients?

    7. Good question, I'm not sure. best would be to simply try...

  15. USB passthrough is broken in 5.1. How were you able to pass GPU AND USB through on 5.1?? After doing your tuning, did you notice any other bugs with the vmxnet3 adapter?

    Replies
    1. I was able to stabilize USB pass-through on one specific version, any upgrade I attempted broke it and I had to roll back. It's ESXi 5.1.0 1021289.
      Same goes for the USB PCI-E adapter: tried a few and only one worked - http://www.amazon.com/gp/product/B005ARQV6U?ie=UTF8&camp=213733&creative=393185&creativeASIN=B005ARQV6U&linkCode=shr&tag=cybeblog-20&psc=1

    2. As for other bugs - other than the known bug that's just making noise in the log, it's been extremely stable.

  16. by the way, can you please show us your whitebox's spec, so we can use it as a reference?
    Thanks

  17. I've been trying to break past an 800MB/sec nfs bottleneck between two linux guests on the same system using a standard vswitch with no physical nics assigned. i even tried bonding two different vswitches without any improvement. i also tried tweaking the various tcp window sizes. MTU is set to 9k, and tcpdump output appears to validate that:


    16:44:13.639338 IP cgx.51236 > nfsa.commplex-link: Flags [.], seq 126259665:126268613, ack 1, win 140, options [nop,nop,TS val 74413494 ecr 78849579], length 8948
    16:44:13.639347 IP cgx.51236 > nfsa.commplex-link: Flags [P.], seq 126268613:126271513, ack 1, win 140, options [nop,nop,TS val 74413494 ecr 78849579], length 2900
    16:44:13.639355 IP cgx.51236 > nfsa.commplex-link: Flags [.], seq 126271513:126280461, ack 1, win 140, options [nop,nop,TS val 74413494 ecr 78849579], length 8948
    16:44:13.639359 IP nfsa.commplex-link > cgx.51236: Flags [.], ack 126241769, win 21993, options [nop,nop,TS val 78849579 ecr 74413494], length 0
    16:44:13.639376 IP cgx.51236 > nfsa.commplex-link: Flags [.], seq 126280461:126289409, ack 1, win 140, options [nop,nop,TS val 74413494 ecr 78849579], length 8948
    16:44:13.639386 IP cgx.51236 > nfsa.commplex-link: Flags [.], seq 126289409:126298357, ack 1, win 140, options [nop,nop,TS val 74413494 ecr 78849579], length 8948
    16:44:13.639387 IP nfsa.commplex-link > cgx.51236: Flags [.], ack 126259665, win 21993, options [nop,nop,TS val 78849579 ecr 74413494], length 0
    16:44:13.639393 IP cgx.51236 > nfsa.commplex-link: Flags [.], seq 126298357:126307305, ack 1, win 140, options [nop,nop,TS val 74413494 ecr 78849579], length 8948
    16:44:13.639400 IP nfsa.commplex-link > cgx.51236: Flags [.], ack 126280461, win 21927, options [nop,nop,TS val 78849579 ecr 74413494], length 0
    16:44:13.639405 IP cgx.51236 > nfsa.commplex-link: Flags [.], seq 126307305:126316253, ack 1, win 140, options [nop,nop,TS val 74413494 ecr 78849579], length 8948
    16:44:13.639416 IP cgx.51236 > nfsa.commplex-link: Flags [P.], seq 126316253:126320665, ack 1, win 140, options [nop,nop,TS val 74413494 ecr 78849579], length 4412
    16:44:13.639426 IP cgx.51236 > nfsa.commplex-link: Flags [.], seq 126320665:126329613, ack 1, win 140, options [nop,nop,TS val 74413494 ecr 78849579], length 8948
    16:44:13.639435 IP nfsa.commplex-link > cgx.51236: Flags [.], ack 126298357, win 21927, options [nop,nop,TS val 78849579 ecr 74413494], length 0
    16:44:13.639435 IP cgx.51236 > nfsa.commplex-link: Flags [.], seq 126329613:126338561, ack 1, win 140, options [nop,nop,TS val 74413494 ecr 78849579], length 8948
    16:44:13.639444 IP cgx.51236 > nfsa.commplex-link: Flags [.], seq 126338561:126347509, ack 1, win 140, options [nop,nop,TS val 74413494 ecr 78849579], length 8948
    16:44:13.639453 IP cgx.51236 > nfsa.commplex-link: Flags [.], seq 126347509:126356457, ack 1, win 140, options [nop,nop,TS val 74413494 ecr 78849579], length 8948



    but the iperf numbers are not very good.

    using the default frame sizes:
    TCP window size: 29.0 KByte (default)
    ------------------------------------------------------------
    [ 3] local 10.0.0.2 port 34335 connected with 10.0.0.1 port 5001
    [ 3] 0.0-10.0 sec 5.20 GBytes 4.47 Gbits/sec

    Server listening on TCP port 5001
    TCP window size: 85.3 KByte (default)
    ------------------------------------------------------------
    [ 4] local 10.0.0.1 port 5001 connected with 10.0.0.2 port 34335
    [ 4] 0.0-10.0 sec 5.20 GBytes 4.46 Gbits/sec


    using tweaked window sizes of 640KB etc, showed no difference.

    the same issue is on both esxi 5.1.0 build 1065491 and 799733.

    just can't seem to get past the 4.5Gb/sec.

    anyone have any thoughts?


    Replies
    1. wanted to add that the cpu usage is minimal. we are using the vmxnet3 drivers. i've tried various versions (the 1.1.18 through 1.1.29 and 1.1.32). also tried disabling LRO in the vmware settings (as some other web searches suggested), and the gso/tso/lro in the guest. none of the various combinations make any difference.

    2. can you paste a longer tcpdump output, maybe through http://pastebin.com/ ?
      Please include the tcp handshake, I want to see the window scale parameters.

      My gut feeling is that you are exhausting your rcv buffers, although that should not kick in before 15-20Gbps on modern linux kernels and hardware.

    3. this is from the first part of the tcpdump. i used the command you indicated earlier. i didn't want to post the whole thing. the section I posted earlier was from the middle. If you tell me what keywords you need (assuming this isn't it) I will look for them. here also is the sysctl settings I used which result in the same performance. I'm a little stumped because you said your linux2windows performance was fine. this is linux2linux (centos 6.2 - 2.6.32-220 to a 2.6.32-400).

      net.core.rmem_max = 16777216
      net.core.wmem_max = 16777216
      ##net.core.rmem_default = 33554432
      ##net.core.wmem_default = 33554432
      net.ipv4.tcp_mem = 16777216 16777216 16777216
      net.ipv4.tcp_rmem = 4096 873800 16777216
      net.ipv4.tcp_wmem = 4096 655360 16777216
      #net.ipv4.tcp_wmem = 4096 8738000 16777216
      net.core.netdev_max_backlog = 30000
      vm.min_free_kbytes = 2097152


      and the beginning of the tcpdump...

      16:44:13.416153 IP cgx.51236 > nfsa.commplex-link: Flags [S], seq 1689649426, win 17920, options [mss 8960,sackOK,TS val 74413271 ecr 0,nop,wscale 7], length 0
      16:44:13.416318 IP nfsa.commplex-link > cgx.51236: Flags [S.], seq 3605531965, ack 1689649427, win 17896, options [mss 8960,sackOK,TS val 78849356 ecr 74413271,nop,wscale 7], length 0
      16:44:13.416358 IP cgx.51236 > nfsa.commplex-link: Flags [.], ack 1, win 140, options [nop,nop,TS val 74413271 ecr 78849356], length 0
      16:44:13.416397 IP cgx.51236 > nfsa.commplex-link: Flags [P.], seq 1:25, ack 1, win 140, options [nop,nop,TS val 74413271 ecr 78849356], length 24
      16:44:13.416425 IP cgx.51236 > nfsa.commplex-link: Flags [.], seq 25:8973, ack 1, win 140, options [nop,nop,TS val 74413271 ecr 78849356], length 8948
      16:44:13.416585 IP nfsa.commplex-link > cgx.51236: Flags [.], ack 25, win 140, options [nop,nop,TS val 78849356 ecr 74413271], length 0
      16:44:13.416597 IP cgx.51236 > nfsa.commplex-link: Flags [.], seq 8973:17921, ack 1, win 140, options [nop,nop,TS val 74413271 ecr 78849356], length 8948
      16:44:13.416632 IP nfsa.commplex-link > cgx.51236: Flags [.], ack 8973, win 272, options [nop,nop,TS val 78849356 ecr 74413271], length 0
      16:44:13.416647 IP cgx.51236 > nfsa.commplex-link: Flags [P.], seq 17921:26869, ack 1, win 140, options [nop,nop,TS val 74413271 ecr 78849356], length 8948
      16:44:13.416667 IP cgx.51236 > nfsa.commplex-link: Flags [.], seq 26869:35817, ack 1, win 140, options [nop,nop,TS val 74413271 ecr 78849356], length 8948
      16:44:13.416667 IP nfsa.commplex-link > cgx.51236: Flags [.], ack 17921, win 227, options [nop,nop,TS val 78849357 ecr 74413271], length 0

    4. these are the original defaults on one of the nodes (before any sysctl changes). we have two different physical hosts. they both have the same guests. we've been experimenting with changing the sysctl settings on one to see what effect it has and kept the other physical host's guests at the defaults.

      [root@cgx1 /]# sysctl -a | grep rcv
      net.ipv4.tcp_moderate_rcvbuf = 1
      [root@cgx1 /]# sysctl -a | grep recv
      [root@cgx1 /]# sysctl -a | grep rmem
      net.core.rmem_max = 131071
      net.core.rmem_default = 124928
      net.ipv4.tcp_rmem = 4096 87380 4194304
      net.ipv4.udp_rmem_min = 4096


      so far, both are performing exactly the same.

    5. The tcpdump command you are using is correct, but I need a longer snippet to try and understand the flow of packets, at least a few thousand lines. Best would be to paste it at a site like http://pastebin.com/, and then reply with the link to the paste here.
      Also, it's much easier to troubleshoot with iperf than NFS (or anything else). Can you run the tcpdump with iperf traffic?
      One more request - run tcpdump on both nodes concurrently when you test with iperf and attach the log.

    6. I did run it with iperf. I used EXACTLY the command you gave above in your original post.

      i ran it again to get both client and server

      server: http://pastebin.com/m5vXBv0U

      client: http://pastebin.com/EHQAW0Lr

  18. Sorry for taking so long to answer, I've been traveling for work.
    The traces are very interesting. Nothing in the packet flow/congestion seems to limit the performance at all.
    The issue is the rate you are sending packets out - once every 7-10uS. My linux boxes send packets out every 1-2uS, until they reach a buffer/bandwidth bottleneck at around 22 Gbps.

    anything on your system that can limit sending packets? for example, is it a multi core cpu with one core pegged at 100% ?
    maybe (but unlikely) a packet shaper or firewall on the machine?

    Replies
    1. firewall is chkconfiged off (iptables and ip6tables)

      8 vcpu's are assigned. the physical host is dual cpu each cpu has 6 hyperthreaded cores. i don't see any single cpu limited.

      as far as i know, no packet shaper is on. it is a stock centos 6.2 install on one, and a stock 6.2 with upgrades to 2.6.32-400 on the other (for ocfs2).

    2. I'll try to install over the weekend the same configuration and see what performance i get. will let you know what i find.

    3. I used a live CD for CentOS 6.2, installed iperf straight out of the box, enabled MTU 9000, and I'm getting 15-20Gbps. Nothing else was installed or touched on the box - vanilla drivers, not even the VMware Tools drivers.
      Before enabling MTU 9000 I was seeing exactly the same problem as you.

      can you please post the output of -
      ethtool -i eth0
      ifconfig eth0

    4. this is the output for eth1. eth0 has a physical nic associated with it for management. eth1 is the purely virtual guest-2-guest vswitch.

      [root@cgx1 ~]# ethtool -i eth1
      driver: vmxnet3
      version: 1.1.18.0-k-NAPI
      firmware-version: N/A
      bus-info: 0000:1b:00.0
      [root@cgx1 ~]# ifconfig eth1
      eth1 Link encap:Ethernet HWaddr 00:0C:29:7E:CD:6E
      inet addr:10.0.0.1 Bcast:10.0.0.255 Mask:255.255.255.0
      inet6 addr: fe80::20c:29ff:fe7e:cd6e/64 Scope:Link
      UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
      RX packets:13255527 errors:0 dropped:149 overruns:0 frame:0
      TX packets:16059506 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 txqueuelen:1000
      RX bytes:5981766938 (5.5 GiB) TX bytes:159450144144 (148.4 GiB)

    5. That is identical to what I see on mine, but I'm getting much better throughput.
      One way to continue troubleshooting would be to create another VM, run the CentOS 6.2 live CD, install iperf, run ifconfig eth0 mtu 9000 and test. If this shows good performance, you need to understand the delta between the live CD and your instance.

  19. I have a question about the version bug in Solaris where vmtools installs for version 10 instead of 11. Is this still in the code? I can't find the line to change. The problem is that I also have version 10 installed, and when I do "uname -r" I get 5.11, meaning I should have version 11 installed. Do you mind telling us where (if it still exists) in the code we need to change the "<" to ">"?

    Replies
    1. what version of ESXi do you use? can you paste the installation script (or at least the relevant parts) over at http://pastebin.com and reply with the url?

    2. Ahh finally found it, thought it was in the "vmware-install.pl" but it's actually in a subfolder "./bin/vmware-config-tools.pl". Hopefully this will help someone =) and thx for very useful blogpost!

  20. I have never, ever.....gotten vmxnet3 to perform correctly in esx aio. NFS share disconnects from ESX all the time, no matter what the MTU. e1000g just works. This is the same in ESXi 5.5

  21. I'm trying to boost my speed from my Win7 VM to my Ubuntu12.04 VM and I can't seem to get past 2Gbps in either direction. From what I can tell I have Jumbo Frames enabled on both and I'm using the VMXNET3 driver on each. Link states 10gbps on each. When I just run the default client and servers I'm stuck at 2Gbps but if I explicitly set the window size with the -w flag on each I can get up to about 6Gbps. My questions are: if TCP is configured properly shouldn't the window size scale automatically without being set? When I transfer files over SMB/CIFS I can't seem to move past the 2Gbps limit either (I've got an SSD capable of going faster)

  22. Well, I'm at about 4Gbps with iperf after realizing that the splashtop streamer I was running to remotely connect to the VM was slowing the iperf results. But the RWIN still doesn't seem to be auto scaling enough. It will go up to 10gbps if I set window size manually or connect multiple instances with -P but should I have to do that?

  23. In my experience it should not make a difference. Are the results identical if you use Windows as the server vs. Linux as the server?

    Replies
    1. Yeah, the results are basically the same regardless of which is the server and which is the client. There's definitely something on the Win7 box thats keeping the TCP window from scaling on its own correctly. When I use iPerf between Linux clients the default window size appears to adjust to something in the 900K to the 1M range when the client side connects to the server. With Win7 the default always maxes out at 64K and the speed would indicate that it's not scaling beyond that. Only way I can get it to go faster is to explicitly set the Window size higher in iPerf or use multiple TCP streams. There's got to be a config on the Win7 side I'm missing that allows RWIN to auto scale above 64K.

    2. Do you have a packet capture I can look at?
      IIRC Win 7 has some tuning options for the TCP stack - are they at their default values?

    3. Here's a pcap with LSO on:
      https://drive.google.com/file/d/0B7vCQgqzBIZ7Q3poMmY0MUs5WnM/edit?usp=sharing
      And LSO off:
      https://drive.google.com/file/d/0B7vCQgqzBIZ7QjVKMkU3OXIxX28/edit?usp=sharing

      I've used TCPOptimizer to implement better values for Win7, did give about a 1Gbps increase in speed, but nowhere near 10Gbps total

    4. In your capture, 10.10.1.130 has Window Scaling disabled, which means the window can only grow to 65K. I assume this is the Windows 7 machine? It would explain why RWIN doesn't scale beyond 65K and why performance suffers.
      Windows 7 should have TCP WS enabled by default, so something must have disabled it on your machine. Take a look here: http://datacomguy.blogspot.com/2011/06/tweaking-windows-7-vista-tcpip-settings.html

      Try manually changing autotuninglevel to normal or experimental (netsh int tcp set global autotuninglevel=normal)

  24. Nice write-up! This led me to the discovery that my Oracle Solaris 11.1 VM was running the older Solaris 10 vmxnet3 NIC drivers.

    But after replacing the drivers and rebooting, the max MTU is still only 1500.

    I noticed that the ping command you're using to test jumbo frames might be flawed. It's missing the parameter that sets the DF-bit on the echo-request packets.

    Under Oracle Solaris 11.1, the following ping command will test jumbo frames.
    ping -s -D 192.168.1.202 8972 4

    One more thing, would you mind posting the output of 'dladm show-linkprop -p mtu' from your Solaris 11.1 guest VM?

  25. I too have the same problem as Pooch: I can't seem to get past 2Gbps (default window size) VM-to-VM (both VMs are Win 7 and I have assigned each 4 CPUs). One thing is for sure: the speed seems to be affected by how many CPUs each VM has (if I assign each VM 1 CPU, the speed drops to <100Mbps).

    Hi Cyber Explorer, is it ok for you to show us your host machine spec? (Do you have dual 10Gb Ethernet onboard or a PCIe adapter, CPU model, how much RAM, etc.) As well as your VM setup? (Do you use vmxnet3, Virtual Machine Version 8 or 9 or 10, how many CPUs assigned to each VM, memory, etc.) In addition, do you use the vSphere "client" or "web client" (with vCenter installed) to create those VMs?

    I am trying to follow your setup and see if I can get close to ~20Gbps, and that would be extremely awesome!

    Here is a test result where he too gets around 2Gbps (default window size):
    http://forums.freenas.org/threads/esxi-5-5-network-performance-comparison-with-vmxnet-and-intel-em.15320/

  26. This is common in any communications protocol. In TCP / IP (TCP Offload, Full Kernel Bypass) each of these units of information called "datagram" (datagram), and data sets are sent as separate messages.
    Thanks for sharing nice blog....

  27. Thanks for approving my comment.
    It is extremely interesting for me to read the article. Thank you for it. I like such topics and everything connected to them. I would like to read a bit more on that blog soon.
    10G bit tcp offload

  28. Full TCP offload engine works best with 10 gigabytes Ethernet network adapters. 10G bit TCP offload technology designed for financial institutions like banks, data centers, stock exchanges etc.
    NIC with Full TCP/UDP offload

    Thanks..
