TCP Throughput

TCP Throughput <= Bytes in flight / RTT, where RTT = round-trip time. Max bytes in flight = min(Cwnd, Rwnd, sndbuf, BDP).
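
A minimal sketch of this bound (my own illustration, not from the original text); the example numbers mirror the Rwnd-limited run shown later in this note (100ms RTT, ~6.5MB advertised window):

```python
def max_throughput(cwnd, rwnd, sndbuf, bdp, rtt):
    """Upper bound on TCP throughput in bytes/second.

    All window arguments are in bytes, rtt in seconds.
    Bytes in flight cannot exceed any of the four limits.
    """
    bytes_in_flight = min(cwnd, rwnd, sndbuf, bdp)
    return bytes_in_flight / rtt

# Example: Rwnd-limited flow over a 100 ms RTT path (numbers from the
# "Small Rwnd" run below; BDP is effectively unlimited on the fast link).
limit = max_throughput(cwnd=16.5e6, rwnd=6.5e6, sndbuf=16e6, bdp=1e9, rtt=0.1)
print('%.0f MB/s' % (limit / 1e6))   # ~65 MB/s, close to the measured ~66 MB/s
```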

TCP trace segment graph

tcptrace is a tool written by Shawn Ostermann at Ohio University (http://tcptrace.org). Wireshark can also produce nice interactive time-sequence graphs.
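
To give a rough idea of what these graphs plot, here is a sketch of my own (not part of tcptrace or Wireshark) that draws time vs. sequence number from a capture. It assumes scapy and matplotlib are installed, a capture file named trace.pcap, and 10.0.0.1 as the sender:

```python
#!/usr/bin/env python3
# Rough sketch of a time-sequence graph: plot TCP sequence numbers of the
# sender's data segments against time. File name and sender IP are assumptions.
from scapy.all import rdpcap, IP, TCP
import matplotlib.pyplot as plt

SENDER = '10.0.0.1'              # assumption: h1 is the data sender
pkts = rdpcap('trace.pcap')      # assumption: capture file name

t0 = None
times, seqs = [], []
for p in pkts:
    if IP in p and TCP in p and p[IP].src == SENDER and len(p[TCP].payload) > 0:
        if t0 is None:
            t0 = p.time
        times.append(float(p.time - t0))
        seqs.append(p[TCP].seq)

plt.plot(times, seqs, '.')
plt.xlabel('time (s)')
plt.ylabel('sequence number')
plt.title('time-sequence graph (data segments from sender)')
plt.show()
```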

FreeBSD

Here we show FreeBSD's NewReno congestion control entering slow start after a packet loss.

Mininet

All graphs below are from Linux using CUBIC cc, running under Mininet.

mininet> net
h1 h1-eth0:s1-eth1
h2 h2-eth0:s1-eth2
s1 lo:  s1-eth1:h1-eth0 s1-eth2:h2-eth0

[figure: mininet topology]

Packets are captured on the sender side with `tcpdump -i s1-eth1 -s 128`.
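
The same two-host, one-switch topology can also be built with Mininet's Python API; a minimal sketch, assuming a standard Mininet installation (with TCLink, delay/bw parameters like the netem settings used below can be attached to a link):

```python
#!/usr/bin/env python
# Minimal sketch of the h1 -- s1 -- h2 topology used in this note.
# Assumes a standard Mininet installation; run as root.
from mininet.net import Mininet
from mininet.link import TCLink
from mininet.cli import CLI
from mininet.log import setLogLevel

def run():
    net = Mininet(link=TCLink)   # TCLink allows delay=/bw= (netem) per link
    h1 = net.addHost('h1')
    h2 = net.addHost('h2')
    s1 = net.addSwitch('s1')
    net.addLink(h1, s1)
    net.addLink(h2, s1)          # e.g. net.addLink(h2, s1, delay='100ms', bw=10)
    net.start()
    net.pingAll()                # same as `mininet> pingall`
    CLI(net)                     # drop into the mininet> prompt
    net.stop()

if __name__ == '__main__':
    setLogLevel('info')
    run()
```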

mininet> h1 ping -c 4 h2
PING 10.0.0.2 (10.0.0.2) 56(84) bytes of data.
64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=0.252 ms
64 bytes from 10.0.0.2: icmp_seq=2 ttl=64 time=0.053 ms
64 bytes from 10.0.0.2: icmp_seq=3 ttl=64 time=0.059 ms
64 bytes from 10.0.0.2: icmp_seq=4 ttl=64 time=0.056 ms
mininet> xterm h2  # then start `iperf3 -s` on h2

mininet> h1 iperf3 -c h2
Connecting to host 10.0.0.2, port 5201
[  5] local 10.0.0.1 port 37680 connected to 10.0.0.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  8.00 GBytes  68.7 Gbits/sec    0    666 KBytes
[  5]   1.00-2.00   sec  8.20 GBytes  70.4 Gbits/sec    0    940 KBytes
[  5]   2.00-3.00   sec  7.89 GBytes  67.7 Gbits/sec    0   1.15 MBytes
[  5]   3.00-4.00   sec  7.90 GBytes  67.9 Gbits/sec    0   1.28 MBytes
[  5]   4.00-5.00   sec  7.98 GBytes  68.6 Gbits/sec    0   1.42 MBytes
[  5]   5.00-6.00   sec  8.16 GBytes  70.1 Gbits/sec    0   1.70 MBytes
[  5]   6.00-7.00   sec  8.17 GBytes  70.2 Gbits/sec    0   1.70 MBytes
[  5]   7.00-8.00   sec  8.14 GBytes  69.9 Gbits/sec    0   1.79 MBytes
[  5]   8.00-9.00   sec  8.04 GBytes  69.0 Gbits/sec    0   1.79 MBytes
[  5]   9.00-10.00  sec  7.99 GBytes  68.7 Gbits/sec    0   1.88 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  80.5 GBytes  69.1 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  80.5 GBytes  69.1 Gbits/sec                  receiver

iperf Done.

Add 100ms latency using netem delay 100ms.

mininet> s1 tc qdisc replace dev s1-eth2 root netem delay 100ms

mininet> h1 ping -c 4 h2
PING 10.0.0.2 (10.0.0.2) 56(84) bytes of data.
64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=101 ms
64 bytes from 10.0.0.2: icmp_seq=2 ttl=64 time=100 ms
64 bytes from 10.0.0.2: icmp_seq=3 ttl=64 time=100 ms
64 bytes from 10.0.0.2: icmp_seq=4 ttl=64 time=100 ms

Slow sender

iperf3 -c server --bitrate 10M

[figure: sender, slow sender trace]
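
iperf3's -b/--bitrate throttling happens in the application: it pauses sending whenever the measured rate gets ahead of the target, so data goes out in bursts rather than being spread smoothly on the wire. A toy sketch of the same burst-then-wait pattern (host, port, and rate are placeholders, not from the original text):

```python
#!/usr/bin/env python3
# Toy application-level throttling: send a burst, then sleep until the
# average rate drops back under the target -> bursty traffic on the wire.
import socket, time

TARGET_BPS = 10_000_000 / 8      # 10 Mbit/s target, in bytes per second
CHUNK = 128 * 1024               # write size
data = b'x' * CHUNK

sock = socket.create_connection(('10.0.0.2', 2009))   # placeholder peer
start = time.time()
sent = 0
while time.time() - start < 10:
    sock.sendall(data)
    sent += CHUNK
    # If we are ahead of the target rate, sleep off the difference.
    ahead = sent / TARGET_BPS - (time.time() - start)
    if ahead > 0:
        time.sleep(ahead)
sock.close()
```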

Slow sender using FQ pacing

iperf3 -c server --fq-rate 10M

[figure: sender-fq, slow sender with FQ pacing]

No bursts.
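
iperf3's --fq-rate instead paces at the socket level via the SO_MAX_PACING_RATE option (Linux only, typically together with the fq qdisc), so the kernel spreads the segments evenly. A sketch of setting it directly; the Python socket module does not export this constant, so the numeric value below is taken from the Linux headers and should be verified on your system:

```python
#!/usr/bin/env python3
# Sketch: pace one socket at 10 Mbit/s with SO_MAX_PACING_RATE (Linux).
import socket

SO_MAX_PACING_RATE = 47          # assumption: value from <asm-generic/socket.h>

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
pacing_rate = 10_000_000 // 8    # the option takes bytes per second
sock.setsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE, pacing_rate)
sock.connect(('10.0.0.2', 2009))          # placeholder peer
sock.sendall(b'x' * (1 << 20))            # the kernel paces these writes
sock.close()
```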

Slow receiver

iperf3 -s --server-bitrate-limit 10M

[figure: receiver, slow receiver trace]

Small window size.
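
Regardless of how iperf3 implements its server-side limit, the mechanism on the wire is the same: when the application reads slowly, the socket receive buffer fills up and the advertised window shrinks. A minimal sketch of such a deliberately slow receiver (port and rate are arbitrary choices):

```python
#!/usr/bin/env python3
# Deliberately slow receiver: read a little, sleep a lot, so the receive
# buffer fills up and the advertised window (Rwnd) stays small.
import socket, time

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(('', 2009))             # arbitrary port
srv.listen(1)
conn, peer = srv.accept()
while True:
    chunk = conn.recv(64 * 1024)
    if not chunk:
        break
    time.sleep(0.05)             # ~64 KiB per 50 ms, roughly 10 Mbit/s
conn.close()
srv.close()
```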

Small Rwnd

mininet> h1 bin/tcpperf -c h2
Connected 10.0.0.1:37662 -> 10.0.0.2:2009, congestion control: cubic
Time (s)  Throughput   Bitrate    Cwnd    Rwnd  sndbuf  ssthresh  rtt/var
  0.000s   0.00kB/s   0.00kbps  14.1Ki  42.4Ki  85.3Ki    2048Mi  201.2ms/100593
  1.048s   21.0MB/s    168Mbps  10.5Mi  6094Ki  16.0Mi    2048Mi  100.2ms/56  retrans=3
  2.050s   65.2MB/s    522Mbps  16.5Mi  6518Ki  16.0Mi    2048Mi  100.4ms/107
  3.054s   66.6MB/s    533Mbps  16.5Mi  6546Ki  16.0Mi    2048Mi  100.4ms/80
  4.058s   66.7MB/s    534Mbps  16.5Mi  6520Ki  16.0Mi    2048Mi  100.5ms/61
  5.063s   66.7MB/s    533Mbps  16.5Mi  6546Ki  16.0Mi    2048Mi  100.5ms/80
  6.066s   66.8MB/s    534Mbps  16.5Mi  6520Ki  16.0Mi    2048Mi  100.4ms/81
  7.070s   66.6MB/s    533Mbps  16.5Mi  6546Ki  16.0Mi    2048Mi  100.4ms/69
  8.074s   66.8MB/s    534Mbps  16.5Mi  6520Ki  16.0Mi    2048Mi  100.4ms/68
  9.077s   66.9MB/s    535Mbps  16.5Mi  6552Ki  16.0Mi    2048Mi  100.3ms/77
 10.081s   66.7MB/s    534Mbps  16.5Mi  6548Ki  16.0Mi    2048Mi  100.4ms/115
Transferred 623MBytes in 10.238s, 4754 syscalls, 131072.0 Bytes/syscall

Throughput is limited by Rwnd (snd_wnd): 100ms * 66.7MB/s ≈ 6.5MB, which matches the advertised window above.

[figure: recv, Rwnd-limited trace]

The window is filled up as soon as it is advertised.

Set a larger tcp_rmem on the receiver to get a larger Rwnd.

mininet> h2 sysctl -A |grep tcp_.mem
net.ipv4.tcp_rmem = 10240   87380   16777216
net.ipv4.tcp_wmem = 10240   87380   16777216
mininet> h2 sysctl -A |grep tcp_adv
net.ipv4.tcp_adv_win_scale = 1
mininet> h2 sysctl -w net.ipv4.tcp_rmem="10240 131072 65536000"
net.ipv4.tcp_rmem = 10240 131072 65536000

With net.ipv4.tcp_adv_win_scale = 1, the maximum Rwnd = tcp_rmem[2] / 2 ≈ 32MB.
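
The kernel reserves part of the receive buffer for overhead according to tcp_adv_win_scale; a sketch of that calculation (a simplified mirror of the kernel's tcp_win_from_space(), ignoring receive-buffer autotuning):

```python
def tcp_win_from_space(space, tcp_adv_win_scale):
    """Maximum advertised window for a given receive buffer size (bytes).

    For scale > 0, a 1/2^scale share of the buffer is reserved for
    overhead; for scale <= 0 the window is the buffer shifted down.
    """
    if tcp_adv_win_scale <= 0:
        return space >> -tcp_adv_win_scale
    return space - (space >> tcp_adv_win_scale)

print(tcp_win_from_space(65536000, 1))   # 32768000 bytes, i.e. ~32 MB
```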

Small sndbuf

mininet> h1 bin/tcpperf -c h2
Connected 10.0.0.1:45092 -> 10.0.0.2:2009, congestion control: cubic
Time (s)  Throughput   Bitrate    Cwnd    Rwnd  sndbuf  ssthresh  rtt/var
  0.000s   0.00kB/s   0.00kbps  14.1Ki  42.4Ki  85.3Ki    2048Mi  100.7ms/50326
  1.045s   18.3MB/s    147Mbps  10.2Mi  8474Ki  16.0Mi    2048Mi  103.6ms/6137  retrans=2
  2.034s    148MB/s   1184Mbps  31.4Mi  31.2Mi  16.0Mi    2048Mi  100.0ms/20
  3.014s    159MB/s   1272Mbps  31.4Mi  31.2Mi  16.0Mi    2048Mi  100.0ms/1
  4.005s    157MB/s   1255Mbps  31.4Mi  31.2Mi  16.0Mi    2048Mi  100.0ms/5
  5.042s    158MB/s   1264Mbps  31.4Mi  31.2Mi  16.0Mi    2048Mi  100.0ms/16
  6.020s    159MB/s   1275Mbps  31.4Mi  31.2Mi  16.0Mi    2048Mi  100.0ms/15
  7.016s    154MB/s   1229Mbps  31.4Mi  31.2Mi  16.0Mi    2048Mi  100.0ms/2
  8.055s    158MB/s   1261Mbps  31.4Mi  31.2Mi  16.0Mi    2048Mi  100.0ms/0
  9.019s    164MB/s   1314Mbps  31.4Mi  31.2Mi  16.0Mi    2048Mi  100.0ms/15
 10.034s    151MB/s   1206Mbps  31.4Mi  31.2Mi  16.0Mi    2048Mi  100.0ms/14
Transferred 1425MBytes in 10.134s, 10869 syscalls, 131072.0 Bytes/syscall

Throughput is limited by sndbuf: 100ms * 160MB/s = 16MB, which matches the 16MB send buffer.

[figure: sndbuf, sndbuf-limited trace]

Rwnd is only half filled, in a burst.

Higher throughput is achieved with a larger sndbuf.

mininet> h1 sysctl -w net.ipv4.tcp_wmem="10240 131072 65536000"
net.ipv4.tcp_wmem = 10240 131072 65536000

mininet> h1 bin/tcpperf -c h2
Connected 10.0.0.1:53176 -> 10.0.0.2:2009, congestion control: cubic
Time (s)  Throughput   Bitrate    Cwnd    Rwnd  sndbuf  ssthresh  rtt/var
  0.000s   0.00kB/s   0.00kbps  14.1Ki  42.4Ki   128Ki    2048Mi  100.4ms/50205
  1.004s   32.1MB/s    257Mbps  7767Ki  11.0Mi  37.8Mi    2048Mi  100.5ms/47
  2.051s    249MB/s   1995Mbps  68.3Mi  24.9Mi  62.5Mi    2048Mi  100.7ms/134  retrans=585
  3.056s    260MB/s   2081Mbps  68.3Mi  24.9Mi  62.5Mi    2048Mi  100.6ms/147
  4.061s    260MB/s   2082Mbps  68.3Mi  21.9Mi  62.5Mi    2048Mi  100.5ms/23
  5.024s    274MB/s   2194Mbps  68.3Mi  24.3Mi  62.5Mi    2048Mi  100.2ms/64
  6.028s    313MB/s   2500Mbps  68.3Mi  26.4Mi  62.5Mi    2048Mi  100.1ms/58
  7.034s    325MB/s   2603Mbps  68.3Mi  29.6Mi  62.5Mi    2048Mi  100.1ms/73
  8.040s    325MB/s   2602Mbps  68.3Mi  28.9Mi  62.5Mi    2048Mi  100.1ms/40
  9.046s    326MB/s   2604Mbps  68.3Mi  29.7Mi  62.5Mi    2048Mi  100.2ms/112
 10.051s    325MB/s   2603Mbps  68.3Mi  27.9Mi  62.5Mi    2048Mi  100.1ms/100
Transferred 2703MBytes in 10.154s, 20625 syscalls, 131072.0 Bytes/syscall
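
Instead of raising tcp_wmem system-wide, the send buffer can also be enlarged per socket with SO_SNDBUF. Two Linux behaviors are worth keeping in mind: the kernel roughly doubles the requested value, and setting SO_SNDBUF explicitly pins the buffer (disabling send-buffer autotuning), with the result capped by net.core.wmem_max. A sketch:

```python
#!/usr/bin/env python3
# Sketch: enlarge the send buffer on one socket instead of via tcp_wmem.
# Setting SO_SNDBUF pins the buffer size for this socket; Linux roughly
# doubles the requested value and caps it at net.core.wmem_max.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 32 * 1024 * 1024)
print('sndbuf =', sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
sock.connect(('10.0.0.2', 2009))   # placeholder peer
```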

Small Cwnd

The congestion control algorithm decides the congestion window (Cwnd).

With RTT = 100ms, FreeBSD newreno CC sometimes increases Cwnd slowly.

In the following example, it takes about 30 seconds to reach the maximum throughput.

freebsd:~/recipes/tpc % bin/tcpperf -c 172.16.0.59 -b 100G -t 30
Connected 172.16.0.77:31839 -> 172.16.0.59:2009, congestion control: newreno
Time (s)  Throughput   Bitrate    Cwnd    Rwnd  sndbuf  ssthresh  rtt/var
  0.000s   0.00kB/s   0.00kbps  14.1Ki  63.6Ki  32.8Ki    1024Mi  202.0ms/101000
  1.104s    356kB/s   2849kbps  45.5Ki   435Ki  96.8Ki    1024Mi  123.8ms/42562
  2.007s    580kB/s   4643kbps  71.2Ki   589Ki   137Ki    1024Mi  107.5ms/13937
  3.010s    915kB/s   7318kbps  99.7Ki   593Ki   201Ki    1024Mi  102.3ms/3750
  4.014s   1176kB/s   9404kbps   128Ki   593Ki   257Ki    1024Mi  100.9ms/1187
  5.017s   1437kB/s   11.5Mbps   162Ki  1307Ki   321Ki    1024Mi  100.5ms/687
  6.021s   1829kB/s   14.6Mbps   191Ki  1307Ki   377Ki    1024Mi  100.4ms/562
  7.024s   2090kB/s   16.7Mbps   219Ki  1287Ki   441Ki    1024Mi  100.4ms/687
  8.028s   2481kB/s   19.9Mbps   248Ki  1266Ki   497Ki    1024Mi  100.5ms/687
  9.032s   2611kB/s   20.9Mbps   276Ki  1246Ki   553Ki    1024Mi  100.5ms/750
 10.035s   3005kB/s   24.0Mbps   311Ki  1820Ki   617Ki    1024Mi  100.4ms/500
 11.038s   3396kB/s   27.2Mbps   339Ki  1894Ki   681Ki    1024Mi  100.4ms/500
 12.042s   3526kB/s   28.2Mbps   368Ki  1919Ki   737Ki    1024Mi  100.5ms/625
 13.047s   4045kB/s   32.4Mbps   418Ki  1852Ki   857Ki    1024Mi  101.2ms/1375
 14.050s   4701kB/s   37.6Mbps   480Ki  1868Ki  1009Ki    1024Mi  100.4ms/562
 15.054s   5354kB/s   42.8Mbps   553Ki  2361Ki  1169Ki    1024Mi  101.3ms/2000
 16.039s   6255kB/s   50.0Mbps   641Ki  2681Ki  1377Ki    1024Mi  100.6ms/687
 17.002s   7212kB/s   57.7Mbps   727Ki  2696Ki  1561Ki    1024Mi  102.2ms/4062
 18.006s   8098kB/s   64.8Mbps   813Ki  2684Ki  1769Ki    1024Mi  100.7ms/687
 19.010s   8617kB/s   68.9Mbps   898Ki  2663Ki  1817Ki    1024Mi  100.5ms/625
 20.013s   9662kB/s   77.3Mbps  1004Ki  2914Ki  1817Ki    1024Mi  100.4ms/500
 21.017s   10.7MB/s   85.7Mbps  1118Ki  2893Ki  1817Ki    1024Mi  100.4ms/500
 22.021s   11.9MB/s   95.0Mbps  1235Ki  3010Ki  1817Ki    1024Mi  100.4ms/500
 23.025s   13.2MB/s    106Mbps  1369Ki  2989Ki  1817Ki    1024Mi  100.3ms/437
 24.029s   14.5MB/s    116Mbps  1497Ki  2916Ki  1817Ki    1024Mi  100.5ms/687
 25.033s   15.9MB/s    127Mbps  1634Ki  2930Ki  1817Ki    1024Mi  100.3ms/437
 26.037s   17.2MB/s    138Mbps  1796Ki  3030Ki  1817Ki    1024Mi  100.4ms/562
 27.001s   18.2MB/s    146Mbps  2010Ki  3066Ki  1817Ki    1024Mi  100.8ms/687
 28.005s   18.5MB/s    148Mbps  2191Ki  3026Ki  1817Ki    1024Mi  100.7ms/687
 29.008s   18.4MB/s    147Mbps  2419Ki  2986Ki  1817Ki    1024Mi  100.4ms/562
 30.012s   18.5MB/s    148Mbps  2619Ki  2992Ki  1817Ki    1024Mi  100.7ms/625
Transferred 234MBytes in 30.113s, 1787 syscalls, 131072.0 Bytes/syscall

Before Cwnd reaches sndbuf (1.8MB), throughput is dominated by Cwnd. For example, at the 20th second: Cwnd ≈ 1000KiB, so throughput ≈ 1000KiB / 0.1s ≈ 10MB/s.

After that, throughput is dominated by sndbuf in this case. For example, at the 30th second: sndbuf = 1.8MB, so throughput ≈ 1.8MB / 0.1s = 18MB/s.

See the discussion on the freebsd-net mailing list (2023-05), https://lists.freebsd.org/archives/freebsd-net/2023-May/003282.html, and Slow Start.
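
On Linux, the congestion control algorithm can also be chosen per socket with the TCP_CONGESTION option (tcpperf prints it as "congestion control: cubic" above). A sketch; which algorithms are available depends on net.ipv4.tcp_available_congestion_control and the loaded kernel modules:

```python
#!/usr/bin/env python3
# Sketch: query and change the congestion control algorithm of one socket
# (Linux, Python >= 3.6). 'cubic'/'bbr' availability depends on the kernel.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
name = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16)
print('default:', name.rstrip(b'\x00').decode())
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, b'cubic')
sock.connect(('10.0.0.2', 2009))   # placeholder peer
```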

Bandwidth limit

mininet> s1 tc qdisc replace dev s1-eth2 root netem delay 10ms rate 10Mbit

mininet> h1 bin/tcpperf -c h2
Connected 10.0.0.1:48982 -> 10.0.0.2:2009, congestion control: cubic
Time (s)  Throughput   Bitrate    Cwnd    Rwnd  sndbuf  ssthresh  rtt/var
  0.000s   0.00kB/s   0.00kbps  14.1Ki  42.4Ki   128Ki    2048Mi  10.7ms/5341
  1.120s   1638kB/s   13.1Mbps   132Ki   323Ki   790Ki    70.7Ki  106.2ms/574
  2.094s   1481kB/s   11.8Mbps   189Ki   416Ki  1139Ki    70.7Ki  152.8ms/1331
  3.018s   1418kB/s   11.3Mbps   243Ki   535Ki  1462Ki    70.7Ki  196.7ms/1030
  4.226s   1411kB/s   11.3Mbps   314Ki   680Ki  1887Ki    70.7Ki  254.5ms/1929
  5.196s   1351kB/s   10.8Mbps   370Ki   795Ki  2227Ki    70.7Ki  300.2ms/813
  6.357s   1354kB/s   10.8Mbps   438Ki   955Ki  2635Ki    70.7Ki  355.6ms/856
  7.009s   1408kB/s   11.3Mbps   475Ki  1021Ki  2856Ki    70.7Ki  386.2ms/374
  8.444s   1461kB/s   11.7Mbps   560Ki  1213Ki  3366Ki    70.7Ki  455.5ms/1334
  9.281s   1409kB/s   11.3Mbps   608Ki  1326Ki  3655Ki    70.7Ki  494.9ms/801
 10.160s   1342kB/s   10.7Mbps   660Ki  1399Ki  3970Ki    70.7Ki  537.0ms/1078
Transferred 14.5MBytes in 12.180s, 111 syscalls, 131072.0 Bytes/syscall

[figure: bandwidth, bandwidth-limited trace]

This is probably the best case, as all available bandwidth is utilized.
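
For reference, the bandwidth-delay product of this emulated link is tiny compared to the congestion window, and the steadily growing rtt column (from ~10ms up to ~500ms) suggests the excess data is simply queued inside netem. A back-of-the-envelope check (my own, using the numbers from the run above):

```python
# BDP of the emulated link: 10 Mbit/s bandwidth, ~10 ms base RTT.
bandwidth = 10e6 / 8          # bytes/second, from `netem rate 10Mbit`
base_rtt = 0.010              # seconds, initial rtt before the queue builds up
bdp = bandwidth * base_rtt
print('BDP ~= %.1f KB' % (bdp / 1e3))           # ~12.5 KB
# With Cwnd around 660 KiB at t=10s, most of the window sits in the netem
# queue, which is what inflates the measured RTT to hundreds of ms.
print('queued ~= %.0f KB' % ((660 * 1024 - bdp) / 1e3))
```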

Performance

According to Understanding Host Network Stack Overheads (SIGCOMM'21), the modern Linux network stack can achieve ~42Gbps throughput per core. In other words, a single TCP connection can sustain a 40Gbps NIC unidirectionally, but not a 100Gbps NIC.

Eric reported a 170Gbps single flow on a 200Gbps NIC at netdev conf (2022-10), with receiver zero copy and BIG TCP.