Kernel changes
5.6 - 2020-03-29
- Adds
pidfd_getfd(2)
syscall. Grabbing file descriptors with pidfd_getfd()
5.4 - 2019-11-24
waitid()
syscall supportsP_PIDFD
flag. Adding the pidfd abstraction to the kernelsnd_wnd
added toTCP_INFO
. netdev thread. Ubuntu 20.04 and Debian 11 have it. iperf3 shows it in 3.10.SOMAXCONN
increased from 128 to 4096. commit, https://lkml.org/lkml/2019/11/13/510
5.3 - 2019-09-15
- Adds new
pidfd_open(2)
syscall. New system calls: pidfd_open() and close_range()
5.2 - 2019-07-07
- Adds the
CLONE_PIDFD
flag toclone(2)
. Rethinking race-free process signaling
5.1 - 2019-05-05
- Adds
pidfd_send_signal(2)
syscall. Toward race-free process signaling
4.17 - 2018-06-03
- Kernel TLS receive path
4.15 - 2018-01-28
- tcp: implement rb-tree based retransmit queue commit 75c119afe14f7
diff --git a/include/net/sock.h b/include/net/sock.h
index a6b9a8d1a6df..4827094f1db4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -397,7 +397,10 @@ struct sock {
int sk_wmem_queued;
refcount_t sk_wmem_alloc;
unsigned long sk_tsq_flags;
- struct sk_buff *sk_send_head;
+ union {
+ struct sk_buff *sk_send_head; // Front of stuff to transmit
+ struct rb_root tcp_rtx_queue; // TCP re-transmit queue
+ };
struct sk_buff_head sk_write_queue; // Packet sending queue
__s32 sk_peek_off;
int sk_write_pending;
4.13 - 2017-09-03
- TLS in the kernel, sending only.
4.12 - 2017-07-02
- Removed
net.ipv4.tcp_tw_recycle
option from Kernel commit. Ref. Coping with the TCP TIME-WAIT state on busy Linux servers by Vincent Bernat.
4.10 - 2017-02-19
- tcp: sender chronographs instrumentation
- tcp: drop SYN packets if accept queue is full commit.
4.9 - 2016-12-11
- BBR congestion control algorithm
- tcp: use an RB tree for ooo receive queue commit 9f5afeae51
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 7be9b1242354..c723a465125d 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -281,10 +281,9 @@ struct tcp_sock {
struct sk_buff* lost_skb_hint;
struct sk_buff *retransmit_skb_hint;
- /* OOO segments go in this list. Note that socket lock must be held,
- * as we do not use sk_buff_head lock.
- */
- struct sk_buff_head out_of_order_queue;
+ /* OOO segments go in this rbtree. Socket lock must be held. */
+ struct rb_root out_of_order_queue;
+ struct sk_buff *ooo_last_skb; /* cache rb_last(out_of_order_queue) */
4.8 - 2016-10-02
4.6 - 2016-05-15
- Faster SO_REUSEPORT for TCP
4.5 - 2016-03-13
- Faster SO_REUSEPORT for UDP
- Better epoll multithread scalability
4.4 - 2016-01-10
- TCP scalability
- Lockless TCP listener
- SO_INCOMING_CPU
- TCP_NEW_SYN_RECV
commit ca6fb06518836ef9b65dc0aac02ff97704d52a05
Author: Eric Dumazet <edumazet@google.com>
Date: Fri Oct 2 11:43:35 2015 -0700
tcp: attach SYNACK messages to request sockets instead of listener
If a listen backlog is very big (to avoid syncookies), then
the listener sk->sk_wmem_alloc is the main source of false
sharing, as we need to touch it twice per SYNACK re-transmit
and TX completion.
(One SYN packet takes listener lock once, but up to 6 SYNACK
are generated)
By attaching the skb to the request socket, we remove this
source of contention.
Tested:
listen(fd, 10485760); // single listener (no SO_REUSEPORT)
16 RX/TX queue NIC
Sustain a SYNFLOOD attack of ~320,000 SYN per second,
Sending ~1,400,000 SYNACK per second.
Perf profiles now show listener spinlock being next bottleneck.
20.29% [kernel] [k] queued_spin_lock_slowpath
10.06% [kernel] [k] __inet_lookup_established
5.12% [kernel] [k] reqsk_timer_handler
3.22% [kernel] [k] get_next_timer_interrupt
3.00% [kernel] [k] tcp_make_synack
2.77% [kernel] [k] ipt_do_table
2.70% [kernel] [k] run_timer_softirq
2.50% [kernel] [k] ip_finish_output
2.04% [kernel] [k] cascade
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 079096f103faca2dd87342cca6f23d4b34da8871
Author: Eric Dumazet <edumazet@google.com>
Date: Fri Oct 2 11:43:32 2015 -0700
tcp/dccp: install syn_recv requests into ehash table
In this patch, we insert request sockets into TCP/DCCP
regular ehash table (where ESTABLISHED and TIMEWAIT sockets
are) instead of using the per listener hash table.
ACK packets find SYN_RECV pseudo sockets without having
to find and lock the listener.
In nominal conditions, this halves pressure on listener lock.
Note that this will allow for SO_REUSEPORT refinements,
so that we can select a listener using cpu/numa affinities instead
of the prior 'consistent hash', since only SYN packets will
apply this selection logic.
We will shrink listen_sock in the following patch to ease
code review.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ying Cai <ycai@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4.3 - 2015-11-01
- Adds direct Sockets syscalls to i386.
commit 9dea5dc921b5f4045a18c63eb92e84dc274d17eb
Author: Andy Lutomirski <luto@kernel.org>
Date: Tue Jul 14 15:24:24 2015 -0700
x86/entry/syscalls: Wire up 32-bit direct socket calls
On x86_64, there's no socketcall syscall; instead all of the
socket calls are real syscalls. For 32-bit programs, we're
stuck offering the socketcall syscall, but it would be nice to
expose the direct calls as well. This will enable seccomp to
filter socket calls (for new userspace only, but that's fine for
some applications) and it will provide a tiny performance boost.
glibc 2.23 2016-02-19. Ubuntu 16.04 supports it, Debian 8 doesn't.
https://github.com/bminor/glibc/commit/e5a5315e2d290fe34e0fb80996c713b8b802dcc9
commit e5a5315e2d290fe34e0fb80996c713b8b802dcc9
Author: Joseph Myers <joseph@codesourcery.com>
Date: Wed Dec 9 20:59:43 2015 +0000
Use direct socket syscalls for new kernels on i386, m68k, microblaze, sh.
Now that we have __ASSUME_* macros for direct socket syscalls to use
them instead of socketcall when they can be assumed to be available on
socketcall architectures, this patch defines those macros when
appropriate for i386, m68k, microblaze and sh (for 4.3, 4.3, all
supported kernels and 2.6.37, respectively; the only use of socketcall
support on microblaze is it allows accept4 and sendmmsg to be
supported on a wider range of kernel versions).
4.2 - 2015-08-30
commit 90c337da1524863838658078ec34241f45d8394d
Author: Eric Dumazet <edumazet@google.com>
Date: Sat Jun 6 21:17:57 2015 -0700
inet: add IP_BIND_ADDRESS_NO_PORT to overcome bind(0) limitations
When an application needs to force a source IP on an active TCP socket
it has to use bind(IP, port=x).
As most applications do not want to deal with already used ports, x is
often set to 0, meaning the kernel is in charge to find an available
port.
But kernel does not know yet if this socket is going to be a listener or
be connected.
It has very limited choices (no full knowledge of final 4-tuple for a
connect())
With limited ephemeral port range (about 32K ports), it is very easy to
fill the space.
This patch adds a new SOL_IP socket option, asking kernel to ignore
the 0 port provided by application in bind(IP, port=0) and only
remember the given IP address.
The port will be automatically chosen at connect() time, in a way
that allows sharing a source port as long as the 4-tuples are unique.
This new feature is available for both IPv4 and IPv6 (Thanks Neal)