 .. SPDX-License-Identifier: GPL-2.0

 =============================
 Kernel Connection Multiplexor
 =============================

 Kernel Connection Multiplexor (KCM) is a mechanism that provides a message based
 interface over TCP for generic application protocols. With KCM an application
 can efficiently send and receive application protocol messages over TCP using
 datagram sockets.

 KCM implements an NxM multiplexor in the kernel as diagrammed below::

     +------------+   +------------+   +------------+   +------------+
     | KCM socket |   | KCM socket |   | KCM socket |   | KCM socket |
     +------------+   +------------+   +------------+   +------------+
 	|                 |               |                |
 	+-----------+     |               |     +----------+
 		    |     |               |     |
 		+----------------------------------+
 		|           Multiplexor            |
 		+----------------------------------+
 		    |   |           |           |  |
 	+---------+   |           |           |  ------------+
 	|             |           |           |              |
     +----------+  +----------+  +----------+  +----------+ +----------+
     |  Psock   |  |  Psock   |  |  Psock   |  |  Psock   | |  Psock   |
     +----------+  +----------+  +----------+  +----------+ +----------+
 	|              |           |            |             |
     +----------+  +----------+  +----------+  +----------+ +----------+
     | TCP sock |  | TCP sock |  | TCP sock |  | TCP sock | | TCP sock |
     +----------+  +----------+  +----------+  +----------+ +----------+

 KCM sockets
 ===========

 The KCM sockets provide the user interface to the multiplexor. All the KCM sockets
 bound to a multiplexor are considered to have equivalent function, and I/O
 operations in different sockets may be done in parallel without the need for
 synchronization between threads in userspace.

 Multiplexor
 ===========

 The multiplexor provides the message steering. In the transmit path, messages
 written on a KCM socket are sent atomically on an appropriate TCP socket.
 Similarly, in the receive path, messages are constructed on each TCP socket
 (Psock) and complete messages are steered to a KCM socket.

 TCP sockets & Psocks
 ====================

 TCP sockets may be bound to a KCM multiplexor. A Psock structure is allocated
 for each bound TCP socket; this structure holds the state for constructing
 messages on receive as well as other connection-specific information for KCM.

 Connected mode semantics
 ========================

 Each multiplexor assumes that all attached TCP connections are to the same
 destination and can use the different connections for load balancing when
 transmitting. The normal send and recv calls (including sendmmsg and recvmmsg)
 can be used to send and receive messages on the KCM socket.

 Socket types
 ============

 KCM supports SOCK_DGRAM and SOCK_SEQPACKET socket types.

 Message delineation
 -------------------

 Messages are sent over a TCP stream with some application protocol message
 format that typically includes a header which frames the messages. The length
 of a received message can be deduced from the application protocol header
 (often just a simple length field).

 A TCP stream must be parsed to determine message boundaries. Berkeley Packet
 Filter (BPF) is used for this. When attaching a TCP socket to a multiplexor a
 BPF program must be specified. The program is called at the start of receiving
 a new message and is given an skbuff that contains the bytes received so far.
 It parses the message header and returns the length of the message. Given this
 information, KCM will construct the message of the stated length and deliver it
 to a KCM socket.
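
 As a rough illustration of the parsing step, the following userspace sketch
 models what a length-prefix parser computes. It is not a BPF program. The
 4-byte big-endian length field, the function name ``frame_total_len`` and the
 convention of returning 0 when the header is incomplete are all assumptions
 made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace model of a length-prefix parser. Assumes (for illustration
 * only) a 4-byte big-endian length field that counts the payload bytes
 * following it. Returns the total message length including the header,
 * or 0 when fewer than 4 header bytes are available. The 0 return is a
 * convention of this sketch only.
 */
static size_t frame_total_len(const uint8_t *buf, size_t avail)
{
	uint32_t payload_len;

	if (avail < 4)
		return 0;

	payload_len = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
		      ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];

	return (size_t)payload_len + 4;
}
```

 A header of ``00 00 01 02`` therefore describes a message of 0x102 + 4 bytes
 in total.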

 TCP socket management
 ---------------------

 When a TCP socket is attached to a KCM multiplexor, data ready (POLLIN) and
 write space available (POLLOUT) events are handled by the multiplexor. If there
 is a state change (disconnection) or other error on a TCP socket, an error is
 posted on the TCP socket so that a POLLERR event happens and KCM discontinues
 using the socket. When the application gets the error notification for a
 TCP socket, it should unattach the socket from KCM and then handle the error
 condition (the typical response is to close the socket and create a new
 connection if necessary).

 KCM limits the maximum receive message size to be the size of the receive
 socket buffer on the attached TCP socket (the socket buffer size can be set by
 SO_RCVBUF). If the length of a new message reported by the BPF program is
 greater than this limit, a corresponding error (EMSGSIZE) is posted on the TCP
 socket. The BPF program may also enforce a maximum message size and report an
 error when it is exceeded.

 A timeout may be set for assembling messages on a receive socket. The timeout
 value is taken from the receive timeout of the attached TCP socket (this is set
 by SO_RCVTIMEO). If the timer expires before assembly is complete, an error
 (ETIMEDOUT) is posted on the socket.
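
 The two limits described above are ordinary socket options on the TCP socket,
 so an application would presumably set them before attaching the socket. A
 minimal sketch follows. The helper name ``tcp_set_assembly_limits`` and its
 parameters are hypothetical.

```c
#include <sys/socket.h>
#include <sys/time.h>

/*
 * Set the two TCP socket options that KCM consults for receive assembly.
 * rcvbuf_bytes and timeout_sec are values chosen by the caller.
 * Returns 0 on success, -1 on failure with errno set by setsockopt().
 */
static int tcp_set_assembly_limits(int tcpfd, int rcvbuf_bytes, long timeout_sec)
{
	struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };

	if (setsockopt(tcpfd, SOL_SOCKET, SO_RCVBUF,
		       &rcvbuf_bytes, sizeof(rcvbuf_bytes)) < 0)
		return -1;

	if (setsockopt(tcpfd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0)
		return -1;

	return 0;
}
```

 Linux may adjust the requested buffer size, so reading it back with
 getsockopt is advisable when the exact message size limit matters.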

 User interface
 ==============

 Creating a multiplexor
 ----------------------

 A new multiplexor and initial KCM socket are created by a socket call::

   socket(AF_KCM, type, protocol)

 - type is either SOCK_DGRAM or SOCK_SEQPACKET
 - protocol is KCMPROTO_CONNECTED
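
 A minimal sketch of the call with the headers it needs is shown below. The
 wrapper name ``kcm_create`` is hypothetical, and the fallback definition of
 ``AF_KCM`` is only there for C libraries whose headers do not provide it.

```c
#include <sys/socket.h>
#include <linux/kcm.h>		/* KCMPROTO_CONNECTED */

#ifndef AF_KCM
#define AF_KCM 41	/* value from linux/socket.h; older libcs may lack it */
#endif

/*
 * Create a multiplexor and its first KCM socket.
 * type is SOCK_DGRAM or SOCK_SEQPACKET.
 * Returns the new descriptor, or -1 with errno set.
 */
static int kcm_create(int type)
{
	return socket(AF_KCM, type, KCMPROTO_CONNECTED);
}
```

 The caller should check for -1, since the call fails when, for instance, the
 running kernel has no KCM support.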

 Cloning KCM sockets
 -------------------

 After the first KCM socket is created using the socket call as described
 above, additional sockets for the multiplexor can be created by cloning
 a KCM socket. This is accomplished by an ioctl on a KCM socket::

   /* From linux/kcm.h */
   struct kcm_clone {
 	int fd;
   };

   struct kcm_clone info;

   memset(&info, 0, sizeof(info));

   err = ioctl(kcmfd, SIOCKCMCLONE, &info);

   if (!err)
     newkcmfd = info.fd;

 Attach transport sockets
 ------------------------

 Transport sockets are attached to a multiplexor by calling an ioctl on one of
 the multiplexor's KCM sockets, e.g.::

   /* From linux/kcm.h */
   struct kcm_attach {
 	int fd;
 	int bpf_fd;
   };

   struct kcm_attach info;

   memset(&info, 0, sizeof(info));

   info.fd = tcpfd;
   info.bpf_fd = bpf_prog_fd;

   ioctl(kcmfd, SIOCKCMATTACH, &info);

 The kcm_attach structure contains:

   - fd: file descriptor for the TCP socket being attached
   - bpf_fd: file descriptor for the compiled BPF program that has been loaded

 Unattach transport sockets
 --------------------------

 Unattaching a transport socket from a multiplexor is straightforward. An
 "unattach" ioctl is done with the kcm_unattach structure as the argument::

   /* From linux/kcm.h */
   struct kcm_unattach {
 	int fd;
   };

   struct kcm_unattach info;

   memset(&info, 0, sizeof(info));

   info.fd = tcpfd;

   ioctl(kcmfd, SIOCKCMUNATTACH, &info);

 Disabling receive on KCM socket
 -------------------------------

 A setsockopt is used to disable or enable receiving on a KCM socket.
 When receive is disabled, any pending messages in the socket's
 receive buffer are moved to other sockets. This feature is useful
 if an application thread knows that it will be doing a lot of
 work on a request and won't be able to service new messages for a
 while. Example use::

   int val = 1;

   setsockopt(kcmfd, SOL_KCM, KCM_RECV_DISABLE, &val, sizeof(val));

 BPF programs for message delineation
 ------------------------------------

 BPF programs can be compiled using the BPF LLVM backend. For example,
 the BPF program for parsing Thrift is::

   #include "bpf.h" /* for __sk_buff */
   #include "bpf_helpers.h" /* for load_word intrinsic */

   SEC("socket_kcm")
   int bpf_prog1(struct __sk_buff *skb)
   {
        return load_word(skb, 0) + 4;
   }

   char _license[] SEC("license") = "GPL";

 Use in applications
 ===================

 KCM accelerates application layer protocols. Specifically, it allows
 applications to use a message based interface for sending and receiving
 messages. The kernel provides necessary assurances that messages are sent
 and received atomically. This relieves much of the burden applications have
 in mapping a message based protocol onto the TCP stream. KCM also makes
 application layer messages a unit of work in the kernel for the purposes of
 steering and scheduling, which in turn allows a simpler networking model in
 multithreaded applications.

 Configurations
 --------------

 In an Nx1 configuration, KCM logically provides multiple socket handles
 to the same TCP connection. This allows parallelism between I/O
 operations on the TCP socket (for instance copyin and copyout of data is
 parallelized). In an application, a KCM socket can be opened for each
 processing thread and added to an epoll set (similar to how SO_REUSEPORT
 is used to allow multiple listener sockets on the same port).

 In an MxN configuration, multiple connections are established to the
 same destination. These are used for simple load balancing.
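
 For the Nx1 case, one way to obtain a KCM socket per processing thread is to
 clone the initial socket repeatedly. The sketch below assumes that approach.
 The helper name ``kcm_clone_many`` is hypothetical, and ``linux/sockios.h``
 is included on the assumption that ``linux/kcm.h`` does not itself provide
 ``SIOCPROTOPRIVATE``.

```c
#include <string.h>
#include <sys/ioctl.h>
#include <linux/sockios.h>	/* SIOCPROTOPRIVATE, used by the KCM ioctls */
#include <linux/kcm.h>		/* struct kcm_clone, SIOCKCMCLONE */

/*
 * Clone kcmfd until 'want' additional KCM sockets exist, storing the new
 * descriptors in out[]. Stops at the first failing ioctl.
 * Returns the number of sockets actually created.
 */
static int kcm_clone_many(int kcmfd, int *out, int want)
{
	int n;

	for (n = 0; n < want; n++) {
		struct kcm_clone info;

		memset(&info, 0, sizeof(info));
		if (ioctl(kcmfd, SIOCKCMCLONE, &info) < 0)
			break;
		out[n] = info.fd;
	}

	return n;
}
```

 Each returned descriptor can then be handed to its own thread. A return value
 smaller than ``want`` means that errno describes why cloning stopped.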

 Message batching
 ----------------

 The primary purpose of KCM is load balancing between KCM sockets and hence
 threads in a nominal use case. Perfect load balancing, that is steering
 each received message to a different KCM socket or steering each sent
 message to a different TCP socket, can negatively impact performance
 since this doesn't allow for affinities to be established. Balancing
 based on groups, or batches of messages, can be beneficial for performance.

 On transmit, there are three ways an application can batch (pipeline)
 messages on a KCM socket.

   1) Send multiple messages in a single sendmmsg.
   2) Send a group of messages each with a sendmsg call, where all messages
      except the last have MSG_BATCH in the flags of sendmsg call.
   3) Create a "super message" composed of multiple messages and send this
      with a single sendmsg.
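
 The third option can be sketched in plain C as follows. The framing (a 4-byte
 big-endian payload length before each payload) and the helper name
 ``super_msg_append`` are assumptions made for this example. A real
 application would use whatever header its protocol and the peer's parser
 expect.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Append one framed message (4-byte big-endian payload length, then the
 * payload) to buf at offset *off. This framing is an assumption of the
 * sketch. Returns 0 on success, -1 if the message does not fit in cap bytes.
 */
static int super_msg_append(uint8_t *buf, size_t cap, size_t *off,
			    const void *payload, uint32_t len)
{
	if (*off > cap || cap - *off < 4 || cap - *off - 4 < len)
		return -1;

	buf[*off + 0] = (uint8_t)(len >> 24);
	buf[*off + 1] = (uint8_t)(len >> 16);
	buf[*off + 2] = (uint8_t)(len >> 8);
	buf[*off + 3] = (uint8_t)len;
	memcpy(buf + *off + 4, payload, len);
	*off += 4 + (size_t)len;

	return 0;
}
```

 After the last append, the first ``off`` bytes of ``buf`` would be passed to
 a single send or sendmsg call on the KCM socket.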

 On receive, the KCM module attempts to queue messages received on the
 same KCM socket during each TCP ready callback. The targeted KCM socket
 changes at each receive ready callback on the KCM socket. The application
 does not need to configure this.

 Error handling
 --------------

 An application should include a thread to monitor errors raised on
 the TCP connection. Normally, this will be done by placing each
 TCP socket attached to a KCM multiplexor in an epoll set for the POLLERR
 event. If an error occurs on an attached TCP socket, KCM sets EPIPE
 on the socket, thus waking up the application thread. When the application
 sees the error (which may just be a disconnect) it should unattach the
 socket from KCM and then close it. It is assumed that once an error is
 posted on the TCP socket the data stream is unrecoverable (i.e. an error
 may have occurred in the middle of receiving a message).

 TCP connection monitoring
 -------------------------

 In KCM there is no means to correlate a message to the TCP socket that
 was used to send or receive the message (except in the case there is
 only one attached TCP socket). However, the application does retain
 an open file descriptor to the socket so it will be able to get statistics
 from the socket which can be used in detecting issues (such as high
 retransmissions on the socket).
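
 One statistic of this kind is the retransmission counter exposed through the
 TCP_INFO socket option. The sketch below reads it. The helper name
 ``tcp_total_retrans`` is hypothetical, and the choice of
 ``tcpi_total_retrans`` as the metric is an assumption of this example.

```c
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>		/* IPPROTO_TCP */
#include <linux/tcp.h>		/* struct tcp_info, TCP_INFO */

/*
 * Read the total retransmission count of a TCP socket.
 * Returns 0 and fills *retrans on success, -1 with errno set otherwise.
 */
static int tcp_total_retrans(int tcpfd, unsigned int *retrans)
{
	struct tcp_info ti;
	socklen_t len = sizeof(ti);

	memset(&ti, 0, sizeof(ti));
	if (getsockopt(tcpfd, IPPROTO_TCP, TCP_INFO, &ti, &len) < 0)
		return -1;

	*retrans = ti.tcpi_total_retrans;
	return 0;
}
```

 Polling this value periodically for each attached TCP socket gives a simple
 per-connection health signal.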