@@ -85,7 +85,8 @@ Socket Interface
 
  bind(fd, &sockaddr_in, ...)
  This binds the socket to a local IP address and port, and a
- transport.
+ transport, if one has not already been selected via the
+ SO_RDS_TRANSPORT socket option.
 
  sendmsg(fd, ...)
  Sends a message to the indicated recipient. The kernel will
@@ -146,6 +147,20 @@ Socket Interface
  operation. In this case, it would use RDS_CANCEL_SENT_TO to
  nuke any pending messages.
 
+ setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)
+ getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)
+ Set or read an integer defining the underlying
+ encapsulating transport to be used for RDS packets on the
+ socket. When setting the option, the integer argument may be
+ one of RDS_TRANS_TCP or RDS_TRANS_IB. When retrieving the
+ value, RDS_TRANS_NONE will be returned on an unbound socket.
+ This socket option may only be set exactly once on the socket,
+ prior to binding it via the bind(2) system call. Attempts to
+ set SO_RDS_TRANSPORT on a socket for which the transport has
+ been previously attached explicitly (by SO_RDS_TRANSPORT) or
+ implicitly (via bind(2)) will return an error of EOPNOTSUPP.
+ An attempt to set SO_RDS_TRANSPORT to RDS_TRANS_NONE will
+ always return EINVAL.
 
  RDMA for RDS
  ============
@@ -350,4 +365,59 @@ The recv path
  handle CMSGs
  return to application
 
+Multipath RDS (mprds)
+=====================
+ Mprds is multipathed-RDS, primarily intended for RDS-over-TCP
+ (though the concept can be extended to other transports). The classical
+ implementation of RDS-over-TCP works by demultiplexing multiple
+ PF_RDS sockets between any 2 endpoints (where endpoint == [IP address,
+ port]) over a single TCP socket between the 2 IP addresses involved. This
+ has the limitation that it ends up funneling multiple RDS flows over a
+ single TCP flow, thus it
+ (a) is upper-bounded by the single-flow bandwidth, and
+ (b) suffers from head-of-line blocking for all the RDS sockets.
+
+ Better throughput (for a fixed small packet size, MTU) can be achieved
+ by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed
+ RDS (mprds). Each such TCP/IP flow constitutes a path for the rds/tcp
+ connection. RDS sockets will be attached to a path based on some hash
+ (e.g., of local address and RDS port number) and packets for that RDS
+ socket will be sent over the attached path using TCP to segment/reassemble
+ RDS datagrams on that path.
+
+ Multipathed RDS is implemented by splitting the struct rds_connection into
+ a common (to all paths) part, and a per-path struct rds_conn_path. All
+ I/O workqs and reconnect threads are driven from the rds_conn_path.
+ Transports such as TCP that are multipath capable may then set up a
+ TCP socket per rds_conn_path, and this is managed by the transport via
+ the transport private cp_transport_data pointer.
+
+ Transports announce themselves as multipath capable by setting the
+ t_mp_capable bit during registration with the rds core module. When the
+ transport is multipath-capable, rds_sendmsg() hashes outgoing traffic
+ across multiple paths. The outgoing hash is computed based on the
+ local address and port that the PF_RDS socket is bound to.
+
+ Additionally, even if the transport is MP capable, we may be
+ peering with some node that does not support mprds, or supports
+ a different number of paths. As a result, the peering nodes need
+ to agree on the number of paths to be used for the connection.
+ This is done by sending out a control packet exchange before the
+ first data packet. The control packet exchange must have completed
+ prior to outgoing hash completion in rds_sendmsg() when the transport
+ is multipath capable.
+
+ The control packet is an RDS ping packet (i.e., packet to rds dest
+ port 0) with the ping packet having an rds extension header option of
+ type RDS_EXTHDR_NPATHS, length 2 bytes, and the value is the
+ number of paths supported by the sender. The "probe" ping packet will
+ get sent from some reserved port, RDS_FLAG_PROBE_PORT (in <linux/rds.h>).
+ The receiver of a ping from RDS_FLAG_PROBE_PORT will thus immediately
+ be able to compute the min(sender_paths, rcvr_paths). The pong
+ sent in response to a probe-ping should contain the rcvr's npaths
+ when the rcvr is mprds-capable.
+
+ If the rcvr is not mprds-capable, the exthdr in the ping will be
+ ignored. In this case the pong will not have any exthdrs, so the sender
+ of the probe-ping can default to single-path mprds.
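
The path negotiation and outgoing hash described in the mprds text can be sketched in plain C as follows. This is an illustration only, not the kernel implementation: the function names mprds_negotiate_npaths() and mprds_pick_path(), the convention that a peer count of 0 means "no RDS_EXTHDR_NPATHS exthdr was received", and the multiplicative hash are all assumptions made for this example.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of paths both peers agree to use.
 * peer_npaths == 0 models a pong that carried no RDS_EXTHDR_NPATHS
 * extension header, i.e. a peer that is not mprds-capable.
 */
static unsigned int mprds_negotiate_npaths(unsigned int local_npaths,
                                           unsigned int peer_npaths)
{
    if (local_npaths == 0 || peer_npaths == 0)
        return 1; /* fall back to a single path */
    /* min(sender_paths, rcvr_paths) */
    return local_npaths < peer_npaths ? local_npaths : peer_npaths;
}

/*
 * Hypothetical hash: map the bound local (address, port) pair to a
 * path index in [0, npaths). The multiplier is an arbitrary odd
 * constant chosen for this sketch; the text only says "some hash".
 */
static unsigned int mprds_pick_path(uint32_t laddr, uint16_t lport,
                                    unsigned int npaths)
{
    uint32_t h = (laddr ^ lport) * 2654435761u;

    if (npaths <= 1)
        return 0; /* single path: nothing to choose */
    return (h >> 16) % npaths;
}
```

With these helpers, a sender supporting 8 paths that talks to a receiver advertising 4 would use 4 paths, and every packet from a given bound socket would map to the same path index, which keeps one socket's datagrams on one TCP flow.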