|
@@ -1,102 +1,307 @@
|
|
|
-The existing interfaces for getting network packages time stamped are:
|
|
|
+
|
|
|
+1. Control Interfaces
|
|
|
+
|
|
|
+The interfaces for receiving network packet timestamps are:
|
|
|
|
|
|
* SO_TIMESTAMP
|
|
|
- Generate time stamp for each incoming packet using the (not necessarily
|
|
|
- monotonous!) system time. Result is returned via recv_msg() in a
|
|
|
- control message as timeval (usec resolution).
|
|
|
+ Generates a timestamp for each incoming packet in (not necessarily
|
|
|
+ monotonic) system time. Reports the timestamp via recvmsg() in a
|
|
|
+ control message as struct timeval (usec resolution).
|
|
|
|
|
|
* SO_TIMESTAMPNS
|
|
|
- Same time stamping mechanism as SO_TIMESTAMP, but returns result as
|
|
|
- timespec (nsec resolution).
|
|
|
+ Same timestamping mechanism as SO_TIMESTAMP, but reports the
|
|
|
+ timestamp as struct timespec (nsec resolution).
|
|
|
|
|
|
* IP_MULTICAST_LOOP + SO_TIMESTAMP[NS]
|
|
|
- Only for multicasts: approximate send time stamp by receiving the looped
|
|
|
- packet and using its receive time stamp.
|
|
|
+ Only for multicast: approximate transmit timestamp obtained by
|
|
|
+ reading the looped packet receive timestamp.
|
|
|
|
|
|
-The following interface complements the existing ones: receive time
|
|
|
-stamps can be generated and returned for arbitrary packets and much
|
|
|
-closer to the point where the packet is really sent. Time stamps can
|
|
|
-be generated in software (as before) or in hardware (if the hardware
|
|
|
-has such a feature).
|
|
|
+* SO_TIMESTAMPING
|
|
|
+ Generates timestamps on reception, transmission or both. Supports
|
|
|
+ multiple timestamp sources, including hardware. Supports generating
|
|
|
+ timestamps for stream sockets.
|
|
|
|
|
|
-SO_TIMESTAMPING:
|
|
|
|
|
|
-Instructs the socket layer which kind of information should be collected
|
|
|
-and/or reported. The parameter is an integer with some of the following
|
|
|
-bits set. Setting other bits is an error and doesn't change the current
|
|
|
-state.
|
|
|
+1.1 SO_TIMESTAMP:
|
|
|
|
|
|
-Four of the bits are requests to the stack to try to generate
|
|
|
-timestamps. Any combination of them is valid.
|
|
|
+This socket option enables timestamping of datagrams on the reception
|
|
|
+path. Because the destination socket, if any, is not known early in
|
|
|
+the network stack, the feature has to be enabled for all packets. The
|
|
|
+same is true for all early receive timestamp options.
|
|
|
|
|
|
-SOF_TIMESTAMPING_TX_HARDWARE: try to obtain send time stamps in hardware
|
|
|
-SOF_TIMESTAMPING_TX_SOFTWARE: try to obtain send time stamps in software
|
|
|
-SOF_TIMESTAMPING_RX_HARDWARE: try to obtain receive time stamps in hardware
|
|
|
-SOF_TIMESTAMPING_RX_SOFTWARE: try to obtain receive time stamps in software
|
|
|
+For interface details, see `man 7 socket`.
|
|
|
+
|
|
|
+
|
|
|
+1.2 SO_TIMESTAMPNS:
|
|
|
+
|
|
|
+This option is identical to SO_TIMESTAMP except for the returned data type.
|
|
|
+Its struct timespec allows for higher resolution (ns) timestamps than the
|
|
|
+timeval of SO_TIMESTAMP (us).
|
|
|
+
|
|
|
+
|
|
|
+1.3 SO_TIMESTAMPING:
|
|
|
+
|
|
|
+Supports multiple types of timestamp requests. As a result, this
|
|
|
+socket option takes a bitmap of flags, not a boolean. In
|
|
|
+
|
|
|
+ err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, (void *) &val, sizeof(val));
|
|
|
+
|
|
|
+val is an integer with any of the following bits set. Setting other
|
|
|
+bits returns EINVAL and does not change the current state.
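A minimal sketch of the call described above, assuming a Linux system
where <linux/net_tstamp.h> provides the SOF_TIMESTAMPING_* flags. The
helper name enable_timestamping() is hypothetical and used only for
illustration.

```c
#include <sys/socket.h>
#include <linux/net_tstamp.h>

/* Hypothetical helper: set the SO_TIMESTAMPING bitmap on a socket.
 * Returns 0 on success and -1 with errno set on failure. */
static int enable_timestamping(int fd, int flags)
{
	int val = flags;

	/* Pass a pointer to the integer and its size, as for any
	 * integer-valued socket option. */
	return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));
}
```

A caller interested in software receive timestamps might pass
SOF_TIMESTAMPING_RX_SOFTWARE | SOF_TIMESTAMPING_SOFTWARE as flags.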
|
|
|
|
|
|
-The other three bits control which timestamps will be reported in a
|
|
|
-generated control message. If none of these bits are set or if none of
|
|
|
-the set bits correspond to data that is available, then the control
|
|
|
-message will not be generated:
|
|
|
|
|
|
-SOF_TIMESTAMPING_SOFTWARE: report systime if available
|
|
|
-SOF_TIMESTAMPING_SYS_HARDWARE: report hwtimetrans if available (deprecated)
|
|
|
-SOF_TIMESTAMPING_RAW_HARDWARE: report hwtimeraw if available
|
|
|
+1.3.1 Timestamp Generation
|
|
|
|
|
|
-It is worth noting that timestamps may be collected for reasons other
|
|
|
-than being requested by a particular socket with
|
|
|
-SOF_TIMESTAMPING_[TR]X_(HARD|SOFT)WARE. For example, most drivers that
|
|
|
-can generate hardware receive timestamps ignore
|
|
|
-SOF_TIMESTAMPING_RX_HARDWARE. It is still a good idea to set that flag
|
|
|
-in case future drivers pay attention.
|
|
|
+Some bits are requests to the stack to try to generate timestamps. Any
|
|
|
+combination of them is valid. Changes to these bits apply to newly
|
|
|
+created packets, not to packets already in the stack. As a result, it
|
|
|
+is possible to selectively request timestamps for a subset of packets
|
|
|
+(e.g., for sampling) by embedding a send() call within two setsockopt
|
|
|
+calls, one to enable timestamp generation and one to disable it.
|
|
|
+Timestamps may also be generated for reasons other than being
|
|
|
+requested by a particular socket, such as when receive timestamping is
|
|
|
+enabled system wide, as explained earlier.
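The sampling pattern described above could be sketched as follows. The
helper name send_sampled() is hypothetical. The reporting bit
SOF_TIMESTAMPING_SOFTWARE is left set after the send, on the assumption
that clearing it before the timestamp is reported could suppress the
report.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <linux/net_tstamp.h>

/* Hypothetical helper: send one buffer with a software tx timestamp
 * requested, leaving later send() calls on the socket untimestamped.
 * Returns the send() result, or -1 if a setsockopt() call fails. */
static ssize_t send_sampled(int fd, const void *buf, size_t len)
{
	int on = SOF_TIMESTAMPING_TX_SOFTWARE | SOF_TIMESTAMPING_SOFTWARE;
	int off = SOF_TIMESTAMPING_SOFTWARE;	/* keep reporting enabled */
	ssize_t ret;

	if (setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &on, sizeof(on)))
		return -1;

	ret = send(fd, buf, len, 0);

	/* Clear the generation bit again. The packet queued above keeps
	 * its request because the change only applies to new packets. */
	if (setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &off, sizeof(off)))
		return -1;

	return ret;
}
```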
|
|
|
|
|
|
-If timestamps are reported, they will appear in a control message with
|
|
|
-cmsg_level==SOL_SOCKET, cmsg_type==SO_TIMESTAMPING, and a payload like
|
|
|
-this:
|
|
|
+SOF_TIMESTAMPING_RX_HARDWARE:
|
|
|
+ Request rx timestamps generated by the network adapter.
|
|
|
+
|
|
|
+SOF_TIMESTAMPING_RX_SOFTWARE:
|
|
|
+ Request rx timestamps when data enters the kernel. These timestamps
|
|
|
+ are generated just after a device driver hands a packet to the
|
|
|
+ kernel receive stack.
|
|
|
+
|
|
|
+SOF_TIMESTAMPING_TX_HARDWARE:
|
|
|
+ Request tx timestamps generated by the network adapter.
|
|
|
+
|
|
|
+SOF_TIMESTAMPING_TX_SOFTWARE:
|
|
|
+ Request tx timestamps when data leaves the kernel. These timestamps
|
|
|
+ are generated in the device driver as close as possible, but always
|
|
|
+ prior to, passing the packet to the network interface. Hence, they
|
|
|
+ require driver support and may not be available for all devices.
|
|
|
+
|
|
|
+SOF_TIMESTAMPING_TX_SCHED:
|
|
|
+ Request tx timestamps prior to entering the packet scheduler. Kernel
|
|
|
+ transmit latency is, if long, often dominated by queuing delay. The
|
|
|
+ difference between this timestamp and one taken at
|
|
|
+ SOF_TIMESTAMPING_TX_SOFTWARE will expose this latency independent
|
|
|
+ of protocol processing. The latency incurred in protocol
|
|
|
+ processing, if any, can be computed by subtracting a userspace
|
|
|
+ timestamp taken immediately before send() from this timestamp. On
|
|
|
+ machines with virtual devices where a transmitted packet travels
|
|
|
+ through multiple devices and, hence, multiple packet schedulers,
|
|
|
+ a timestamp is generated at each layer. This allows for fine
|
|
|
+ grained measurement of queuing delay.
|
|
|
+
|
|
|
+SOF_TIMESTAMPING_TX_ACK:
|
|
|
+ Request tx timestamps when all data in the send buffer has been
|
|
|
+ acknowledged. This only makes sense for reliable protocols. It is
|
|
|
+ currently only implemented for TCP. For that protocol, it may
|
|
|
+ over-report measurement, because the timestamp is generated when all
|
|
|
+ data up to and including the buffer at send() was acknowledged: the
|
|
|
+ cumulative acknowledgment. The mechanism ignores SACK and FACK.
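The latency computations mentioned for SOF_TIMESTAMPING_TX_SCHED reduce
to subtracting two struct timespec values. A minimal sketch, with the
hypothetical helper name timespec_diff_ns():

```c
#include <stdint.h>
#include <time.h>

/* Return (later - earlier) in nanoseconds. The result is negative if
 * the arguments are passed in the wrong order. */
static int64_t timespec_diff_ns(const struct timespec *later,
				const struct timespec *earlier)
{
	return (int64_t)(later->tv_sec - earlier->tv_sec) * 1000000000LL +
	       (later->tv_nsec - earlier->tv_nsec);
}
```

Under this sketch, queuing delay would be timespec_diff_ns(&snd, &sched),
where snd and sched are the timestamps reported for
SOF_TIMESTAMPING_TX_SOFTWARE and SOF_TIMESTAMPING_TX_SCHED for the same
packet.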
|
|
|
+
|
|
|
+
|
|
|
+1.3.2 Timestamp Reporting
|
|
|
+
|
|
|
+The other three bits control which timestamps will be reported in a
|
|
|
+generated control message. Changes to the bits take immediate
|
|
|
+effect at the timestamp reporting locations in the stack. Timestamps
|
|
|
+are only reported for packets that also have the relevant timestamp
|
|
|
+generation request set.
|
|
|
+
|
|
|
+SOF_TIMESTAMPING_SOFTWARE:
|
|
|
+ Report any software timestamps when available.
|
|
|
+
|
|
|
+SOF_TIMESTAMPING_SYS_HARDWARE:
|
|
|
+ This option is deprecated and ignored.
|
|
|
+
|
|
|
+SOF_TIMESTAMPING_RAW_HARDWARE:
|
|
|
+ Report hardware timestamps as generated by
|
|
|
+ SOF_TIMESTAMPING_TX_HARDWARE when available.
|
|
|
+
|
|
|
+
|
|
|
+1.3.3 Timestamp Options
|
|
|
+
|
|
|
+The interface supports one option:
|
|
|
+
|
|
|
+SOF_TIMESTAMPING_OPT_ID:
|
|
|
+
|
|
|
+ Generate a unique identifier along with each packet. A process can
|
|
|
+ have multiple concurrent timestamping requests outstanding. Packets
|
|
|
+ can be reordered in the transmit path, for instance in the packet
|
|
|
+ scheduler. In that case timestamps will be queued onto the error
|
|
|
+ queue out of order from the original send() calls. This option
|
|
|
+ embeds a counter that is incremented at send() time, to order
|
|
|
+ timestamps within a flow.
|
|
|
+
|
|
|
+ This option is implemented only for transmit timestamps. There, the
|
|
|
+ timestamp is always looped along with a struct sock_extended_err.
|
|
|
+ The option modifies field ee_data to pass an id that is unique
|
|
|
+ among all possibly concurrently outstanding timestamp requests for
|
|
|
+ that socket. In practice, it is a monotonically increasing u32
|
|
|
+ (that wraps).
|
|
|
+
|
|
|
+ In datagram sockets, the counter increments on each send call. In
|
|
|
+ stream sockets, it increments with every byte.
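Because the identifier is a u32 that wraps, a process that orders
timestamps by id needs a wrap-tolerant comparison. One possible sketch
uses serial-number arithmetic. The helper name tskey_before() is
hypothetical, and the cast relies on the two's-complement conversion
that GCC and Clang implement.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if id a was issued before id b, tolerating u32 wraparound.
 * Only meaningful while the two ids are less than 2^31 apart. */
static bool tskey_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}
```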
|
|
|
+
|
|
|
+
|
|
|
+1.4 Bytestream Timestamps
|
|
|
+
|
|
|
+The SO_TIMESTAMPING interface supports timestamping of bytes in a
|
|
|
+bytestream. Each request is interpreted as a request for when the
|
|
|
+entire contents of the buffer has passed a timestamping point. That
|
|
|
+is, for streams option SOF_TIMESTAMPING_TX_SOFTWARE will record
|
|
|
+when all bytes have reached the device driver, regardless of how
|
|
|
+many packets the data has been converted into.
|
|
|
+
|
|
|
+In general, bytestreams have no natural delimiters and therefore
|
|
|
+correlating a timestamp with data is non-trivial. A range of bytes
|
|
|
+may be split across segments, and any segments may be merged (possibly
|
|
|
+coalescing sections of previously segmented buffers associated with
|
|
|
+independent send() calls). Segments can be reordered and the same
|
|
|
+byte range can coexist in multiple segments for protocols that
|
|
|
+implement retransmissions.
|
|
|
+
|
|
|
+It is essential that all timestamps implement the same semantics,
|
|
|
+regardless of these possible transformations, as otherwise they are
|
|
|
+incomparable. Handling "rare" corner cases differently from the
|
|
|
+simple case (a 1:1 mapping from buffer to skb) is insufficient
|
|
|
+because performance debugging often needs to focus on such outliers.
|
|
|
+
|
|
|
+In practice, timestamps can be correlated with segments of a
|
|
|
+bytestream consistently, if both semantics of the timestamp and the
|
|
|
+timing of measurement are chosen correctly. This challenge is no
|
|
|
+different from deciding on a strategy for IP fragmentation. There, the
|
|
|
+definition is that only the first fragment is timestamped. For
|
|
|
+bytestreams, we chose that a timestamp is generated only when all
|
|
|
+bytes have passed a point. SOF_TIMESTAMPING_TX_ACK as defined is easy to
|
|
|
+implement and reason about. An implementation that has to take into
|
|
|
+account SACK would be more complex due to possible transmission holes
|
|
|
+and out of order arrival.
|
|
|
+
|
|
|
+On the host, TCP can also break the simple 1:1 mapping from buffer to
|
|
|
+skbuff as a result of Nagle, cork, autocork, segmentation and GSO. The
|
|
|
+implementation ensures correctness in all cases by tracking the
|
|
|
+individual last byte passed to send(), even if it is no longer the
|
|
|
+last byte after an skbuff extend or merge operation. It stores the
|
|
|
+relevant sequence number in skb_shinfo(skb)->tskey. Because an skbuff
|
|
|
+has only one such field, only one timestamp can be generated.
|
|
|
+
|
|
|
+In rare cases, a timestamp request can be missed if two requests are
|
|
|
+collapsed onto the same skb. A process can detect this situation by
|
|
|
+enabling SOF_TIMESTAMPING_OPT_ID and comparing the byte offset at
|
|
|
+send time with the value returned for each timestamp. It can prevent
|
|
|
+the situation by always flushing the TCP stack in between requests,
|
|
|
+for instance by enabling TCP_NODELAY and disabling TCP_CORK and
|
|
|
+autocork.
|
|
|
+
|
|
|
+These precautions ensure that the timestamp is generated only when all
|
|
|
+bytes have passed a timestamp point, assuming that the network stack
|
|
|
+itself does not reorder the segments. The stack indeed tries to avoid
|
|
|
+reordering. The one exception is under administrator control: it is
|
|
|
+possible to construct a packet scheduler configuration that delays
|
|
|
+segments from the same stream differently. Such a setup would be
|
|
|
+unusual.
|
|
|
+
|
|
|
+
|
|
|
+2. Data Interfaces
|
|
|
+
|
|
|
+Timestamps are read using the ancillary data feature of recvmsg().
|
|
|
+See `man 3 cmsg` for details of this interface. The socket manual
|
|
|
+page (`man 7 socket`) describes how timestamps generated with
|
|
|
+SO_TIMESTAMP and SO_TIMESTAMPNS records can be retrieved.
|
|
|
+
|
|
|
+
|
|
|
+2.1 SCM_TIMESTAMPING records
|
|
|
+
|
|
|
+These timestamps are returned in a control message with cmsg_level
|
|
|
+SOL_SOCKET, cmsg_type SCM_TIMESTAMPING, and payload of type
|
|
|
|
|
|
struct scm_timestamping {
|
|
|
- struct timespec systime;
|
|
|
- struct timespec hwtimetrans;
|
|
|
- struct timespec hwtimeraw;
|
|
|
+ struct timespec ts[3];
|
|
|
};
|
|
|
|
|
|
-recvmsg() can be used to get this control message for regular incoming
|
|
|
-packets. For send time stamps the outgoing packet is looped back to
|
|
|
-the socket's error queue with the send time stamp(s) attached. It can
|
|
|
-be received with recvmsg(flags=MSG_ERRQUEUE). The call returns the
|
|
|
-original outgoing packet data including all headers preprended down to
|
|
|
-and including the link layer, the scm_timestamping control message and
|
|
|
-a sock_extended_err control message with ee_errno==ENOMSG and
|
|
|
-ee_origin==SO_EE_ORIGIN_TIMESTAMPING. A socket with such a pending
|
|
|
-bounced packet is ready for reading as far as select() is concerned.
|
|
|
-If the outgoing packet has to be fragmented, then only the first
|
|
|
-fragment is time stamped and returned to the sending socket.
|
|
|
-
|
|
|
-All three values correspond to the same event in time, but were
|
|
|
-generated in different ways. Each of these values may be empty (= all
|
|
|
-zero), in which case no such value was available. If the application
|
|
|
-is not interested in some of these values, they can be left blank to
|
|
|
-avoid the potential overhead of calculating them.
|
|
|
-
|
|
|
-systime is the value of the system time at that moment. This
|
|
|
-corresponds to the value also returned via SO_TIMESTAMP[NS]. If the
|
|
|
-time stamp was generated by hardware, then this field is
|
|
|
-empty. Otherwise it is filled in if SOF_TIMESTAMPING_SOFTWARE is
|
|
|
-set.
|
|
|
-
|
|
|
-hwtimeraw is the original hardware time stamp. Filled in if
|
|
|
-SOF_TIMESTAMPING_RAW_HARDWARE is set. No assumptions about its
|
|
|
-relation to system time should be made.
|
|
|
-
|
|
|
-hwtimetrans is always zero. This field is deprecated. It used to hold
|
|
|
-hw timestamps converted to system time. Instead, expose the hardware
|
|
|
-clock device on the NIC directly as a HW PTP clock source, to allow
|
|
|
-time conversion in userspace and optionally synchronize system time
|
|
|
-with a userspace PTP stack such as linuxptp. For the PTP clock API,
|
|
|
-see Documentation/ptp/ptp.txt.
|
|
|
-
|
|
|
-
|
|
|
-SIOCSHWTSTAMP, SIOCGHWTSTAMP:
|
|
|
+The structure can return up to three timestamps. This is a legacy
|
|
|
+feature. Only one field is non-zero at any time. Most timestamps
|
|
|
+are passed in ts[0]. Hardware timestamps are passed in ts[2].
|
|
|
+
|
|
|
+ts[1] used to hold hardware timestamps converted to system time.
|
|
|
+Instead, expose the hardware clock device on the NIC directly as
|
|
|
+a HW PTP clock source, to allow time conversion in userspace and
|
|
|
+optionally synchronize system time with a userspace PTP stack such
|
|
|
+as linuxptp. For the PTP clock API, see Documentation/ptp/ptp.txt.
|
|
|
+
|
|
|
+2.1.1 Transmit timestamps with MSG_ERRQUEUE
|
|
|
+
|
|
|
+For transmit timestamps the outgoing packet is looped back to the
|
|
|
+socket's error queue with the send timestamp(s) attached. A process
|
|
|
+receives the timestamps by calling recvmsg() with flag MSG_ERRQUEUE
|
|
|
+set and with a msg_control buffer sufficiently large to receive the
|
|
|
+relevant metadata structures. The recvmsg call returns the original
|
|
|
+outgoing data packet with two ancillary messages attached.
|
|
|
+
|
|
|
+A message of cmsg_level SOL_IP(V6) and cmsg_type IP(V6)_RECVERR
|
|
|
+embeds a struct sock_extended_err. This defines the error type. For
|
|
|
+timestamps, the ee_errno field is ENOMSG. The other ancillary message
|
|
|
+will have cmsg_level SOL_SOCKET and cmsg_type SCM_TIMESTAMPING. This
|
|
|
+embeds the struct scm_timestamping.
|
|
|
+
|
|
|
+
|
|
|
+2.1.1.2 Timestamp types
|
|
|
+
|
|
|
+The semantics of the three struct timespec are defined by field
|
|
|
+ee_info in the extended error structure. It contains a value of
|
|
|
+type SCM_TSTAMP_* to define the actual timestamp passed in
|
|
|
+scm_timestamping.
|
|
|
+
|
|
|
+The SCM_TSTAMP_* types are 1:1 matches to the SOF_TIMESTAMPING_*
|
|
|
+control fields discussed previously, with one exception. For legacy
|
|
|
+reasons, SCM_TSTAMP_SND is equal to zero and can be set for both
|
|
|
+SOF_TIMESTAMPING_TX_HARDWARE and SOF_TIMESTAMPING_TX_SOFTWARE. It
|
|
|
+is the first if ts[2] is non-zero, the second otherwise, in which
|
|
|
+case the timestamp is stored in ts[0].
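The SCM_TSTAMP_SND disambiguation rule above could be sketched as
follows. The helper names ts_is_set() and snd_timestamp() are
hypothetical. struct scm_timestamping is assumed to come from
<linux/errqueue.h>.

```c
#include <stdbool.h>
#include <time.h>
#include <linux/errqueue.h>

/* A timestamp slot is unused when it is all zero. */
static bool ts_is_set(const struct timespec *ts)
{
	return ts->tv_sec != 0 || ts->tv_nsec != 0;
}

/* For an SCM_TSTAMP_SND record, return the slot that was filled in
 * and report through *hw whether it is a hardware timestamp. */
static const struct timespec *
snd_timestamp(const struct scm_timestamping *tss, bool *hw)
{
	*hw = ts_is_set(&tss->ts[2]);
	return *hw ? &tss->ts[2] : &tss->ts[0];
}
```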
|
|
|
+
|
|
|
+
|
|
|
+2.1.1.3 Fragmentation
|
|
|
+
|
|
|
+Fragmentation of outgoing datagrams is rare, but is possible, e.g., by
|
|
|
+explicitly disabling PMTU discovery. If an outgoing packet is fragmented,
|
|
|
+then only the first fragment is timestamped and returned to the sending
|
|
|
+socket.
|
|
|
+
|
|
|
+
|
|
|
+2.1.1.4 Packet Payload
|
|
|
+
|
|
|
+The calling application is often not interested in receiving the whole
|
|
|
+packet payload that it passed to the stack originally: the socket
|
|
|
+error queue mechanism is just a method to piggyback the timestamp on.
|
|
|
+In this case, the application can choose to read datagrams with a
|
|
|
+smaller buffer, possibly even of length 0. The payload is truncated
|
|
|
+accordingly. Until the process calls recvmsg() on the error queue,
|
|
|
+however, the full packet is queued, taking up budget from SO_RCVBUF.
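A sketch of reading the error queue while discarding the payload, using
a zero-length data buffer. The helper name read_errqueue_cmsgs() is
hypothetical. The caller supplies the control buffer.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: pop one entry from the socket error queue,
 * keeping only the ancillary data. On success returns 0 and leaves
 * the control messages reachable through *msg. Returns -1 with errno
 * set on failure (EAGAIN if the queue is empty). */
static int read_errqueue_cmsgs(int fd, struct msghdr *msg,
			       void *ctrl, size_t ctrl_len)
{
	memset(msg, 0, sizeof(*msg));

	/* No iovec is supplied, so the looped payload is truncated to
	 * zero bytes. */
	msg->msg_control = ctrl;
	msg->msg_controllen = ctrl_len;

	if (recvmsg(fd, msg, MSG_ERRQUEUE) < 0)
		return -1;

	return 0;
}
```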
|
|
|
+
|
|
|
+
|
|
|
+2.1.1.5 Blocking Read
|
|
|
+
|
|
|
+Reading from the error queue is always a non-blocking operation. To
|
|
|
+block waiting on a timestamp, use poll or select. poll() will return
|
|
|
+POLLERR in pollfd.revents if any data is ready on the error queue.
|
|
|
+There is no need to pass this flag in pollfd.events. This flag is
|
|
|
+ignored on request. See also `man 2 poll`.
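Blocking until a timestamp is available could be sketched with poll()
as below. The helper name wait_for_errqueue() is hypothetical. events
is left at zero because POLLERR is reported regardless.

```c
#include <poll.h>

/* Hypothetical helper: wait up to timeout_ms for data on the socket
 * error queue. Returns 1 if POLLERR was reported, 0 on timeout and
 * -1 if poll() itself failed. */
static int wait_for_errqueue(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = 0 };
	int ret = poll(&pfd, 1, timeout_ms);

	if (ret < 0)
		return -1;

	return (ret > 0 && (pfd.revents & POLLERR)) ? 1 : 0;
}
```

After this returns 1, the process would call recvmsg() with
MSG_ERRQUEUE to fetch the timestamp.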
|
|
|
+
|
|
|
+
|
|
|
+2.1.2 Receive timestamps
|
|
|
+
|
|
|
+On reception, there is no reason to read from the socket error queue.
|
|
|
+The SCM_TIMESTAMPING ancillary data is sent along with the packet data
|
|
|
+on a normal recvmsg(). Since this is not a socket error, it is not
|
|
|
+accompanied by a message SOL_IP(V6)/IP(V6)_RECVERR. In this case,
|
|
|
+the meaning of the three fields in struct scm_timestamping is
|
|
|
+implicitly defined. ts[0] holds a software timestamp if set, ts[1]
|
|
|
+is again deprecated and ts[2] holds a hardware timestamp if set.
|
|
|
+
|
|
|
+
|
|
|
+3. Hardware Timestamping configuration: SIOCSHWTSTAMP and SIOCGHWTSTAMP
|
|
|
|
|
|
Hardware time stamping must also be initialized for each device driver
|
|
|
that is expected to do hardware time stamping. The parameter is defined in
|
|
@@ -167,8 +372,7 @@ enum {
|
|
|
*/
|
|
|
};
|
|
|
|
|
|
-
|
|
|
-DEVICE IMPLEMENTATION
|
|
|
+3.1 Hardware Timestamping Implementation: Device Drivers
|
|
|
|
|
|
A driver which supports hardware time stamping must support the
|
|
|
SIOCSHWTSTAMP ioctl and update the supplied struct hwtstamp_config with
|