Recently I decided to look under the hood to see exactly how srtt (smoothed round-trip time) is calculated in Linux. The actual EWMA (Exponentially Weighted Moving Average) calculation is the straightforward part; what goes in as input to that calculation under various scenarios is more interesting, and it is very important for getting a correct RTT estimate.

It is also useful to note the difference between Linux and FreeBSD in this regard. Whenever possible, Linux avoids trusting the value carried in the TCP Timestamps option, because middle-boxes can meddle with it.

The basic algorithm is: for non-retransmitted packets, use the saved packet send timestamp and the ack arrival time. For retransmitted packets, use the Timestamps option; if that is not enabled, no RTT is calculated for such packets.
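As a rough illustration of that rule, here is a small userspace C sketch. It is not kernel code: pick_rtt_sample_us() and its parameters are hypothetical names made up for this example, and a return value of -1 stands for "no sample can be taken".

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only (not a kernel function).
 *
 * retransmitted  - was the acked packet ever retransmitted?
 * sent_us        - send time saved by the stack, in microseconds
 * ack_arrival_us - arrival time of the ack, in microseconds
 * ts_rtt_us      - RTT derived from the echoed Timestamps option,
 *                  or -1 if the option is not in use
 *
 * Returns an RTT sample in microseconds, or -1 if none may be taken.
 */
static int64_t pick_rtt_sample_us(bool retransmitted, uint64_t sent_us,
                                  uint64_t ack_arrival_us, int64_t ts_rtt_us)
{
    /* Never retransmitted: the ack is unambiguous, use our own clock. */
    if (!retransmitted)
        return (int64_t)(ack_arrival_us - sent_us);

    /* Retransmitted: only the echoed timestamp can tell us anything. */
    return ts_rtt_us >= 0 ? ts_rtt_us : -1;
}
```

For example, a never-retransmitted packet sent at 1000 us and acked at 1250 us yields a 250 us sample regardless of what the Timestamps option says, while a retransmitted packet without timestamps yields no sample at all.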

Let’s look at the code. I am using net-next. When a TCP sender sends packets, it has to wait for acks for those packets before throwing them away. It stores them in a queue called ‘retransmission queue’. When sent packets get acked, tcp_clean_rtx_queue() gets called to clear those packets from the retransmission queue.

A few useful variables in that function are:

seq_rtt_us – uses the first packet from the acked range

ca_rtt_us – uses the last packet from the acked range (mainly used for congestion control)

sack_rtt_us – uses the timestamp of the first sacked packet

tcp_mstamp is a tcp_sock member which represents the timestamp of the most recent packet received/sent. It gets updated by tcp_mstamp_refresh().

For a clean ack (not sack):

seq_rtt_us = ca_rtt_us (as there is no range)

If such a clean ack is also for a non-retransmitted packet:

seq_rtt_us = tcp_stamp_us_delta(tp->tcp_mstamp, first_ackt);

and for a sack which is again for a non-retransmitted packet:

sack_rtt_us = tcp_stamp_us_delta(tp->tcp_mstamp, sack->first_sackt);

The code that updates sack->first_sackt is in tcp_sacktag_one(), where it gets populated when the sack is for a non-retransmitted packet.

tcp_stamp_us_delta() returns the difference, in microseconds, between two timestamps that the stack itself maintains.
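A minimal userspace sketch of such a helper is shown here. The name stamp_us_delta() is hypothetical, and clamping a negative difference to zero is my assumption about how the kernel helper behaves rather than something quoted from the source.

```c
#include <stdint.h>

/*
 * Userspace sketch (hypothetical name): difference between two
 * microsecond timestamps, t1 being the later one.
 * Assumption: a negative difference is clamped to zero.
 */
static uint32_t stamp_us_delta(uint64_t t1, uint64_t t0)
{
    int64_t delta = (int64_t)(t1 - t0);

    return delta > 0 ? (uint32_t)delta : 0;
}
```

With t1 = 1500 and t0 = 1000 this gives 500; with the arguments swapped it gives 0 rather than a huge wrapped value.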

Now tcp_ack_update_rtt() gets called which starts out with:

/* Prefer RTT measured from ACK's timing to TS-ECR. This is because
 * broken middle-boxes or peers may corrupt TS-ECR fields. But
 * Karn's algorithm forbids taking RTT if some retransmitted data
 * is acked (RFC6298).
 */
if (seq_rtt_us < 0)
    seq_rtt_us = sack_rtt_us;

For acks that acknowledge retransmitted packets, seq_rtt_us is negative. But if there is a SACK timestamp from a non-retransmitted packet, that is used instead, as it carries a valid and useful timestamp.

It then falls back to the timestamps provided by the Timestamps option, but only if seq_rtt_us is still negative.

if (seq_rtt_us < 0 && tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
    flag & FLAG_ACKED) {
    u32 delta = tcp_time_stamp(tp) - tp->rx_opt.rcv_tsecr;
    u32 delta_us = delta * (USEC_PER_SEC / TCP_TS_HZ);

    seq_rtt_us = ca_rtt_us = delta_us;
}
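To make the unit conversion concrete, here is a small userspace sketch of the same arithmetic. The function name ts_delta_to_us() is hypothetical, and the value TCP_TS_HZ = 1000 (one timestamp tick per millisecond) is an assumption made for this example.

```c
#include <stdint.h>

#define USEC_PER_SEC 1000000U
#define TCP_TS_HZ    1000U  /* assumption: one timestamp tick per millisecond */

/*
 * Hypothetical helper: convert the gap between the current timestamp
 * clock and the echoed timestamp (TSecr) into microseconds.
 */
static uint32_t ts_delta_to_us(uint32_t now_ts, uint32_t echoed_ts)
{
    /* Unsigned subtraction stays correct across 32-bit wraparound. */
    uint32_t delta = now_ts - echoed_ts;

    return delta * (USEC_PER_SEC / TCP_TS_HZ);
}
```

Under that assumption a gap of 42 ticks becomes 42000 us, which also shows why this path is coarser than the microsecond timestamps the stack keeps for itself: the sample can only be a multiple of one tick.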

By this point, seq_rtt_us holds a sample that can be fed into tcp_rtt_estimator(), which generates the smoothed RTT (more or less based on Van Jacobson’s SIGCOMM 88 paper).
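For reference, the smoothing step itself can be sketched in a few lines of userspace C. This is a simplified EWMA with the classic gain of 1/8, not a copy of tcp_rtt_estimator(): srtt_next() is a hypothetical name, the value 0 is used here to mean "no estimate yet", and the mean-deviation tracking that the real estimator also performs is left out.

```c
#include <stdint.h>

/*
 * Simplified EWMA step with gain 1/8 (illustrative only).
 * srtt_us == 0 is taken to mean "no estimate yet".
 */
static int64_t srtt_next(int64_t srtt_us, int64_t sample_us)
{
    if (srtt_us == 0)
        return sample_us;  /* the first sample seeds the estimate */

    /* srtt = 7/8 * srtt + 1/8 * sample, written as an error correction */
    return srtt_us + (sample_us - srtt_us) / 8;
}
```

Starting from an estimate of 100000 us, a sample of 180000 us moves srtt only to 110000 us, so a single outlier has limited influence. The kernel keeps srtt in a scaled fixed-point form to avoid losing precision in the division.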