Retrieve TCP state details for an open connection without tcpdump?

A

Aaron E

Guest
I have an environment with a few hundred Linux (CentOS/RedHat) servers that all interact with each other through TCP connections. Above the TCP layer is an RPC protocol that has timeouts built into it such that the application will wait for about 10 seconds for a response after sending an RPC request before it considers the request to have timed out and the application takes alternate/corrective action to handle the timeout. I have been observing very intermittently that the RPC layer is logging a timeout, meaning the RPC layer on one end of the connection has sent an request to the other end but an RPC response has not yet ben received.

I am still trying to isolate whether the issue is at the TCP or RPC layer.

To investigate the TCP layer, I've been gathering samples of "netstat -pan" output on every server in the environment, every few seconds, over the course of a few days.

What I've noticed is that when I have a timeout for a request sent from server A to server B, the send queue size is non-zero (as shown by netstat -pan) for the TCP connection on server B. I am presuming that the RPC response has been partially or fully placed into the TCP send queue at that point but for some reason is not being transmitted.

I want to confirm my suspicion and dive further into the root cause. For this type of issue, my go-to is using tcpdump to capture the traffic from both ends of the connection and then analyze it with WireShark to understand the behavior. Unfortunately, given the number of servers involved, the intermittent nature of the issue, and the very high rate of traffic flow (well utilized 10GbE), it is not feasible to run tcpdump for an extended period of time across all machines in the environment.

As the send queue size is staying the same non-zero value for >10 seconds, what I'd like to do is a simple poll of the TCP state from all open connections on a machine every few seconds. Specifically, what I'd like to retrieve for each TCP connection is:

1. The last window size and scale factor advertised by the server
2. The highest acknowledgement number sent by the server for the connection along with any unacknowledged extents (as selective ack is enabled)
3. The next sequence number the server will use to transmit on the connection
4. Whether the TCP socket is corked (as the application uses TCP_CORK)

From this information, I should be able to identify if data has been transmitted but not acknowledged, if the window is exhausted and hence no more data can be sent, or if one of the servers is hitting an application layer bug where the TCP cork is in place and thus the RPC response is being blocked by some logical error in the application.

Unfortunately, I've looked around and could not find a *convenient* way to retrieve such information. I've checked through all the info I could find on the netstat and ss commands, and I've looked through a lot of /proc content but no luck.

I presume something like systemtap could be used, or a custom kernel could be built. But neither of those are particularly attractive options unless there is nothing else that can be done to retrieve the TCP state information for each open connection from a user space process.

Any ideas?
 


Besides using SS (which you seem to already have tried. Did you check every possible option with ss? ) the only real alternative is /proc/net/tcp which, as the above link says, is very slow and not really usable.

Maybe the Linux kernel TCP man page will help?

ss processes through various contents of /proc but definitely does not print TCP layer details like the sequence/acknowledgement/selective acknowledgement details. This content is not accessible through /proc/net/tcp.

As far as I can tell, this content is not accessible through /proc at all and would require retreiving addresses for each socket and retrieving the details from the network stack in kernel space. I was looking for a canned utility which does this so I don't need to go down the route of customizing and rebuilding the kernel or developing a systemtap for it.
 
After significant research, I concluded the answer is "no", from user space you can not retrieve TCP state details like the next SYN, most recent ACK, number of retransmitted packets in flight, etc.

Loading a module into kernel space to debug is really the only way to find TCP retransmits (e.g. packet loss) without statefully inspecting every TCP packet traversing the network interfaces of a server such as through analysis of packet captures. At least, that is the only way using the 2.6.32 kernel, I did not inspect newer kernels.

If you end up in this boat and want to identify whether TCP connections on your server are retransmitting packets, you can use the following systemtap script:



Code:
%{
#include <linux/version.h>
#include <net/sock.h>
#include <net/tcp.h>
#include <net/ip.h>
#include <linux/skbuff.h>
%}

/* Number of retransmitted packets not yet acknowledged */
function tcp_get_info_retrans_out:long(sock:long)
%{ /* pure */
        struct sock *sk = (struct sock *)(long) STAP_ARG_sock;
        struct tcp_sock *tp = tcp_sk(sk);
        STAP_RETVALUE = (int64_t) kread(&(tp->retrans_out));
        CATCH_DEREF_FAULT();
%}

/* Timestamp of the start of last retransmit */
function tcp_get_info_retrans_stamp:long(sock:long)
%{ /* pure */
        struct sock *sk = (struct sock *)(long) STAP_ARG_sock;
        struct tcp_sock *tp = tcp_sk(sk);
        STAP_RETVALUE = (int64_t) kread(&(tp->retrans_stamp));
        CATCH_DEREF_FAULT();
%}

/* Timestamp of last sent data packet */
function tcp_get_info_lsndtime:long(sock:long)
%{ /* pure */
        struct sock *sk = (struct sock *)(long) STAP_ARG_sock;
        struct tcp_sock *tp = tcp_sk(sk);
        STAP_RETVALUE = (int64_t) kread(&(tp->lsndtime));
        CATCH_DEREF_FAULT();
%}

/* Last byte of the most recently transmitted small packet */
function tcp_get_info_snd_sml:long(sock:long)
%{ /* pure */
        struct sock *sk = (struct sock *)(long) STAP_ARG_sock;
        struct tcp_sock *tp = tcp_sk(sk);
        STAP_RETVALUE = (int64_t) kread(&(tp->snd_sml));
        CATCH_DEREF_FAULT();
%}

/* Timestamp of last received ack */
function tcp_get_info_rcv_tstamp:long(sock:long)
%{ /* pure */
        struct sock *sk = (struct sock *)(long) STAP_ARG_sock;
        struct tcp_sock *tp = tcp_sk(sk);
        STAP_RETVALUE = (int64_t) kread(&(tp->rcv_tstamp));
        CATCH_DEREF_FAULT();
%}


function tcp_get_info_dest_ip:long(sock:long)
%{ /* pure */
        struct sock *sk = (struct sock *)(long) STAP_ARG_sock;
        struct tcp_sock *tp = tcp_sk(sk);
        struct inet_connection_sock *ics = (struct inet_connection_sock *)&(tp->inet_conn);
    struct inet_sock *is = (struct inet_sock *)&(ics->icsk_inet);
    STAP_RETVALUE = (int32_t) kread(&(is->daddr));
        CATCH_DEREF_FAULT();
%}

function tcp_get_info_src_ip:long(sock:long)
%{ /* pure */
        struct sock *sk = (struct sock *)(long) STAP_ARG_sock;
        struct tcp_sock *tp = tcp_sk(sk);
        struct inet_connection_sock *ics = (struct inet_connection_sock *)&(tp->inet_conn);
        struct inet_sock *is = (struct inet_sock *)&(ics->icsk_inet);
        STAP_RETVALUE = (int32_t) kread(&(is->saddr));
        CATCH_DEREF_FAULT();
%}

function tcp_get_info_dest_port:long(sock:long)
%{ /* pure */
        struct sock *sk = (struct sock *)(long) STAP_ARG_sock;
        struct tcp_sock *tp = tcp_sk(sk);
        struct inet_connection_sock *ics = (struct inet_connection_sock *)&(tp->inet_conn);
        struct inet_sock *is = (struct inet_sock *)&(ics->icsk_inet);
        STAP_RETVALUE = (int32_t) kread(&(is->dport));
        CATCH_DEREF_FAULT();
%}

function tcp_get_info_src_port:long(sock:long)
%{ /* pure */
        struct sock *sk = (struct sock *)(long) STAP_ARG_sock;
        struct tcp_sock *tp = tcp_sk(sk);
        struct inet_connection_sock *ics = (struct inet_connection_sock *)&(tp->inet_conn);
        struct inet_sock *is = (struct inet_sock *)&(ics->icsk_inet);
        STAP_RETVALUE = (int32_t) kread(&(is->sport));
        CATCH_DEREF_FAULT();
%}

probe kernel.function("tcp_retransmit_skb@net/ipv4/tcp_output.c").return
{
        sock = (@defined($sock) ? $sock->sk : $sk);
        retrans = tcp_get_info_retrans_out(sock);
        if(retrans != 0){
                src_ip = tcp_get_info_src_ip(sock) - 18446744069414584320;
                src_port = tcp_get_info_src_port(sock);
                dest_ip = tcp_get_info_dest_ip(sock) - 18446744069414584320;
                dest_port = tcp_get_info_dest_port(sock);
                retranstime = tcp_get_info_retrans_stamp(sock);
        lastsendtime = tcp_get_info_lsndtime(sock);
        lastacktime = tcp_get_info_rcv_tstamp(sock);
                lastsentbyte = tcp_get_info_snd_sml(sock);
        diff = lastsendtime - retranstime;
                printf("%d %X:%X %X:%X unrespTm %d retrInProg %d retrStart %d lastSend %d lastAck %d nextSyn %X\n", gettimeofday_us(), src_ip, src_port, dest_ip, dest_port, diff, retrans, retranstime, lastsendtime, lastacktime, lastsentbyte);
        }
}






This systemtap script will generate output like:

1433989010361557 F1480A0A:F9E0 F2480A0A:8913 unrespTm 0 retrInProg 1 retrStart 2239333976 lastSend 2239333976 lastAck 2239333771 nextSyn 9D684603
1433989016511563 F1480A0A:F9E0 F2480A0A:8913 unrespTm 6150 retrInProg 1 retrStart 2239333976 lastSend 2239340126 lastAck 2239333771 nextSyn 9D684603


The fields are:
1. Time since epoch (in microseconds)
2. Local IP:Local port (in octets, in little endian notation)
3. Remote IP:Remote port (in octets, in little endian notation)
4. Time since the given connection stopped working (in milliseconds)
5. Number of packets for the given connection for which retransmission is in progress
6. Time at which the most recent packet was first retransmitted (in milliseconds, since networking was started on the system)
7. Time at which the most recent retransmission was attempted (in milliseconds, since networking was started on the system)
8. Time at which the most recent ACK was received from remote system (in milliseconds, since networking was started on the system)
9. The SYN the system will use the next time new data is transmitted over the connection

The IP/port part should be straightforward. The remaining fields tell us about the packet loss that is happening. Specifically, we can infer that a connection has been "stuck" for the time delta between when the first retransmission attempt for a packet occurred to when the most recent retransmission attempt occurred, provided the time at which the most recent ACK was received or the next SYN to use does not change during that time.

For example, this line:
1433989016511563 F1480A0A:F9E0 F2480A0A:8913 unrespTm 6150 retrInProg 1 retrStart 2239333976 lastSend 2239340126 lastAck 2239333771 nextSyn 9D684603

This tells us that at time 1433989016511563, the connection between 10.10.72.241:57593 (the machine where this message was logged) and 10.10.72.242:5001 (the remote machine) has not been working for 6.150 seconds. There is 1 packet which is currently being retransmitted between the systems. The time the retransmit was first attempted for this packet is 2239333976 while the most recent attempt was at 2239340126, which again yields 6150(ms). The last time an ACK was received from the remote system was at 2239333771 which is earlier than 2239333976, so we can infer that no ACK has been received from the remote system during this entire time that retransmissions have been occuring. Further, if you compare the earlier sample against this most recent sample, you can see the next SYN has stayed the same at 9D684603, which implies that the local machine has not tried to send any new data to the remote machine since these retransmits have started.
 
Top