Bug #400

Fwd: [RFC PATCH] tcp: limit data skbs in qdisc layer

Added by David Taht on Jul 5, 2012. Updated on Apr 11, 2013.
Closed Normal Dave Täht

Description

———- Forwarded message ———-
From: Eric Dumazet eric.dumazet@gmail.com
Date: Wed, Jul 4, 2012 at 6:11 AM
Subject: [RFC PATCH] tcp: limit data skbs in qdisc layer
To: David Miller davem@davemloft.net
Cc: Yuchung Cheng ycheng@google.com, Dave Taht
dave.taht@gmail.com, netdev netdev@vger.kernel.org,
codel@lists.bufferbloat.net, Tom Herbert therbert@google.com, Matt
Mathis mattmathis@google.com, Nandita Dukkipati
nanditad@google.com, Neal Cardwell ncardwell@google.com, Andrew
McGregor andrewmcgr@gmail.com

On Fri, 2012-06-29 at 06:51 +0200, Eric Dumazet wrote:

> My long term plan is to reduce number of skbs queued in Qdisc for TCP
> stack, to reduce RTT (removing the artificial RTT bias because of local
> queues)

preliminar patch to give the rough idea :

sk->sk_wmem_alloc not allowed to grow above a given limit,
allowing no more than ~4 segments [1] per tcp socket in qdisc layer at a
given time. (if TSO is enabled, then a single TSO packet hits the limit)

This means we divert sock_wfree() to a tcp_wfree() handler, to
queue/send following frames when skb_orphan() is called for the already
queued skbs.

Note : While this lowers artificial cwnd and rtt, this means more work
has to be done by tx completion handler (softirq instead of preemptable
process context)

To reduce this possible source of latency, we can skb_try_orphan(skb) in
NIC drivers ndo_start_xmit() instead of waiting TX completion, but do
this orphaning only on BQL enabled drivers, or in drivers known to
possibly delay TX completion (NIU is an example, as David had to revert
BQL because of this delaying)

results on my dev machine (tg3 nic) are really impressive, using
standard pfifo_fast, and with or without TSO/GSO

I no longer have 3MBytes backlogged in qdisc by a single netperf
session, and both side socket autotuning no longer use 4 Mbytes.

tcp_write_xmit() call is probably very naive and lacks proper tests.

Note :

[1] because sk_wmem_alloc accounts skb truesize, mss*4 test allow less
then 4 frames, but it seems ok.

drivers/net/ethernet/broadcom/tg3.c | 1
include/linux/skbuff.h | 6 **+
include/net/sock.h | 1
net/ipv4/tcp_output.c | 44 **+-
4 files changed, 51 insertions(+), 1 deletion(-)

diff –git a/drivers/net/ethernet/broadcom/tg3.c
b/drivers/net/ethernet/broadcom/tg3.c
index e47ff8b..ce0ca96 100644
— a/drivers/net/ethernet/broadcom/tg3.c
**+ b/drivers/net/ethernet/broadcom/tg3.c
@ -7020,6 +7020,7@ static netdev_tx_t tg3_start_xmit(struct sk_buff
*skb, struct net_device *dev)

skb_tx_timestamp(skb);
netdev_tx_sent_queue(txq, skb->len);
+ skb_try_orphan(skb);

/* Sync BD data before updating mailbox */
wmb();
diff –git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 885c9bd..c4d4d15 100644
— a/include/linux/skbuff.h
**+ b/include/linux/skbuff.h
@ -1667,6 +1667,12@ static inline void skb_orphan(struct sk_buff *skb)
skb->sk = NULL;
}

+static inline void skb_try_orphan(struct sk_buff *skb)
+{
+ if (!skb_shinfo(skb)->tx_flags)
+ skb_orphan(skb);
+}
+
/**
* __skb_queue_purge - empty a list
* list: list to empty diff --git a/include/net/sock.h b/include/net/sock.h index 640432a..2ce17b1 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @ -1389,6 +1389,7 @@ extern void release_sock(struct sock *sk);

/* BH context may only use the following locking interface. */
#define bh_lock_sock(_sk) spin_lock(&((_sk)->sk_lock.slock))
+#define bh_trylock_sock(_sk) spin_trylock(&((_sk)->sk_lock.slock))
#define bh_lock_sock_nested(__sk) \
spin_lock_nested(&((__sk)->sk_lock.slock), \
SINGLE_DEPTH_NESTING)
diff –git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index c465d3e..4e6ef82 100644
— a/net/ipv4/tcp_output.c
**+ b/net/ipv4/tcp_output.c
@ -65,6 +65,8@ int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
int sysctl_tcp_cookie_size __read_mostly = 0; /* TCP_COOKIE_MAX */
EXPORT_SYMBOL_GPL(sysctl_tcp_cookie_size);

+static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
+ int push_one, gfp_t gfp);

/* Account for new data that has been sent to the network. */
static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
@ -783,6 +785,34@ static unsigned int
tcp_established_options(struct sock *sk, struct sk_buff *skb
return size;
}

+/*
+ * Write buffer destructor automatically called from kfree_skb.
+ */
+void tcp_wfree(struct sk_buff *skb)
+{
+ struct sock *sk = skb->sk;
+ unsigned int len = skb->truesize;
+
+ atomic_sub(len - 1, &sk->sk_wmem_alloc);
+ if (bh_trylock_sock(sk)) {
+ if (!sock_owned_by_user(sk)) {
+ if ((1 << sk->sk_state) &
+ (TCPF_CLOSE_WAIT | TCPF_ESTABLISHED))
+ tcp_write_xmit(sk,
+ tcp_current_mss(sk),
+ 0, 0,
+ GFP_ATOMIC);
+ } else {
+ /* TODO : might signal something here
+ * so that user thread can call tcp_write_xmit()
+ */
+ }
+ bh_unlock_sock(sk);
+ }
+
+ sk_free(sk);
+}
+
/* This routine actually transmits TCP packets queued in by
* tcp_do_sendmsg(). This is used by both the initial
* transmission and possible later retransmissions.
@ -844,7 +874,11@ static int tcp_transmit_skb(struct sock *sk,
struct sk_buff *skb, int clone_it,

skb_push(skb, tcp_header_size);
skb_reset_transport_header(skb);
- skb_set_owner_w(skb, sk);
+
+ skb_orphan(skb);
+ skb->sk = sk;
+ skb->destructor = tcp_wfree;
+ atomic_add(skb->truesize, &sk->sk_wmem_alloc);

/* Build TCP header and checksum it. */
th = tcp_hdr(skb);
@ -1780,6 +1814,14@ static bool tcp_write_xmit(struct sock *sk,
unsigned int mss_now, int nonagle,
while ((skb = tcp_send_head(sk))) {
unsigned int limit;

  • /* yes, sk_wmem_alloc accounts skb truesize, so mss_now * 4
  • * is a lazy approximation of our needs, but it seems ok
  • */
  • if (atomic_read(&sk->sk_wmem_alloc) >= mss_now * 4) {
  • pr_debug(“here we stop sending frame because
    sk_wmem_alloc %d\n”,
  • atomic_read(&sk->sk_wmem_alloc));
  • break;
  • }
    tso_segs = tcp_init_tso_segs(sk, skb, mss_now);
    BUG_ON(!tso_segs);

History

Updated by David Taht on Oct 26, 2012.
TCP small queuses stuff is upstream in 3.6 for this now
Updated by David Taht on Apr 11, 2013.

This is a static export of the original bufferbloat.net issue database. As such, no further commenting is possible; the information is solely here for archival purposes.
RSS feed

Recent Updates

Jan 3, 2025 Wiki page
Bufferbloat FAQs
Dec 2, 2024 Wiki page
What Can I Do About Bufferbloat?
Jul 21, 2024 Wiki page
cake-autorate
Jul 21, 2024 Wiki page
Tests for Bufferbloat
Jul 1, 2024 Wiki page
RRUL Chart Explanation

Find us elsewhere

Bufferbloat Mailing Lists
#bufferbloat on Twitter
Google+ group
Archived Bufferbloat pages from the Wayback Machine

Sponsors

Comcast Research Innovation Fund
Nlnet Foundation
Shuttleworth Foundation
GoFundMe

Bufferbloat Related Projects

OpenWrt Project
Congestion Control Blog
Flent Network Test Suite
Sqm-Scripts
The Cake shaper
AQMs in BSD
IETF AQM WG
CeroWrt (where it all started)

Network Performance Related Resources


Jim Gettys' Blog - The chairman of the Fjord
Toke's Blog - Karlstad University's work on bloat
Voip Users Conference - Weekly Videoconference mostly about voip
Candelatech - A wifi testing company that "gets it".