Bug #332

the new SFQ with low limits and many flows exhibits starvation issues

Added by Dave Täht on Jan 30, 2012. Updated on Apr 19, 2012.
Closed Urgent David Taht

Description

I am getting some reports of SFQ’s new ‘head drop’ orientation causing starvation.

More details to come

History

Updated by David Taht on Jan 30, 2012.
———- Forwarded message ———-
From: Jesper Dangaard Brouer jdb@comx.dk
Date: Thu, Jan 26, 2012 at 10:38 AM
Subject: Problem with the SFQ patch: dont put new flow at the end of flows
To: Eric Dumazet eric.dumazet@gmail.com
Cc: Dave Taht dave.taht@gmail.com

Hi Eric
(Cc Dave Taht)

I think I have found a problem with your SFQ patch:
 sch_sfq: dont put new flow at the end of flows (d47a0ac7b66 net-next).

Do you want to discuss this on netdev or offlist?

My preliminary finding is, that is works really nice, until I reach the
queue “limit”, then new flow almost cannot get through.  With the patch
reverted this is not a problem, although of-cause with significantly
higher latency.

I’m currently running v3.2.1-stable with you and other patches applied.
I will re-run my tests on a clean net-next soon (just need my SFP+ patch
on top)… just to verify the results.

My setup is a HTB root node with a 2Mbit/s bandwidth limit, and a SFQ
qdisc attached, which have limit 10.
(This is on a 10Gbit/s NIC, thats not a problem, right?)

albpd42:~# tc class ls dev eth61
class htb 1: root leaf 42: prio 0 rate 2000Kbit ceil 2000Kbit burst
1600b cburst 1600b

albpd42:~# tc qdisc ls dev eth61
qdisc htb 1: root refcnt 65 r2q 10 default 0 direct_packets_stat 0
qdisc sfq 42: parent 1: limit 10p quantum 1514b

The test that revels the problem is fairly simple:

1. xterm running “ping -nA 10.10.61.2” as non-root user, thus minimal
ping interval is 200msec.

2. On sink 10.10.61.2 run “iperf -s -i10”

3. On source 10.10.61.1, I run: “iperf -c 10.10.61.2 -t 1000 -i 2 -P XXX”

If the number of parallel connection is below 10 (-P 10), then is works
very good, with very low ping times.

If the number of parallel connection is above 10, that is larger than
the SFQ limit on 10, things turn bad… and the ping drops almost all
packets.   With the patch reverted, ping packets get through, with a
higher latency, but also with some dropped packets (but no as bad as the
other case).

I don’t know the SFQ code well enough to fix the issue my self, so I’m
asking for you guidance…

Another corrolation is, that the problem occurs, when the following
command reach the SFQ limit of 10.

 tc class ls dev eth61 | grep ‘sfq’ | wc -l

Updated by David Taht on Jan 30, 2012.
———- Forwarded message ———-
From: Jesper Dangaard Brouer jdb@comx.dk
Date: Thu, Jan 26, 2012 at 11:57 AM
Subject: Re: Problem with the SFQ patch: dont put new flow at the end of flows
To: Eric Dumazet eric.dumazet@gmail.com
Cc: Dave Taht dave.taht@gmail.com

On Thu, 2012-01-26 at 11:03 +0100, Eric Dumazet wrote:
> Le jeudi 26 janvier 2012 à 10:38 +0100, Jesper Dangaard Brouer a écrit :
[…]
> > Another corrolation is, that the problem occurs, when the following
> > command reach the SFQ limit of 10.
> >
> >  tc class ls dev eth61 | grep ‘sfq’ | wc -l
> >
>
> Hi Jesper
>
> If you have more than 10 flows and a queue limit of 10, then each time a
> new packet is enqueued, an ‘old packet in queue’ must be chosen and
> dropped.
>
> This ‘old packet’ is chosen, using the “longest flow”.
>
> But if all flows have one packet, then what can we do …

I think this is the exact case, all flows have one packet.
This is also given by the “tc | wc -l” command.

It can be the “ping packet” by luck. Why would it be more important than
an other ?

Because it represent a new connection establishment.

# vi +341 net/sched/sch_sfq.c
        if (d == 1) {
                /* It is difficult to believe, but ALL THE SLOTS HAVE LENGTH 1. */
                x = q->tail->next;
                slot = &q->slots[x];
                q->tail->next = slot->next;
                q->ht[slot->hash] = SFQ_EMPTY_SLOT;
                goto drop;
        }

This part of the code really takes whatever packet, and changing it might
solve your problem but is it a real one ?

This is an oversimplified example,

There is no guarantee the ‘ping packet’ cannot be dropped, with old SFQ or new SFQ.

The probability of dropping is just significantly higher in the new
SFQ.

On Thu, 2012-01-26 at 11:22 +0100, Eric Dumazet wrote:
Le jeudi 26 janvier 2012 à 11:03 +0100, Eric Dumazet a écrit :
>
> By the way, above code implements a head-drop :)

Aha - thats why the ping/new-conn packets are getting dropped in the new
SFQ, as new flows are put a head in the queue!

Mostbly because a tail-drop would be very expensive : We would need to
add a double linked list instead of a single link, or scan the whole
list to find the element right before the tail.

Yes, we want/need an inexpensive operation here.
I just think its problematic that new flows will suffer.

The real-life case that I’m worried about with this drop behaviour, is
the fairness among flows, e.g. TCP flows.  Once we reach the queue
limit, number of TCP flows, no new TCP flows can be established, thus
being unfair to new flows.  Guess, I have to find a test showing this…

Updated by David Taht on Jan 30, 2012.
———- Forwarded message ———-
From: Eric Dumazet eric.dumazet@gmail.com
Date: Thu, Jan 26, 2012 at 3:33 PM
Subject: Re: Problem with the SFQ patch: dont put new flow at the end of flows
To: jdb@comx.dk
Cc: Dave Taht dave.taht@gmail.com

Le jeudi 26 janvier 2012 à 15:04 +0100, Jesper Dangaard Brouer a écrit :
> On Thu, 2012-01-26 at 14:50 +0100, Jesper Dangaard Brouer wrote:
> > > Ah. I was not running with a limit that low, but at 50-300.
> >
> > With a packet limit <= 128, the results should be reproduce-able, as
> > the “depth” is fixed on 128.
> >
> > I’m using a too old “tc” to adjust the “depth”.
>
> Just, git downloaded the latest “tc”.
>
> And I think I mean “flows” instead of “depth” in the above statements,
> sorry! (after reading the just committed manual update by Eric ;-))
>
> I’m still a little confused.  How is the “flows” related to the hash
> table size “divisor”?
> Should I reduce “divisor” to 256, to save memory, as in my setup I have
> a sfq per customer (per interface). As, I guess I can live with some
> collisions with 127 flows?
>
>
>

Hmm, if you reduce hash divisor, you also favor hash collisions.

It all depends on the memory you want to save I would say.

1024 divisor is quite good for 127 flows I guess.

Updated by David Taht on Jan 30, 2012.
———- Forwarded message ———-
From: Jesper Dangaard Brouer jdb@comx.dk
Date: Mon, Jan 30, 2012 at 11:17 PM
Subject: Re: Problem with the SFQ patch: dont put new flow at the end of flows
To: Eric Dumazet eric.dumazet@gmail.com
Cc: Dave Taht dave.taht@gmail.com

On Thu, 2012-01-26 at 12:12 +0100, Eric Dumazet wrote:

To avoid this problem, you can limit number of flows to N and number
of packets to N+1 : This way there is always one flow with 2 packets.

That does help, with packets = flows+1, but its not stable and
“falls-back” into the bad mode (of a single flow winning).  I have to
increase it some more to get good results.

E.g. limit 20 and flows 10, allows 10 TCP flows to share the bandwidth,
BUT the ping program does NOT get any packets through (thus, no new
connections can be established, which is bad).

Quite frankly, SFQ should not be limited to 10 packets.

I can provoke the exact same behaviour with the default 127 packets, and
129 TCP flows (thrulay -m 129 10.10.61.2).  Only a single flow out of
129 gets any throughput.

Side note:
I noticed that adding SFQ without any params, will give limit 127p and
flows 128.

albpd42:~# ~jdb/git/iproute2/tc/tc -d -s q ls dev eth61
qdisc htb 1: root refcnt 65 r2q 10 default 2 direct_packets_stat 0 ver 3.17
 Sent 405048608 bytes 293646 pkt (dropped 74779, overlimits 660160 requeues 0)
 backlog 0b 127p requeues 0
qdisc sfq 2: parent 1:2 limit 127p quantum 1514b depth 127 flows
1281024 divisor 1024
 Sent 12350969 bytes 8748 pkt (dropped 3369, overlimits 0 requeues 0)
 backlog 189382b 127p requeues 0

But I cannot create the same manually config by:

albpd42:~# ~jdb/git/iproute2/tc/tc qdisc del dev eth61 parent 1:2 handle 2:
albpd42:~# ~jdb/git/iproute2/tc/tc qdisc add dev eth61 parent 1:2
handle 2: sfq limit 127 flows 128
albpd42:~# ~jdb/git/iproute2/tc/tc -d -s q ls dev eth61
qdisc htb 1: root refcnt 65 r2q 10 default 2 direct_packets_stat 0 ver 3.17
 Sent 410090878 bytes 297341 pkt (dropped 75611, overlimits 668278 requeues 0)
 backlog 0b 0p requeues 0
qdisc sfq 2: parent 1:2 limit 127p quantum 1514b depth 127 flows
1271024 divisor 1024
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0

Guess it makes sense that I’m not allowed to create more flows than
packets, but the default config does that… strange.

Updated by David Taht on Jan 30, 2012.
———- Forwarded message ———-
From: Jesper Dangaard Brouer jdb@comx.dk
Date: Thu, Jan 26, 2012 at 1:15 PM
Subject: Re: Problem with the SFQ patch: dont put new flow at the end of flows
To: Eric Dumazet eric.dumazet@gmail.com
Cc: Dave Taht dave.taht@gmail.com

On Thu, 2012-01-26 at 11:57 +0100, Jesper Dangaard Brouer wrote:
> The real-life case that I’m worried about with this drop behaviour, is
> the fairness among flows, e.g. TCP flows.  Once we reach the queue
> limit, number of TCP flows, no new TCP flows can be established, thus
> being unfair to new flows.  Guess, I have to find a test showing
> this…

The fairness among flows seems to be broken with the patch.

In using the tool “thrulay” as it gives a nice output per TCP
connection. The important column is “Mb/s” (is also have RTT which is
also very useful, when troubleshooting latency/queue issues, but its not
important here).

Command starting 12 TCP flows:
 albpd42:~# thrulay -m 12 10.10.61.2

I have only included the sum after 60 sec test.

First test without patch:

  1. local window = 262142B; remote window = 262142B
  2. block size = 8192B
  3. MTU: 1500B; MSS: 1448B; Topology guess: Ethernet/PPP
  4. MTU = getsockopt(IP_MTU); MSS = getsockopt(TCP_MAXSEG)
  5. test duration = 60s; reporting interval = 1s
    [… cut …]
    #(ID) begin, s   end, s  Mb/s    RTT, ms: min  avg     max
    #( 0)    0.000   60.000    0.085  614.475 10510.368 37648.921
    #( 1)    0.000    0.000    0.000    0.000    0.000    0.000
    #( 2)    0.000   60.000    0.001 49749.240 49749.240 49749.240
    #( 3)    0.000   60.000    0.198  389.384 4483.202 7644.653
    #( 4)    0.000   60.000    0.167  401.534 5262.690 16491.171
    #( 5)    0.000   60.000    0.201  413.688 4410.953 9893.164
    #( 6)    0.000   60.000    0.167  438.016 5293.815 10324.560
    #( 7)    0.000   60.000    0.226  468.394 3982.983 5716.139
    #( 8)    0.000   60.000    0.218  486.621 4117.844 6094.251
    #( 9)    0.000   60.000    0.189  517.020 4790.583 10386.741
    #(10)    0.000   60.000    0.209  480.531 4331.842 7406.696
    #(11)    0.000   60.000    0.223  504.839 4001.091 5681.422
    #(****)    0.000   60.000    1.884  389.384 8411.218 49749.240

Notice the fairshare between the flows.
Also notice that flow id (1) has an “end,s” of 0,  I think than means
that it was not able establish the connection.

Second test with the patch:

albpd42:~# thrulay -m 12 10.10.61.2

  1. local window = 262142B; remote window = 262142B
  2. block size = 8192B
  3. MTU: 1500B; MSS: 1448B; Topology guess: Ethernet/PPP
  4. MTU = getsockopt(IP_MTU); MSS = getsockopt(TCP_MAXSEG)
  5. test duration = 60s; reporting interval = 1s
    #(ID) begin, s   end, s  Mb/s    RTT, ms: min  avg     max
    #( 0)    0.000    0.000    0.000    0.000    0.000    0.000
    #( 1)    0.000    0.000    0.000    0.000    0.000    0.000
    #( 2)    0.000    0.000    0.000    0.000    0.000    0.000
    #( 3)    0.000    0.000    0.000    0.000    0.000    0.000
    #( 4)    0.000   60.000    1.902  127.884  489.294  583.101
    #( 5)    0.000    0.000    0.000    0.000    0.000    0.000
    #( 6)    0.000    0.000    0.000    0.000    0.000    0.000
    #( 7)    0.000    0.000    0.000    0.000    0.000    0.000
    #( 8)    0.000    0.000    0.000    0.000    0.000    0.000
    #( 9)    0.000    0.000    0.000    0.000    0.000    0.000
    #(10)    0.000    0.000    0.000    0.000    0.000    0.000
    #(11)    0.000    0.000    0.000    0.000    0.000    0.000
    #(****)    0.000   60.000    1.902  127.884   40.774  583.101

This result show a very unfair sharing.  Only one TCP flow are getting
data through! – I expected to see a more unfair behaviour towards new
flows, but this is a bit extreme… something else must be wrong! Cannot
understand this change should make it that unfair!

I have tried to re-run the tests several times.  I have also tested with
iperf, and I get similar results:

[ 14]  0.0-60.0 sec  13.6 MBytes  1.90 Mbits/sec
[ 12]  0.0-60.1 sec  56.0 KBytes  7.64 Kbits/sec
[  9]  0.0-60.1 sec  16.0 KBytes  2.18 Kbits/sec
[  7]  0.0-60.2 sec  16.0 KBytes  2.18 Kbits/sec
[  6]  0.0-60.2 sec  16.0 KBytes  2.18 Kbits/sec
[SUM]  0.0-60.2 sec  14.2 MBytes  1.98 Mbits/sec
[  5]  0.0-60.4 sec  16.0 KBytes  2.17 Kbits/sec
[  3]  0.0-60.5 sec  16.0 KBytes  2.17 Kbits/sec
[ 11]  0.0-60.5 sec  16.0 KBytes  2.17 Kbits/sec
[  8]  0.0-74.5 sec  16.0 KBytes  1.76 Kbits/sec
[ 13]  0.0-79.6 sec  16.0 KBytes  1.65 Kbits/sec

Flow [14] gets 1.9Mbit/s, while the rest gets below 7.64 Kbits/sec.

I don’t understand why this small patch changed the fairness so much?!?

Updated by David Taht on Feb 1, 2012.
Jasper:

I have duplicated your starvation problem, even with sfqred on cerowrt bql-30.

I had hoped that setting the depth would help. It didn’t.

on the 3.3 based laptop:

tc -b <AAA
qdisc del dev eth0 root
qdisc add dev eth0 root handle 1: est 1sec 4sec htb default 1
class add dev eth0 parent 1: classid 1:1 est 1sec 4sec htb rate
4000kbit mtu 1500 quantum 1500
qdisc add dev eth0 parent 1:1 handle a: est 1sec 4sec sfq limit 200
headdrop perturb 60000 flows 500 divisor 16384 redflowlimit 40000
depth 24 min 4500 max 9000 probability 0.20 ecn harddrop
AAA

The cerowrt box is running:

root@OpenWrt:~# tc qdisc show dev se00
qdisc sfq a: root refcnt 2 limit 127p quantum 1514b depth 127 divisor 1024

running 10 iperfs (iperf -t 60 -c 172.30.42.1) from the laptop to the
cerowrt box.

Note the first iperf completely starves out the last 9 connection attempts, so
they don’t even start.

[ 4] 0.0-60.5 sec 27.0 MBytes 3.74 Mbits/sec
[ 4] local 172.30.42.1 port 5001 connected with 172.30.42.17 port 41471
[ 4] 0.0- 1.6 sec 768 KBytes 3.83 Mbits/sec
[ 4] local 172.30.42.1 port 5001 connected with 172.30.42.17 port 41472
[ 4] 0.0- 0.4 sec 256 KBytes 5.30 Mbits/sec
[ 4] local 172.30.42.1 port 5001 connected with 172.30.42.17 port 41473
[ 4] 0.0- 0.4 sec 256 KBytes 5.41 Mbits/sec
[ 4] local 172.30.42.1 port 5001 connected with 172.30.42.17 port 41474
[ 4] 0.0- 0.4 sec 256 KBytes 5.41 Mbits/sec
[ 4] local 172.30.42.1 port 5001 connected with 172.30.42.17 port 41475
[ 4] 0.0- 0.4 sec 256 KBytes 5.41 Mbits/sec
[ 4] local 172.30.42.1 port 5001 connected with 172.30.42.17 port 41476
[ 4] 0.0- 0.4 sec 256 KBytes 5.43 Mbits/sec
[ 4] local 172.30.42.1 port 5001 connected with 172.30.42.17 port 41479
[ 4] 0.0- 1.5 sec 768 KBytes 4.23 Mbits/sec


For giggles, I tried it with the same sfqred setting on both sides, with the
same sad result.

For the sake of argument, I’m going to try this with a similar qfq setup.

Updated by David Taht on Feb 1, 2012.
Sigh. I went back and installed iperf-mt, rather than iperf, rebuilt a kernel
for cerowrt with CONFIG_NO_HZ and CONFIG_HZ_256 (which is how I
normally build it)

and can’t trigger it now.

On Wed, Feb 1, 2012 at 11:37 PM, Dave Taht dave.taht@gmail.com wrote:
> Jasper:
>
> I have duplicated your starvation problem, even with sfqred on cerowrt bql-30.
>
> I had hoped that setting the depth would help. It didn’t.
>
> on the 3.3 based laptop:
>
> tc -b qdisc del dev eth0 root
> qdisc add dev eth0 root handle 1: est 1sec 4sec htb default 1
> class add dev eth0 parent 1: classid 1:1 est 1sec 4sec htb rate
> 4000kbit mtu 1500 quantum 1500
> qdisc add dev eth0 parent 1:1 handle a: est 1sec 4sec sfq limit 200
> headdrop perturb 60000 flows 500 divisor 16384 redflowlimit 40000
> depth 24 min 4500 max 9000 probability 0.20 ecn harddrop
> AAA
>
> The cerowrt box is running:
>
> root@OpenWrt:~# tc qdisc show dev se00
> qdisc sfq a: root refcnt 2 limit 127p quantum 1514b depth 127 divisor 1024
>
> running 10 iperfs (iperf -t 60 -c 172.30.42.1) from the laptop to the
> cerowrt box.
>
> Note the first iperf completely starves out the last 9 connection attempts, so
> they don’t even start.
>
>
> [  4]  0.0-60.5 sec  27.0 MBytes  3.74 Mbits/sec
> [  4] local 172.30.42.1 port 5001 connected with 172.30.42.17 port 41471
> [  4]  0.0- 1.6 sec   768 KBytes  3.83 Mbits/sec
> [  4] local 172.30.42.1 port 5001 connected with 172.30.42.17 port 41472
> [  4]  0.0- 0.4 sec   256 KBytes  5.30 Mbits/sec
> [  4] local 172.30.42.1 port 5001 connected with 172.30.42.17 port 41473
> [  4]  0.0- 0.4 sec   256 KBytes  5.41 Mbits/sec
> [  4] local 172.30.42.1 port 5001 connected with 172.30.42.17 port 41474
> [  4]  0.0- 0.4 sec   256 KBytes  5.41 Mbits/sec
> [  4] local 172.30.42.1 port 5001 connected with 172.30.42.17 port 41475
> [  4]  0.0- 0.4 sec   256 KBytes  5.41 Mbits/sec
> [  4] local 172.30.42.1 port 5001 connected with 172.30.42.17 port 41476
> [  4]  0.0- 0.4 sec   256 KBytes  5.43 Mbits/sec
> [  4] local 172.30.42.1 port 5001 connected with 172.30.42.17 port 41479
> [  4]  0.0- 1.5 sec   768 KBytes  4.23 Mbits/sec
>
> ****
>
> For giggles, I tried it with the same sfqred setting on both sides, with the
> same sad result.
>
> For the sake of argument, I’m going to try this with a similar qfq setup.

Updated by David Taht on Feb 2, 2012.
On Thu, Feb 2, 2012 at 12:05 PM, Jesper Dangaard Brouer jdb@comx.dk wrote:
> On Thu, 2012-02-02 at 01:44 +0000, Dave Taht wrote:
>> Sigh. I went back and installed iperf-mt, rather than iperf, rebuilt a kernel
>> for cerowrt with CONFIG_NO_HZ and CONFIG_HZ_256 (which is how I
>> normally build it)
>>
>> and can’t trigger it now.
>
> That make perfect sense to me.  By running a multi threaded version of
> iperf, the flows/packets gets a chance to be scheduled individually, and
> thus the packets have, a better chance, of not getting
> “grouped-together” in the qdiscs queue.  You further improve this by
> using CONFIG_NO_HZ.
>
> I would still claim, that this situation can still happen for real-life
> applications, which have many TCP connections, and have not implemented
> this in a multi-threaded manor.

Perhaps you mis-understood the ‘sigh’. I have little doubt this problem
can occur. I was happy that I could reproduce it with a non-default
setup for me.

I will continue to try to reproduce. Another thought is merely that non-threaded
iperf will refuse a connection entirely…

As for CONFIG_NO_HZ, it was my impression that HTB performed more
accurately and better with that and hires timers on.

Updated by Dave Täht on Apr 13, 2012.
I have temporarily restored this patch to cerowrt, because I have some hope htb can be fixed….
Updated by Dave Täht on Apr 19, 2012.
I have thus far found it impossible to create bad behavior when sfqred is used properly - flows prime or near prime in the 24-53 range, low min, 200-300 packets, and redflowlimit. I have created a kernel module that can change the behavior from hoq to toq at module load time, and am testing elsewhere.

Pushing forward to head of queue has wonderful effects on mice, dns, and ants.

The behavior I see in shooting elephants, is fine by me.

Updated by Dave Täht on Apr 19, 2012.

This is a static export of the original bufferbloat.net issue database. As such, no further commenting is possible; the information is solely here for archival purposes.
RSS feed

Recent Updates

Jan 3, 2025 Wiki page
Bufferbloat FAQs
Dec 2, 2024 Wiki page
What Can I Do About Bufferbloat?
Jul 21, 2024 Wiki page
cake-autorate
Jul 21, 2024 Wiki page
Tests for Bufferbloat
Jul 1, 2024 Wiki page
RRUL Chart Explanation

Find us elsewhere

Bufferbloat Mailing Lists
#bufferbloat on Twitter
Google+ group
Archived Bufferbloat pages from the Wayback Machine

Sponsors

Comcast Research Innovation Fund
Nlnet Foundation
Shuttleworth Foundation
GoFundMe

Bufferbloat Related Projects

OpenWrt Project
Congestion Control Blog
Flent Network Test Suite
Sqm-Scripts
The Cake shaper
AQMs in BSD
IETF AQM WG
CeroWrt (where it all started)

Network Performance Related Resources


Jim Gettys' Blog - The chairman of the Fjord
Toke's Blog - Karlstad University's work on bloat
Voip Users Conference - Weekly Videoconference mostly about voip
Candelatech - A wifi testing company that "gets it".