Bug #442

Wifi interfaces stop transmitting at least some traffic types after much traffic

Added by Dave Täht on Apr 5, 2014. Updated on Oct 7, 2014.
Closed Immediate Dave Täht

Description

So far we have seen this bug only on wifi interfaces: after a significant amount of
traffic has been transferred, some traffic (notably tcp synacks, but other forms of
traffic too) stops being transmitted back to the STA.

We have thus far ruled out syn flood protection and 6in4 encapsulation. Some users never
see the problem; others can get it to happen in a few hours.

Attachments

  • local2.cap (application/vnd.tcpdump.pcap; 12.0 kiB) Dave Täht May 14, 2014
  • signature.asc (application/pgp-signature; 497 bytes) Rich Brown Aug 16, 2014

History

Updated by David Taht on Apr 5, 2014.
---------- Forwarded message ----------
From: Dave Taht dave.taht@gmail.com
Date: Wed, Apr 2, 2014 at 7:43 PM
Subject: Re: [Cerowrt-devel] cerowrt-3.10.34-4 dev build released
To: Stephen Hemminger stephen@networkplumber.org
Cc: “cerowrt-devel@lists.bufferbloat.net” cerowrt-devel@lists.bufferbloat.net

I am actually far from convinced it is actually a wifi bug. It could
be something going wrong with routing, firewalling, nat, or something
else entirely. I have several captures of sw00 and ge00 taken after
the event occurs, and local udp, arp, and icmp and icmpv6 traffic is
working correctly. As is multicast.

The other device (sw10) stays running…

What I see in the captures I have is that syn attempts from the sw00
interface do make it to the internet, and syn/ack attempts do return
through ge00, but do not make it through sw00. However, I don’t see ANY
local syn attempts in the capture I have: jg or someone needs to try a
local tcp connection to a local device, or through the local router to a
local ethernet device, after having it hang… (I will keep trying to
reproduce here)

tcp.flags == 0x0002
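
The check described above (bare SYNs, matched by the `tcp.flags == 0x0002` filter, that never see a SYN/ACK come back on the same interface) can also be sketched offline. This is only an illustration: it assumes the packets have already been exported from a capture into simple tuples, and the `Pkt` type, the addresses and the sample data are hypothetical, not taken from the real captures.

```python
from typing import NamedTuple

SYN, ACK = 0x02, 0x10  # TCP flag bits; flags == 0x02 is a bare SYN


class Pkt(NamedTuple):
    src: str
    sport: int
    dst: str
    dport: int
    flags: int


def unanswered_syns(packets):
    """Return the flows that sent a bare SYN but never saw a SYN/ACK."""
    pending = set()
    for p in packets:
        if p.flags == SYN:
            pending.add((p.src, p.sport, p.dst, p.dport))
        elif p.flags & (SYN | ACK) == (SYN | ACK):
            # The SYN/ACK travels in the reverse direction of the SYN.
            pending.discard((p.dst, p.dport, p.src, p.sport))
    return pending


# Hypothetical view of sw00: one handshake answered, one retried in vain.
sw00 = [
    Pkt("172.30.42.10", 40000, "192.0.2.1", 80, SYN),
    Pkt("192.0.2.1", 80, "172.30.42.10", 40000, SYN | ACK),
    Pkt("172.30.42.10", 40001, "192.0.2.1", 80, SYN),
    Pkt("172.30.42.10", 40001, "192.0.2.1", 80, SYN),  # retry
]
print(unanswered_syns(sw00))
```

Running the same function over the ge00 and sw00 exports and comparing the two sets would show which SYN/ACKs arrived from the internet but never left the wifi interface.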

On Wed, Apr 2, 2014 at 6:48 PM, Stephen Hemminger
stephen@networkplumber.org wrote:
> I am seeing wireless hang as well.
> Mostly when multiple macbooks are active on 2.4g

Updated by David Taht on Apr 5, 2014.
The above subject line and cc will get this conversation into
the bug tracker.

---------- Forwarded message ----------
From: Dave Taht dave.taht@gmail.com
Date: Sat, Apr 5, 2014 at 9:15 AM
Subject: Re: [Cerowrt-devel] cerowrt-3.10.34-4 dev build released
To: Neil Shepperd nshepperd@gmail.com
Cc: “cerowrt-devel@lists.bufferbloat.net” cerowrt-devel@lists.bufferbloat.net

In trying to sort out the differences between the people with
working wifi for long periods, vs those without…

I am curious if your country code is set, and what it is set to,
and your wifi channel set.

It is long past time we start up a formal bug for this,
but I’ll wait for my spacebar.

In a known pretty good case:

root@lorna-gw:~# cat /etc/openwrt_release
DISTRIB_ID="CeroWrt"
DISTRIB_RELEASE="3.10.32-9"
DISTRIB_REVISION="r39917"
DISTRIB_CODENAME="toronto"
DISTRIB_TARGET="ar71xx/generic"
DISTRIB_DESCRIPTION="CeroWrt Toronto 3.10.32-9"
DISTRIB_TAINTS="no-all busybox"

root@lorna-gw:~# uptime
16:07:37 up 21 days, 21:35, load average: 0.00, 0.01, 0.04

root@lorna-gw:~# egrep -i "country|channel|htmode" /etc/config/wireless
option channel 11
option htmode HT20
option channel '44'
option htmode HT40+
option country 'US'

On Sat, Apr 5, 2014 at 9:02 AM, Dave Taht dave.taht@gmail.com wrote:
> On Sat, Apr 5, 2014 at 5:49 AM, Neil Shepperd nshepperd@gmail.com wrote:
>>> Sounds like you are going to stick with -4 for a bit?
>>
>> Actually, this is the first time I’ve tried cerowrt on a router. But
>> yeah, I’ll stick with the current version unless you come out with a new
>> patch to try.
>
> Thx. I am hoping this is the last priority 1 bug cerowrt has.
>
> but fixing it is going to be pita.
>
> I confess to “embedded fatigue”.
>
>>> what I’ve been doing is mounting a usb stick, and just running continuously
>>> on the stick
>>>
>>> tcpdump -s 128 -i ge00 -w ge00.cap &
>>> tcpdump -s 128 -i sw00 -w sw00.cap &
>>>
>>> This definitely hurts performance…
>>>
>>> And it’s probably time to do a tcpdump on the connected device as well.
>>>
>>
>> Update: I did this, and experienced the hang again. A first look at the
>> tcpdump output on sw00 shows a sudden reduction in traffic at 20:40:54,
>> so I assume that’s probably the time of the event. After that, I see
>> many DHCP and ARP requests arriving, but no responses leaving the interface.
>
> It would be nice to see 10sec of these captures before and after.
>
>>
>> In fact, I don’t see anything leaving except, oddly, some DNS responses
>> (which are indeed received by my laptop). I also see some EAPOL stuff on
>> both the router and laptop at roughly the same time, so I guess that’s
>> getting through, but I don’t know the direction.
>>
>> I think next time I’ll try with -Pin/-Pout to separate incoming and
>> outgoing packets properly…
>
> Tis easier to sort in wireshark against one capture, IMHO.
>
> I have been looking for failed syn attempts and retries as a key indicator
> that something Bad happened.
>
>>> Hmm. OK, this brings back the device driver into the equation… I
>>> WAS seeing dhcp and arp requests “getting through” from the captures,
>>> and it seemed like arp in particular was getting through…
>>
>> So I guess this is only half right? What I see in syslog is dnsmasq
>> saying it has sent a packet, but it doesn’t make it onto the interface.
>> Apart from DNS packets, so I don’t know what to make of that.
>
> It is possible there are a variety of failure modes.
>
> I am not entirely convinced this is actually a wifi specific failure.
>
> can you try ssh to the router during a failure, and/or accessing
> the web admin interface? and/or trying to
>
> if you are not using babel, disable it. It makes a lot of updates
> to the routing table. that might be malfunctioning..
>
> (I really need a keyboard that recovers from damp weather.)
>
>> Neil
>
>
>

Updated by David Taht on Apr 5, 2014.
---------- Forwarded message ----------
From: David Personette dperson@gmail.com
Date: Thu, Apr 3, 2014 at 6:26 PM
Subject: Re: [Cerowrt-devel] cerowrt-3.10.34-4 dev build released
To: Maxim Kharlamov mcs@podsolnuh.biz
Cc: Dave Taht dave.taht@gmail.com,
“cerowrt-devel@lists.bufferbloat.net”
cerowrt-devel@lists.bufferbloat.net

I have an OSX laptop on 5GHz, a Linux desktop and server via ethernet,
Linux laptop via 5GHz, Roku via 5GHz, Nexus 7 via 5GHz, and misc other
devices… I didn’t get my total bandwidth on 3.10.32-12, 3.10.34-1,
but I’ve done 3.3GB down 0.9GB up since flashing 3.10.34-4. I’ve had
no problems on any of those builds. It’s been rock solid for me. I
work from home two days a week (Tues and Thurs), wireless connection
via my work OSX laptop. Since the 3.10.x series, I’ve noticed that
WiFi has been noticeably faster. If there is a roll-back of the
kernel, would it be possible to have a fork still with the latest
kernel too… otherwise how will it be known when the issue is fixed,
sorry to be a PitA.

Updated by David Taht on Apr 5, 2014.
---------- Forwarded message ----------
From: Aaron Wood woody77@gmail.com
Date: Fri, Apr 4, 2014 at 12:04 AM
Subject: Re: [Cerowrt-devel] cerowrt-3.10.34-4 dev build released
To: Dave Taht dave.taht@gmail.com
Cc: Maxim Kharlamov mcs@podsolnuh.biz,
“cerowrt-devel@lists.bufferbloat.net”
cerowrt-devel@lists.bufferbloat.net

On Fri, Apr 4, 2014 at 12:58 AM, Dave Taht dave.taht@gmail.com wrote:
>
> On Thu, Apr 3, 2014 at 3:57 PM, Aaron Wood woody77@gmail.com wrote:
> > On Fri, Apr 4, 2014 at 12:56 AM, Aaron Wood woody77@gmail.com wrote:
> >>
> >> Up for 10 days on 3.10.32-12 (WNDR3800). Only have 2 devices that run
> >> 2.4GHz, and it’s only seen 2GB of traffic on SW00 in that time… The 5GHz
> >> radio has had >5GB of traffic on it in the same time. No problems at all.
> >
> >
> > And I also have both 2.4 and 5GHz babel and guest SSIDs all turned off.
> >
> > -Aaron
>
> Your clients are?
>
> So far there seems to be a significant trend towards osx being an issue…

iOS 7 (a pair of iPhone 4’s). Everything that supports 5GHz is using 5GHz.

-Aaron

Updated by David Taht on Apr 5, 2014.
---------- Forwarded message ----------
From: Maxim Kharlamov mcs@podsolnuh.biz
Date: Thu, Apr 3, 2014 at 3:51 PM
Subject: Re: [Cerowrt-devel] cerowrt-3.10.34-4 dev build released
To: Dave Taht dave.taht@gmail.com
Cc: “cerowrt-devel@lists.bufferbloat.net” cerowrt-devel@lists.bufferbloat.net

The last release without wifi issues was 3.8.something (I think it was
called Berlin). The whole 3.10.x branch seems to have broken wifi
(will see how 3.10.24-4 goes, it seems OK, but it’s been working less
than 24hours yet).
I’m using only 2.4Ghz (5Ghz dead in the water - devices couldn’t
connect at all, so I disabled it). Guest and babel disabled.

Regards,
Max

On Fri, Apr 4, 2014 at 11:36 AM, Dave Taht dave.taht@gmail.com wrote:
>
> Is there a recent version that people had that was seemingly stable for
> wifi that we could step back to and bisect from? Something where
> you had heavy wifi use for week(s) without a problem?
>
> (I know that until we got focused on this, and people focused on
> reporting it, that maybe it was happening in releases I’d otherwise
> considered to be “pretty good”… so please report in on your “best”
> releases this year…)
>
> Worst case we can step back to that kernel for a while and proceed forward
> on all the other stuff. I know I crave stability at this point, and I’m
> unhappy that everyone here is unhappy, too…
>
> Regrettably since losing my lab I have not been in a position to easily
> test wifi to any huge extent. I’m slowly building that up (but for example
> no longer have a mac to test with)
>
>
> On Thu, Apr 3, 2014 at 11:20 AM, Neil Shepperd nshepperd@gmail.com wrote:
> > I just flashed 3.10.34-4 to my new WNDR3800 and experienced the exact
> > wifi hang described by Toke Høiland-Jørgensen. But I’m on the 2.4GHz
> > network (with guest and babel disabled). Unfortunately I didn’t think to
> > try tracing anything from the router side before resetting the wireless.
>
> cool you disabled guest and babel. So far we’ve sort of ruled out
> 6in4 tunnelling, and syn flood protection.
>
> Sounds like you are going to stick with -4 for a bit?
>
> what I’ve been doing is mounting a usb stick, and just running continuously
> on the stick
>
> tcpdump -s 128 -i ge00 -w ge00.cap &
> tcpdump -s 128 -i sw00 -w sw00.cap &
>
> This definitely hurts performance…
>
> And it’s probably time to do a tcpdump on the connected device as well.
>
> In terms of other diags… (any suggestions?)
>
> > Syslog was filled with a lot of
> >
> > DHCPDISCOVER (sw00) [MAYBE IP] [MAC ADDRESS]
> > DHCPOFFER (sw00) [IP] [MAC ADDRESS]
>
> Hmm. OK, this brings back the device driver into the equation… I
> WAS seeing dhcp and arp requests “getting through” from the captures,
> and it seemed like arp in particular was getting through…
>
> >
> > but the offers aren’t being received at my laptop.
> >
> > Just another data point I guess.
>
> Well, I’d hoped it would be a confirming one rather than one opening
> up more questions.
>
> > Neil
>
>

Updated by David Taht on Apr 5, 2014.
---------- Forwarded message ----------
From: Jim Gettys jg@freedesktop.org
Date: Thu, Apr 3, 2014 at 8:17 AM
Subject: Re: [Cerowrt-devel] cerowrt-3.10.34-4 dev build released
To: Stephen Hemminger stephen@networkplumber.org
Cc: Dave Taht dave.taht@gmail.com,
“cerowrt-devel@lists.bufferbloat.net”
cerowrt-devel@lists.bufferbloat.net

On Wed, Apr 2, 2014 at 9:48 PM, Stephen Hemminger
stephen@networkplumber.org wrote:
>
> I am seeing wireless hang as well.
> Mostly when multiple macbooks are active on 2.4g
>
>

Also true in my house: both kids are on Macbooks.

But I’ve seen the problem with no-one but me (on Linux) around, so all
that says is that if the router is in use more, you see more failures.
So I’m not sure I can draw much from this experience.

Updated by David Taht on Apr 5, 2014.
In trying to sort this stuff out (been looking at a ton of commits between
3.10.34 and 3.14) I have a few candidates in various parts of the
stack to try to backport.

3.10.34-3.10.36 does not seem to have any relevant patches, but I just
updated to 3.10.36 anyway.

In openwrt head, there has been a problem in dhcpv6 renews, which you
can see on the dhcpv6 web page after a day or so. That looks to be
fixed now.

So I just merged from openwrt head.

I try to be happy that most of our problems are now taking days to crop up.

I will probably produce a topic branch at this point which will have heavy
levels of debugging enabled. I’d like to be able to trace packets from
origin to (non) exit, somehow…

Updated by David Personette on Apr 5, 2014.
I assume that you wanted other people to report their status? Working here:

root@outpost:~# cat /etc/openwrt_release
DISTRIB_ID="CeroWrt"
DISTRIB_RELEASE="3.10.34-4"
DISTRIB_REVISION="r40361"
DISTRIB_CODENAME="toronto"
DISTRIB_TARGET="ar71xx/generic"
DISTRIB_DESCRIPTION="CeroWrt Toronto 3.10.34-4"
DISTRIB_TAINTS="no-all busybox"

root@outpost:~# uptime
22:52:44 up 2 days, 11:16, load average: 0.00, 0.01, 0.04

root@outpost:~# egrep -i "country|channel|htmode" /etc/config/wireless
option channel 11
option htmode HT40-
option channel 36
option htmode HT40+

Updated by Neil Shepperd on Apr 5, 2014.
Experiencing the bug every few days:

root@cerowrt:~# cat /etc/openwrt_release
DISTRIB_ID="CeroWrt"
DISTRIB_RELEASE="3.10.34-4"
DISTRIB_REVISION="r40361"
DISTRIB_CODENAME="toronto"
DISTRIB_TARGET="ar71xx/generic"
DISTRIB_DESCRIPTION="CeroWrt Toronto 3.10.34-4"
DISTRIB_TAINTS="no-all busybox"

root@cerowrt:~# uptime
12:22:42 up 1 day, 18:37, load average: 0.04, 0.04, 0.05

root@cerowrt:~# egrep -i "country|channel|htmode" /etc/config/wireless
option htmode 'HT20'
option country 'AU'
option channel 'auto'
option htmode 'HT20'
option country 'AU'
option channel 'auto'

> It would be nice to see 10sec of these captures before and after.

Uploaded at http://zlkj.in/files/wireshark/. I filtered the captures in
wireshark for frame.time > “April 5, 2014 20:40:44” which is about 10
seconds before the bug. wlan0.cap is the capture from my laptop. ppp.cap
is from the pppoe connection on ge00.

> It is possible there are a variety of failure modes.
>
> I am not entirely convinced this is actually a wifi specific failure.
>
> can you try ssh to the router during a failure, and/or accessing
> the web admin interface? and/or trying to

I can ssh in and access the admin interface if I connect my laptop by an
ethernet cable. But during the failure, I can’t access the admin
interface or the internet over sw00. After resetting sw00 by admin
interface on se00, I can connect over the wireless again.

> if you are not using babel, disable it. It makes a lot of updates
> to the routing table. that might be malfunctioning..

I thought I disabled babel, but I’m still seeing babel packets in the
capture, so I guess disabling the “babel” networks on both radios in the
wifi tab is not enough.

Updated by David Taht on Apr 7, 2014.
If you are lucky enough to have a working iwl or ath9k or otherwise
supported mac80211 card in your laptop and are running linux, install
aircrack-ng, and use the airmon-ng tool to setup a monitoring
interface.

What I’m doing at the moment is capturing the mon0 interface with
wireshark while beating up the network as much as I can. (and trying
to come up with ways to parse the results sanely)

http://wiki.wireshark.org/CaptureSetup/WLAN

There are some instructions for BSD OSX in there too.

There isn’t a way to do this in windows, apparently, without a special device:

http://www.riverbed.com/products-solutions/products/network-performance-management/wireshark-enhancement-products/Wireless-Traffic-Packet-Capture.html

Updated by David Taht on Apr 8, 2014.
Finally found the smoke, from a gun still offstage.

The background wifi queue (fq_codel 40:, parent 1:4) gets wedged.

This explains why this only seemed to happen on comcast (which re-marks
a LOT of traffic as background that it shouldn’t, and yes, we should
start mangling packets back to “be” in sqm as an option), and why local
traffic seemed to mostly work when stuff coming back from the internet
didn’t.

As to why it happens, don’t know. I’m sitting in the #bufferbloat channel
scratching my head as to means to explore the problem without
unwedging the interface.

It seems plausible we can MUCH more easily reproduce this now by flooding
the background queues with traffic (netperf can do this). It’s not clear,
however, whether you can trigger it with just tcp, or whether multiple
hops are required, etc.
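
For anyone without netperf to hand, a stand-in for "flooding the background queue" might look like the sketch below. It is not what was run here: it assumes that marking datagrams CS1 (DSCP 8) is enough to land them in the wifi background queue, and the `flood` helper, host and port are hypothetical. Whether UDP alone triggers the hang is unknown.

```python
import socket

CS1 = 8  # DSCP class commonly treated as background traffic


def tos_for_dscp(dscp: int) -> int:
    """DSCP occupies the top six bits of the IP TOS byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in 0..63")
    return dscp << 2


def make_marked_udp_socket(dscp: int = CS1) -> socket.socket:
    """Create a UDP socket whose outgoing packets carry the given DSCP."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos_for_dscp(dscp))
    return s


def flood(host: str, port: int, count: int, size: int = 1200) -> int:
    """Send `count` marked datagrams; returns bytes handed to the kernel."""
    payload = b"\0" * size
    sent = 0
    with make_marked_udp_socket() as s:
        for _ in range(count):
            sent += s.sendto(payload, (host, port))
    return sent
```

Calling something like `flood("172.30.42.10", 9, 100000)` from a wired host towards a wifi station (address hypothetical) would push marked traffic out through the router's wifi interface.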

root@cerowrt:/mnt/disk1# tc -s qdisc show dev sw00
qdisc mq 1: root
Sent 3926131082 bytes 2998293 pkt (dropped 91657, overlimits 0 requeues 70095)
backlog 77608b 1000p requeues 70095
qdisc fq_codel 10: parent 1:1 limit 800p flows 1024 quantum 500 target 10.0ms interval 100.0ms
Sent 110555 bytes 771 pkt (dropped 0, overlimits 0 requeues 5)
backlog 0b 0p requeues 5
maxpacket 256 drop_overlimit 0 new_flow_count 2 ecn_mark 0
new_flows_len 0 old_flows_len 0
qdisc fq_codel 20: parent 1:2 limit 800p flows 1024 quantum 300 target 5.0ms interval 100.0ms ecn
Sent 2526448 bytes 17982 pkt (dropped 1, overlimits 0 requeues 31)
backlog 0b 0p requeues 31
maxpacket 929 drop_overlimit 0 new_flow_count 71 ecn_mark 0
new_flows_len 0 old_flows_len 0
qdisc fq_codel 30: parent 1:3 limit 1000p flows 1024 quantum 300 target 5.0ms interval 100.0ms ecn
Sent 15145657 bytes 106290 pkt (dropped 0, overlimits 0 requeues 179)
backlog 0b 0p requeues 179
maxpacket 256 drop_overlimit 0 new_flow_count 0 ecn_mark 0
new_flows_len 0 old_flows_len 0
qdisc fq_codel 40: parent 1:4 limit 1000p flows 1024 quantum 300 target 5.0ms interval 100.0ms
Sent 3908348422 bytes 2873250 pkt (dropped 91656, overlimits 0 requeues 69880)
backlog 77608b 1000p requeues 69880
maxpacket 1514 drop_overlimit 72128 new_flow_count 85727 ecn_mark 0
new_flows_len 238 old_flows_len 1

I got the “wedged” interface to work again by re-marking all tcp traffic
as best effort:

iptables -A FORWARD -o sw00 -t mangle -p tcp -m tcp -j DSCP --set-dscp-class be

thus moving traffic into 1:3 above.

(can probably improve on this iptables thing, but it’s just a
workaround and for all I know we can also trigger this on the be
queue)
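
Why re-marking moves traffic between queues can be sketched as a lookup. The table below follows what I understand to be mac80211's 802.1d-priority-to-access-category mapping, with the user priority taken from the top three DSCP bits; treat that, and the VO/VI class numbers (only 1:3 = BE and 1:4 = BK are confirmed by the output in this report), as assumptions rather than something verified against this kernel.

```python
# 802.1d user priority -> WMM access category (assumed mac80211 mapping).
UP_TO_AC = ["BE", "BK", "BK", "BE", "VI", "VI", "VO", "VO"]

# mq class per access category. BE and BK match the tc output in this
# report; the VO and VI entries are an assumption.
AC_TO_MQ_CLASS = {"VO": "1:1", "VI": "1:2", "BE": "1:3", "BK": "1:4"}


def access_category(dscp: int) -> str:
    """Map a 6-bit DSCP value to a wifi access category."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in 0..63")
    return UP_TO_AC[dscp >> 3]  # top three DSCP bits = user priority


def mq_class(dscp: int) -> str:
    """Return the mq class a packet with this DSCP would be queued on."""
    return AC_TO_MQ_CLASS[access_category(dscp)]


for name, dscp in [("BE/CS0", 0), ("CS1", 8), ("EF", 46), ("CS6", 48)]:
    print(name, access_category(dscp), mq_class(dscp))
```

Under these assumptions CS1-marked traffic lands in BK (1:4), and resetting the DSCP to "be" moves it to 1:3, which matches what the workaround did.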

icmp replies, however, seem to always want to go into the background
queue for some reason. (?)

We did have this happen earlier on this run

[31325.589843] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32380.960937] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32381.035156] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32381.140625] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32381.242187] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32381.343750] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32418.824218] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32445.863281] ath: phy0: Failed to stop TX DMA, queues=0x108!
[32445.960937] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32446.062500] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32446.164062] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32446.265625] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32446.367187] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32446.472656] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32446.574218] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32446.683593] ath: phy0: Failed to stop TX DMA, queues=0x00c!
[32446.777343] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32446.886718] ath: phy0: Failed to stop TX DMA, queues=0x009!
[34701.062500] ath: phy0: Failed to stop TX DMA, queues=0x008!
[34701.140625] ath: phy0: Failed to stop TX DMA, queues=0x008!
[34701.242187] ath: phy0: Failed to stop TX DMA, queues=0x008!
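
The `queues=` value in these messages looks like a bitmask of hardware queue numbers. A small decoder, assuming each set bit is the qnum of a queue that still had frames pending, and using the qnum-to-name mapping from the ath9k "queues" debug file quoted elsewhere in this bug (VO 0, VI 1, BE 2, BK 3, CAB 8):

```python
import re

# qnum -> name, as printed by the ath9k "queues" debugfs file.
QNUM_NAMES = {0: "VO", 1: "VI", 2: "BE", 3: "BK", 8: "CAB"}


def decode_queues(mask: int):
    """Return the names of the queues whose bit is set in the mask."""
    return [
        QNUM_NAMES.get(bit, f"q{bit}")
        for bit in range(mask.bit_length())
        if (mask >> bit) & 1
    ]


def queues_in_log(line: str):
    """Extract and decode the queues= field of one kernel log line."""
    m = re.search(r"queues=0x([0-9a-fA-F]+)", line)
    return decode_queues(int(m.group(1), 16)) if m else []


print(queues_in_log("ath: phy0: Failed to stop TX DMA, queues=0x008!"))
print(queues_in_log("ath: phy0: Failed to stop TX DMA, queues=0x108!"))
```

If the bitmask reading is right, almost every message above points at qnum 3, the background queue.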

Updated by Dave Täht on Apr 8, 2014.
btw we do also have unaligned instructions still:

root@cerowrt:/sys/kernel/debug/mips# cat unaligned_instructions
1154

and we are also using a very short qlen_be and qlen_bk = 12

and the debloat script tosses stuff on mq’s queues 1:1,1:2,1:3,1:4 rather than the default and invisible mq 0:1, etc.

While saturating the be queue with a couple netperfs, I get:

root@cerowrt:/sys/kernel/debug/ieee80211/phy0/netdev:sw00/stations/00:15:6d:84:b3:00# cat rc_stats
type rate throughput ewma prob this prob retry this succ/attempt success attempts
CCK/LP 1.0M 0.7 96.3 100.0 0 0( 0) 973 1003
CCK/SP 2.0M 1.5 100.0 100.0 0 0( 0) 1 1
CCK/SP 5.5M 3.8 100.0 100.0 0 0( 0) 1 1
CCK/SP 11.0M 6.1 100.0 100.0 0 0( 0) 1 1
HT20/LGI MCS0 5.7 100.0 100.0 3 0( 0) 1 1
HT20/LGI MCS1 11.5 95.7 100.0 0 0( 0) 12 13
HT20/LGI MCS2 16.7 100.0 100.0 0 0( 0) 1 1
HT20/LGI MCS3 21.9 100.0 100.0 0 0( 0) 1 1
HT20/LGI MCS4 31.5 100.0 100.0 0 0( 0) 1 1
HT20/LGI MCS5 40.1 100.0 100.0 0 0( 0) 1 1
HT20/LGI MCS6 44.0 96.2 100.0 0 0( 0) 18 20
HT20/LGI MCS7 48.8 100.0 100.0 0 0( 0) 1 1
HT20/LGI MCS8 11.5 100.0 100.0 0 0( 0) 1 1
HT20/LGI MCS9 21.9 100.0 100.0 0 0( 0) 1 1
HT20/LGI MCS10 31.5 100.0 100.0 0 0( 0) 1 1
HT20/LGI MCS11 40.1 100.0 100.0 0 0( 0) 1 1
HT20/LGI MCS12 56.1 100.0 100.0 0 0( 0) 1 1
HT20/LGI MCS13 68.0 95.6 100.0 0 0( 0) 16 18
HT20/LGI MCS14 74.9 100.0 100.0 0 0( 0) 1 1
HT20/LGI MCS15 80.2 100.0 100.0 6 0( 0) 1 1
HT40/LGI MCS0 11.9 100.0 100.0 0 0( 0) 1 1
HT40/LGI MCS1 22.8 100.0 100.0 0 0( 0) 1 1
HT40/LGI MCS2 32.5 100.0 100.0 0 0( 0) 1 1
HT40/LGI MCS3 41.6 100.0 100.0 0 0( 0) 1 1
HT40/LGI MCS4 57.6 100.0 100.0 0 0( 0) 1 1
HT40/LGI MCS5 70.1 100.0 100.0 0 0( 0) 1 1
HT40/LGI MCS6 77.5 100.0 100.0 0 0( 0) 1 1
HT40/LGI MCS7 83.3 96.0 100.0 5 0( 0) 301 319
HT40/LGI MCS8 22.8 100.0 100.0 0 0( 0) 1 1
HT40/LGI MCS9 41.6 100.0 100.0 0 0( 0) 1 1
HT40/LGI MCS10 57.6 100.0 100.0 0 0( 0) 1 1
HT40/LGI MCS11 70.1 100.0 100.0 0 0( 0) 1 1
HT40/LGI MCS12 93.6 99.7 100.0 6 0( 0) 45 46
HT40/LGI MCS13 107.1 95.6 100.0 5 0( 0) 2743 3151
HT40/LGI t MCS14 118.4 92.6 93.4 6 172(184) 53259 64221
HT40/LGI MCS15 94.1 67.8 100.0 6 2( 2) 29077 40909
HT40/SGI MCS0 13.2 100.0 100.0 0 0( 0) 1 1
HT40/SGI MCS1 25.1 100.0 100.0 0 0( 0) 1 1
HT40/SGI MCS2 35.5 100.0 100.0 0 0( 0) 1 1
HT40/SGI MCS3 45.1 100.0 100.0 0 0( 0) 1 1
HT40/SGI MCS4 62.1 100.0 100.0 0 0( 0) 1 1
HT40/SGI MCS5 75.2 100.0 100.0 0 0( 0) 1 1
HT40/SGI MCS6 82.7 100.0 100.0 5 0( 0) 9 9
HT40/SGI P MCS7 88.5 98.5 100.0 5 0( 0) 967 1145
HT40/SGI MCS8 25.1 100.0 100.0 0 0( 0) 1 1
HT40/SGI MCS9 45.1 100.0 100.0 0 0( 0) 1 1
HT40/SGI MCS10 62.1 100.0 100.0 0 0( 0) 1 1
HT40/SGI MCS11 75.2 100.0 100.0 0 0( 0) 1 1
HT40/SGI MCS12 99.0 96.6 100.0 5 0( 0) 821 876
HT40/SGI MCS13 112.4 95.7 100.0 6 1( 1) 88413 94617
HT40/SGI MCS14 86.6 63.1 0.0 6 0( 1) 715161 805305
HT40/SGI T MCS15 122.6 84.8 100.0 6 1( 1) 98281 127685

Updated by Dave Täht on Apr 8, 2014.
And interestingly, after disabling the bk queue with iptables, and waiting a while, the 1000 packet backlog cleared.

qdisc mq 1: root
Sent 6376179748 bytes 5429375 pkt (dropped 92662, overlimits 0 requeues 98880)
backlog 0b 0p requeues 98880
qdisc fq_codel 10: parent 1:1 limit 800p flows 1024 quantum 500 target 10.0ms interval 100.0ms
Sent 115759 bytes 807 pkt (dropped 0, overlimits 0 requeues 5)
backlog 0b 0p requeues 5
maxpacket 256 drop_overlimit 0 new_flow_count 2 ecn_mark 0
new_flows_len 0 old_flows_len 0
qdisc fq_codel 20: parent 1:2 limit 800p flows 1024 quantum 300 target 5.0ms interval 100.0ms ecn
Sent 3053074 bytes 25673 pkt (dropped 1, overlimits 0 requeues 38)
backlog 0b 0p requeues 38
maxpacket 929 drop_overlimit 0 new_flow_count 73 ecn_mark 0
new_flows_len 0 old_flows_len 0
qdisc fq_codel 30: parent 1:3 limit 1000p flows 1024 quantum 300 target 5.0ms interval 100.0ms ecn
Sent 2464586793 bytes 2528666 pkt (dropped 947, overlimits 0 requeues 28957)
backlog 0b 0p requeues 28957
maxpacket 1514 drop_overlimit 0 new_flow_count 82547 ecn_mark 1
new_flows_len 0 old_flows_len 1
qdisc fq_codel 40: parent 1:4 limit 1000p flows 1024 quantum 300 target 5.0ms interval 100.0ms
Sent 3908424122 bytes 2874229 pkt (dropped 91714, overlimits 0 requeues 69880)
backlog 0b 0p requeues 69880
maxpacket 1514 drop_overlimit 72166 new_flow_count 85740 ecn_mark 0
new_flows_len 1 old_flows_len 251
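
Comparing successive `tc -s qdisc` dumps by eye gets tedious. A rough parser that pulls out the backlog per qdisc, written against the output format shown in this bug (other tc versions may print sizes differently, so this is a sketch, not a general tool; the abbreviated `SAMPLE` is cut down from the first dump):

```python
import re


def backlog_by_qdisc(text: str) -> dict:
    """Map 'kind handle' -> (backlog bytes, backlog packets)."""
    result, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        m = re.match(r"qdisc (\S+) (\S+:)", line)
        if m:
            current = f"{m.group(1)} {m.group(2)}"
            continue
        m = re.match(r"backlog (\d+)b (\d+)p", line)
        if m and current:
            result[current] = (int(m.group(1)), int(m.group(2)))
    return result


def backlogged(text: str):
    """Names of qdiscs that currently hold packets."""
    return [k for k, (_, pkts) in backlog_by_qdisc(text).items() if pkts > 0]


SAMPLE = """\
qdisc mq 1: root
backlog 77608b 1000p requeues 70095
qdisc fq_codel 30: parent 1:3 limit 1000p flows 1024 quantum 300
backlog 0b 0p requeues 179
qdisc fq_codel 40: parent 1:4 limit 1000p flows 1024 quantum 300
backlog 77608b 1000p requeues 69880
"""
print(backlogged(SAMPLE))
```

A backlog that sits at the qdisc limit (1000p here) across several samples is the pattern seen in the wedged case.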

Updated by Dave Täht on Apr 8, 2014.
(01:32:44 PM) dtaht_nuc: qdisc fq_codel 40: parent 1:4 limit 1000p flows 1024 quantum 300 target 5.0ms interval 100.0ms
(01:32:44 PM) dtaht_nuc: Sent 3908424122 bytes 2874229 pkt (dropped 91714, overlimits 0 requeues 69880)
(01:32:44 PM) dtaht_nuc: backlog 0b 0p requeues 69880
(01:32:44 PM) dtaht_nuc: maxpacket 1514 drop_overlimit 72166 new_flow_count 85740 ecn_mark 0
(01:32:44 PM) dtaht_nuc: new_flows_len 1 old_flows_len 251
(01:40:01 PM) dtaht_nuc: ok, so I just tried a limited exercise of the bk queue
(01:40:10 PM) dtaht_nuc: it is indeed still wedged after it cleared
(01:40:11 PM) dtaht_nuc: qdisc fq_codel 40: parent 1:4 limit 1000p flows 1024 quantum 300 target 5.0ms interval 100.0ms
(01:40:11 PM) dtaht_nuc: Sent 3908424196 bytes 2874230 pkt (dropped 91714, overlimits 0 requeues 69881)
(01:40:11 PM) dtaht_nuc: backlog 518b 8p requeues 69881
(01:40:11 PM) dtaht_nuc: maxpacket 1514 drop_overlimit 72166 new_flow_count 85741 ecn_mark 0
(01:40:11 PM) dtaht_nuc: new_flows_len 2 old_flows_len 251
(01:40:46 PM) dtaht_nuc: might be wedged on input too
(01:46:55 PM) dtaht_nuc: and lastly, resetting the qdisc does NOT fix the problem
Updated by Dave Täht on Apr 8, 2014.
root@cerowrt:/sys/kernel/debug/ieee80211/phy0/ath9k# cat xmit
BE BK VI VO

MPDUs Queued: 50 2314 223 652400
MPDUs Completed: 40281 85321 8731 616194
MPDUs XRetried: 242 2644 350 36901
Aggregates: 783024 486368 202 0
AMPDUs Queued HW: 0 0 0 0
AMPDUs Queued SW: 4456600 2875990 60600 695
AMPDUs Completed: 4416029 2786190 51561 0
AMPDUs Retried: 253185 110105 1007 0
AMPDUs XRetried: 96 3990 181 0
TXERR Filtered: 70 5281 169 77
FIFO Underrun: 0 0 0 0
TXOP Exceeded: 0 0 0 0
TXTIMER Expiry: 0 0 0 0
DESC CFG Error: 0 0 0 0
DATA Underrun: 0 0 0 0
DELIM Underrun: 0 0 0 0
TX-Pkts-All: 4456648 2878145 60823 653095
TX-Bytes-All: 457325109 4000133941 7212777 113292607
HW-put-tx-buf: 334 188 128 330
HW-tx-start: 1385587 1596997 61534 653095
HW-tx-proc-desc: 1385547 1612853 61531 653067
TX-Failed: 0 0 0 0
root@cerowrt:/sys/kernel/debug/ieee80211/phy0/ath9k# cat xmit
BE BK VI VO

MPDUs Queued: 50 2314 223 652400
MPDUs Completed: 40286 85321 8731 616194
MPDUs XRetried: 242 2644 350 36901
Aggregates: 784312 486368 202 0
AMPDUs Queued HW: 0 0 0 0
AMPDUs Queued SW: 4464165 2875990 60600 695
AMPDUs Completed: 4423589 2786190 51561 0
AMPDUs Retried: 253680 110105 1007 0
AMPDUs XRetried: 96 3990 181 0
TXERR Filtered: 70 5281 169 77
FIFO Underrun: 0 0 0 0
TXOP Exceeded: 0 0 0 0
TXTIMER Expiry: 0 0 0 0
DESC CFG Error: 0 0 0 0
DATA Underrun: 0 0 0 0
DELIM Underrun: 0 0 0 0
TX-Pkts-All: 4464213 2878145 60823 653095
TX-Bytes-All: 468998281 4000133941 7212777 113292607
HW-put-tx-buf: 334 188 128 330
HW-tx-start: 1388889 1596997 61534 653095
HW-tx-proc-desc: 1388849 1612853 61531 653067
TX-Failed: 0 0 0 0
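
The two `xmit` dumps above differ only in the BE column. A sketch that diffs two such dumps, assuming the four numeric fields on each line are in the BE BK VI VO order of the header (the samples are cut down from the dumps above):

```python
COLUMNS = ("BE", "BK", "VI", "VO")


def parse_xmit(text: str) -> dict:
    """Map counter label -> {queue: value} for lines with four integers."""
    counters = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and len(fields) == 4 and all(f.isdigit() for f in fields):
            counters[label.strip()] = dict(zip(COLUMNS, map(int, fields)))
    return counters


def unchanged_queues(before: str, after: str, key: str = "TX-Pkts-All"):
    """Queues whose counter did not advance between the two samples."""
    a, b = parse_xmit(before)[key], parse_xmit(after)[key]
    return [q for q in COLUMNS if a[q] == b[q]]


BEFORE = """\
BE BK VI VO
MPDUs Completed: 40281 85321 8731 616194
TX-Pkts-All: 4456648 2878145 60823 653095
"""
AFTER = """\
BE BK VI VO
MPDUs Completed: 40286 85321 8731 616194
TX-Pkts-All: 4464213 2878145 60823 653095
"""
print(unchanged_queues(BEFORE, AFTER))
```

An unchanged counter only says that nothing was transmitted in between; an idle queue and a wedged one look the same here, so this is only meaningful while traffic is being offered to the queue in question.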

Updated by Dave Täht on Apr 8, 2014.
root@cerowrt:/sys/kernel/debug/ieee80211/phy0/ath9k# cat queues
(VO): qnum: 0 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(VI): qnum: 1 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BE): qnum: 2 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BK): qnum: 3 qdepth: 0 ampdu-depth: 0 pending: 151 stopped: 1
(CAB): qnum: 8 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
Updated by Dave Täht on Apr 8, 2014.
It is possible to crash the BE queue too

root@cerowrt:~# tc -s qdisc show dev sw10
qdisc mq 1: root
Sent 3852715919 bytes 2982888 pkt (dropped 5360, overlimits 0 requeues 55107)
backlog 99468b 1000p requeues 55107
qdisc fq_codel 10: parent 1:1 limit 800p flows 1024 quantum 500 target 10.0ms interval 100.0ms
Sent 41188 bytes 292 pkt (dropped 0, overlimits 0 requeues 1)
backlog 0b 0p requeues 1
maxpacket 256 drop_overlimit 0 new_flow_count 0 ecn_mark 0
new_flows_len 0 old_flows_len 0
qdisc fq_codel 20: parent 1:2 limit 800p flows 1024 quantum 300 target 5.0ms interval 100.0ms ecn
Sent 1792325 bytes 8919 pkt (dropped 0, overlimits 0 requeues 22)
backlog 0b 0p requeues 22
maxpacket 1514 drop_overlimit 0 new_flow_count 19 ecn_mark 0
new_flows_len 0 old_flows_len 0
qdisc fq_codel 30: parent 1:3 limit 1000p flows 1024 quantum 300 target 5.0ms interval 100.0ms ecn
Sent 1537736330 bytes 1266113 pkt (dropped 2479, overlimits 0 requeues 19919)
backlog 99468b 1000p requeues 19919
maxpacket 1514 drop_overlimit 710 new_flow_count 16535 ecn_mark 14
new_flows_len 71 old_flows_len 1
qdisc fq_codel 40: parent 1:4 limit 1000p flows 1024 quantum 300 target 5.0ms interval 100.0ms
Sent 2313146076 bytes 1707564 pkt (dropped 2881, overlimits 0 requeues 35165)
backlog 0b 0p requeues 35165
maxpacket 1514 drop_overlimit 0 new_flow_count 22111 ecn_mark 0
new_flows_len 0 old_flows_len 0

Updated by Dave Täht on Apr 8, 2014.
And this time, we are stopped at 12, which is also what qlen_be is set to

root@cerowrt:/sys/kernel/debug/ieee80211/phy1/ath9k# cat queues
(VO): qnum: 0 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(VI): qnum: 1 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BE): qnum: 2 qdepth: 0 ampdu-depth: 0 pending: 12 stopped: 1
(BK): qnum: 3 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(CAB): qnum: 8 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0

Updated by David Taht on Apr 9, 2014.
See also: http://www.bufferbloat.net/issues/442#note-16

1) It’s still uncertain that we have only been dealing with one wireless bug…

…but we can narrow down the one jg was seeing: if, after a failure
happens, you can log in on another radio or via ethernet and you
see frames “pending” that stay pending in the “queues” debug file:

root@comcast-gw:~# cat /sys/kernel/debug/ieee80211/phy*/ath9k/queues

(VO): qnum: 0 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(VI): qnum: 1 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BE): qnum: 2 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BK): qnum: 3 qdepth: 0 ampdu-depth: 0 pending: 151 stopped: 0
(CAB): qnum: 8 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(VO): qnum: 0 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(VI): qnum: 1 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BE): qnum: 2 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BK): qnum: 3 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(CAB): qnum: 8 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0

you’ve hit the bug.

Nothing short of a reboot will clear it, presently. Felix is looking into it.
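
The "pending that stays pending" check can be scripted by taking two copies of the queues file some seconds apart and comparing them. A sketch, assuming the `phy*` glob concatenates one block per radio and that each block starts with the VO line; the radio index it reports is simply the position in the output:

```python
import re

LINE = re.compile(
    r"\((\w+)\): qnum: (\d+) qdepth: (\d+) ampdu-depth: (\d+) "
    r"pending: (\d+) stopped: (\d+)"
)


def parse_queues(text: str):
    """Return one {queue name: pending count} dict per radio."""
    radios = []
    for m in LINE.finditer(text):
        name, pending = m.group(1), int(m.group(5))
        if name == "VO" or not radios:  # a VO line starts a new radio
            radios.append({})
        radios[-1][name] = pending
    return radios


def still_pending(first: str, second: str):
    """(radio index, queue) pairs with frames pending in both samples."""
    hits = []
    for idx, (a, b) in enumerate(zip(parse_queues(first), parse_queues(second))):
        for name, pending in a.items():
            if pending > 0 and b.get(name, 0) > 0:
                hits.append((idx, name))
    return hits


SAMPLE = """\
(VO): qnum: 0 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(VI): qnum: 1 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BE): qnum: 2 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BK): qnum: 3 qdepth: 0 ampdu-depth: 0 pending: 151 stopped: 0
(CAB): qnum: 8 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(VO): qnum: 0 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BK): qnum: 3 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
"""
print(still_pending(SAMPLE, SAMPLE))
```

A briefly non-zero pending count is normal under load, which is why two samples are compared rather than one.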

In the interim there are two things you can do to make hitting it a
LOT more difficult; at least so far, in 20+ hours of testing, we
haven’t hit it again.

A) Stop reducing qlen_be, qlen_bk, qlen_vi, & qlen_vo.

comment out line 1977 of /usr/sbin/debloat

local function wireless(model)
   print(model)
   if WCALLBACKS[model] ~= nil then
      -- wireless_qlen() -- comment out this call
      return WCALLBACKS[model]()
   else
      usage("AQM model not found")
   end
   return nil
end

and reboot.

This will return the qlen’s to very large values that are nearly
impossible to hit.

While this will have a negative effect on latency, it will improve
single station bandwidth somewhat, and make it much harder to hang the
queue. (I think/hope)

I will argue - at this point - it is better to have a slower box that
stays up for weeks than one that has core functionality crash after a
few hours or days.

Those of you that have been experiencing the wifi hangs, please make
this change and check in daily?

If anyone has a hang, please post the ath9k queues status as per above,
and tc -s qdisc output, to bug 442.

B) Mash incoming diffserv traffic down to BE only.

I have some patches almost ready for sqm-scripts for this, partially tested.

I’ve pushed them to the ceropackages github repository for review and testing.

see commit log message here.

https://github.com/dtaht/ceropackages-3.10/commit/27eed160a67700caae85a4c8b3fff0eaa990cd27

I am pretty sure only fix “A” is needed to work around the bug here.

B might make bi-directional over-the-internet-through-wifi tests
work better, in that the BE queue is used more often, but both hacks
are in place on the box we’re testing.
Updated by David Taht on Apr 9, 2014.
———- Forwarded message ———-
From: Dave Taht dave.taht@gmail.com
Date: Wed, Apr 9, 2014 at 2:41 PM
Subject: [Bug #442] Possible workaround for the wireless hangs
To: cerowrt@lists.bufferbloat.net

See also: http://www.bufferbloat.net/issues/442#note-16

1) It’s still uncertain that we have only been dealing with one wireless bug…

…but we can narrow down what jg was seeing: if, after a failure
happens, you can log in on another radio or via ethernet and you
see frames “pending”, that stay pending, in
the “queues” debug file:

root@comcast-gw:~# cat /sys/kernel/debug/ieee80211/phy*/ath9k/queues

(VO): qnum: 0 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(VI): qnum: 1 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BE): qnum: 2 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BK): qnum: 3 qdepth: 0 ampdu-depth: 0 pending: 151 stopped: 0
(CAB): qnum: 8 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(VO): qnum: 0 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(VI): qnum: 1 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BE): qnum: 2 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BK): qnum: 3 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(CAB): qnum: 8 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0

you’ve hit the bug.

Updated by David Taht on Apr 11, 2014.
I think jim hits it soonest (this time in 36 hours) because he has a
family of geeks and a 100Mbit connection from the internet. Now that
the debloat script is changed to not muck with the defaults, this
definitely looks like a bug upstream in the linux kernel.

I also note that I thought I’d squashed dscp to BE in the 3.10.36-4
SQM simplest.qos AND simple.qos code, but was very tired that day and
probably missed something. Not that that helps - we managed to lock up
the BE queue last time too.

I don’t know if the number of stations matters or the number of macs
matters, or not. I will start even longer generation tests with more
stations as soon as I can, but I’m kind of wiped out right now.

———- Forwarded message ———-
From: Jim Gettys jg@freedesktop.org
Date: Fri, Apr 11, 2014 at 11:20 AM
Subject: Re: [Cerowrt-devel] cerowrt-3.10.36-4 released
To: Dave Taht dave.taht@gmail.com

Unfortunately, the bug has recurred after a day and a half.

root@cerowrt:/sys/kernel/debug/ieee80211/phy0/ath9k# cat queues
(VO): qnum: 0 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(VI): qnum: 1 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BE): qnum: 2 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BK): qnum: 3 qdepth: 0 ampdu-depth: 0 pending: 278 stopped: 1
(CAB): qnum: 8 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0

Updated by David Taht on Apr 14, 2014.
So what we are seeing here is some sort of problem with accounting
for pending_frames on a given queue. And since we’ve had a reminder
of how useful code review can be - and I’d rather like to understand
this logic anyway -

// my comments in //

/* Upon failure caller should free skb */
int ath_tx_start(struct ieee80211_hw *hw, struct sk_buff *skb,
struct ath_tx_control *txctl)
{
struct ieee80211_hdr *hdr;
struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
struct ieee80211_sta *sta = txctl->sta;
struct ieee80211_vif *vif = info->control.vif;
struct ath_softc *sc = hw->priv;
struct ath_txq *txq = txctl->txq;
struct ath_atx_tid *tid = NULL;
struct ath_buf *bf;
int q;
int ret;

ret = ath_tx_prepare(hw, skb, txctl);
if (ret)
return ret;

hdr = (struct ieee80211_hdr *) skb->data;
/*
* At this point, the vif, hw_key and sta pointers in the tx control
* info are no longer valid (overwritten by the ath_frame_info data.
*/

// I haven’t looked at what skb_get_queue_mapping can return yet

q = skb_get_queue_mapping(skb);

ath_txq_lock(sc, txq);
if (txq == sc->tx.txq_map[q] &&
++txq->pending_frames > sc->tx.txq_max_pending[q] &&
!txq->stopped) {
ieee80211_stop_queue(sc->hw, q);
txq->stopped = true;

// is there a difference between stopped and sleeping?

}

// So if the queue is not mapped properly we don’t increment pending
// frames. Also we are dependent on C processing the if left to right,
// which is a good assumption, but it leaves the ++txq as a side effect

if (txctl->an && ieee80211_is_data_present(hdr->frame_control))
tid = ath_get_skb_tid(sc, txctl->an, skb);

if (info->flags & IEEE80211_TX_CTL_PS_RESPONSE) {
ath_txq_unlock(sc, txq);
txq = sc->tx.uapsdq;

// So here we have a bit of state that changes after we’ve got some pending
// state above that’s been changed. I imagine this lock could stay unlocked
// for a while and lead to races elsewhere.
// haven’t a clue what tx.uapsdq is

ath_txq_lock(sc, txq);
} else if (txctl->an &&
ieee80211_is_data_present(hdr->frame_control)) {
WARN_ON(tid->ac->txq != txctl->txq);

if (info->flags & IEEE80211_TX_CTL_CLEAR_PS_FILT)
tid->ac->clear_ps_filter = true;

/*
* Add this frame to software queue for scheduling later
* for aggregation.
*/
TX_STAT_INC(txq->axq_qnum, a_queued_sw);
__skb_queue_tail(&tid->buf_q, skb);
if (!txctl->an->sleeping)
ath_tx_queue_tid(txq, tid);
// so if we’re not sleeping, queue it up
// and regardless if we’re sleeping or not, schedule it
ath_txq_schedule(sc, txq);
goto out;
}

// So if data is not present OR txctl->an is invalid OR
// IEEE80211_TX_CTL_PS_RESPONSE is set in flags
// we fall through to here.

bf = ath_tx_setup_buffer(sc, txq, tid, skb);

// if we fell through to here, tid can be null unless data was present

if (!bf) {
ath_txq_skb_done(sc, txq, skb);
if (txctl->paprd)
dev_kfree_skb_any(skb);
else
ieee80211_free_txskb(sc->hw, skb);
goto out;
}

// Well, I note that we incremented the frames earlier in some cases
// should they be decremented above?

bf->bf_state.bfs_paprd = txctl->paprd;

if (txctl->paprd)
bf->bf_state.bfs_paprd_timestamp = jiffies;

ath_set_rates(vif, sta, bf);
ath_tx_send_normal(sc, txq, tid, skb);

// Not clear as to why you set_rates here, and I assume tx_send_normal
// sends a non-aggregate

out:
ath_txq_unlock(sc, txq);

return 0;
}

Updated by David Taht on Apr 14, 2014.
———- Forwarded message ———-
From: Ben Greear greearb@candelatech.com
Date: Mon, Apr 14, 2014 at 8:49 PM
Subject: Re: [ath9k-devel] ath9k queue hang
To: Dave Taht dave.taht@gmail.com, “ath9k-devel@lists.ath9k.org”
ath9k-devel@venema.h4ckr.net

On 04/14/2014 07:16 PM, Dave Taht wrote:
>
> We have been trying to replicate a bug in seeing wifi connections hanging
> in strange ways after tons of data is transferred… for several months now.
>
> The symptoms varied, anything from multicast failing to background or best
> effort traffic failing - from local access working with remote access
> not working…
>
> Last week, we finally got a situation where we had enough debugging on to see
> something that matches the symptoms we saw, in that one of the wifi queues
> would hang and leave the overlying qdisc full of packets that didn’t drain.

Sounds familiar…I had a relatively clean patch in the 3.9 days, but had some
issues merging along the way and haven’t bothered to rebase it, so patch is
not as clean as it used to be:

http://dmz2.candelatech.com/git/?p=linux-3.14.dev.y/.git;a=commitdiff;h=a34e34f46fbffc627dfc2d93c508f580fbaf29e2;hp=cce0d841338348c69ae6f7ef1b2bc8a6abea3fc4

http://dmz2.candelatech.com/git/?p=linux-3.14.dev.y/.git;a=commitdiff;h=3ecefa9c9f7eed21002dad7a6540d6d250297466;hp=134543c6fec7e28bf91272ce995b550b1bf73c62

I posted the patch to the mailing lists some time back, maybe a year or two ago.

If I recall, we could reproduce our problem fairly reliably by
stepping an attenuator in 10 dB steps while under load.

I’d be curious to know if you try it out and it works for you…

Thanks,
Ben

Updated by David Taht on Apr 15, 2014.
This ended up on the wrong bug.

On Tue, Apr 15, 2014 at 11:47 AM, cerowrt@lists.bufferbloat.net wrote:
>
> Issue #422 has been updated by Felix Fietkau.
>
>
> On 2014-04-15 06:06, Dave Taht wrote:
>> regrettably I am too wiped to look this over further right now, but the patchset
>> seems very promising.
>>
>> I will review on a fresh brain in the morning. Other eyeballs desired
>> - this will have to get patched on top of 3.14 and then backported to
>> the 3.10 backport….
> The patch is a rather crude workaround which unfortunately will not
> help with narrowing down the cause. Also, doing a chip reset because a
> software queue is stuck is overkill.
>
> Please test if this patch helps. The tid->paused flag is no longer
> necessary since my rework of the tx path.
> ---
> --- a/drivers/net/wireless/ath/ath9k/ath9k.h
> +++ b/drivers/net/wireless/ath/ath9k/ath9k.h
> @@ -254,7 +254,6 @@ struct ath_atx_tid {
>
> s8 bar_index;
> bool sched;
> - bool paused;
> bool active;
> };
>
> --- a/drivers/net/wireless/ath/ath9k/xmit.c
> +++ b/drivers/net/wireless/ath/ath9k/xmit.c
> @@ -107,9 +107,6 @@ static void ath_tx_queue_tid(struct ath_
> {
> struct ath_atx_ac *ac = tid->ac;
>
> - if (tid->paused)
> - return;
> -
> if (tid->sched)
> return;
>
> @@ -1407,7 +1404,6 @@ int ath_tx_aggr_start(struct ath_softc *
> ath_tx_tid_change_state(sc, txtid);
>
> txtid->active = true;
> - txtid->paused = true;
> *ssn = txtid->seq_start = txtid->seq_next;
> txtid->bar_index = -1;
>
> @@ -1427,7 +1423,6 @@ void ath_tx_aggr_stop(struct ath_softc *
>
> ath_txq_lock(sc, txq);
> txtid->active = false;
> - txtid->paused = false;
> ath_tx_flush_tid(sc, txtid);
> ath_tx_tid_change_state(sc, txtid);
> ath_txq_unlock_complete(sc, txq);
> @@ -1487,7 +1482,7 @@ void ath_tx_aggr_wakeup(struct ath_softc
> ath_txq_lock(sc, txq);
> ac->clear_ps_filter = true;
>
> - if (!tid->paused && ath_tid_has_buffered(tid)) {
> + if (ath_tid_has_buffered(tid)) {
> ath_tx_queue_tid(txq, tid);
> ath_txq_schedule(sc, txq);
> }
> @@ -1510,7 +1505,6 @@ void ath_tx_aggr_resume(struct ath_softc
> ath_txq_lock(sc, txq);
>
> tid->baw_size = IEEE80211_MIN_AMPDU_BUF << sta->ht_cap.ampdu_factor;
> - tid->paused = false;
>
> if (ath_tid_has_buffered(tid)) {
> ath_tx_queue_tid(txq, tid);
> @@ -1544,8 +1538,6 @@ void ath9k_release_buffered_frames(struc
> continue;
>
> tid = ATH_AN_2_TID(an, i);
> - if (tid->paused)
> - continue;
>
> ath_txq_lock(sc, tid->ac->txq);
> while (nframes > 0) {
> @@ -1844,9 +1836,6 @@ void ath_txq_schedule(struct ath_softc *
> list_del(&tid->list);
> tid->sched = false;
>
> - if (tid->paused)
> - continue;
> -
> if (ath_tx_sched_aggr(sc, txq, tid, &stop))
> sent = true;
>
> @@ -2698,7 +2687,6 @@ void ath_tx_node_init(struct ath_softc *
> tid->baw_size = WME_MAX_BA;
> tid->baw_head = tid->baw_tail = 0;
> tid->sched = false;
> - tid->paused = false;
> tid->active = false;
> __skb_queue_head_init(&tid->buf_q);
> __skb_queue_head_init(&tid->retry_q);

Updated by David Taht on Apr 15, 2014.
Thx felix!

Given that there seems to be a potential race in the code
review I did at:

http://www.bufferbloat.net/issues/442#note-22

another thought is to make the increment and decrement of
txq->pending_frames atomic, or to do a flush before the unlock.

What tree is this patch against?

On Tue, Apr 15, 2014 at 11:46 AM, Felix Fietkau nbd@openwrt.org wrote:
> On 2014-04-15 06:06, Dave Taht wrote:
>> regrettably I am too wiped to look this over further right now, but the patchset
>> seems very promising.
>>
>> I will review on a fresh brain in the morning. Other eyeballs desired
>> - this will have to get patched on top of 3.14 and then backported to
>> the 3.10 backport….
> The patch is a rather crude workaround which unfortunately will not
> help with narrowing down the cause. Also, doing a chip reset because a
> software queue is stuck is overkill.
>
> Please test if this patch helps. The tid->paused flag is no longer
> necessary since my rework of the tx path.

Updated by Felix Fietkau on Apr 16, 2014.
On 2014-04-15 21:00, Dave Taht wrote:
> Thx felix!
>
> Given that there seems to be a potential race in the code
> review I did at:
>
> http://www.bufferbloat.net/issues/442#note-22
>
> another thought is to make the increment and decrement of
>
> txq->pending_frame atomic, or to do a flush before the unlock
I’m not convinced that there’s a race that involves txq->pending_frames.
There is no need to make the increment/decrement atomic, because that
variable is already protected by the txq lock.

> What tree is this patch against?
mac80211 from OpenWrt trunk.

- Felix
Updated by David Taht on Apr 16, 2014.
On Wed, Apr 16, 2014 at 6:11 AM, Felix Fietkau nbd@openwrt.org wrote:
> On 2014-04-15 21:00, Dave Taht wrote:
>> Thx felix!
>>
>> Given that there seems to be a potential race in the code
>> review I did at:
>>
>> http://www.bufferbloat.net/issues/442#note-22
>>
>> another thought is to make the increment and decrement of
>>
>> txq->pending_frame atomic, or to do a flush before the unlock
> I’m not convinced that there’s a race that involves txq->pending_frames.
> There is no need to make the increment/decrement atomic, because that
> variable is already protected by the txq lock.

It and “stopped” are briefly unprotected along that code path.


>> What tree is this patch against?
> mac80211 from OpenWrt trunk.

Thx, will try your patch today.

> - Felix
Updated by Felix Fietkau on Apr 16, 2014.
On 2014-04-16 17:34, Dave Taht wrote:
> On Wed, Apr 16, 2014 at 6:11 AM, Felix Fietkau nbd@openwrt.org wrote:
>> On 2014-04-15 21:00, Dave Taht wrote:
>>> Thx felix!
>>>
>>> Given that there seems to be a potential race in the code
>>> review I did at:
>>>
>>> http://www.bufferbloat.net/issues/442#note-22
>>>
>>> another thought is to make the increment and decrement of
>>>
>>> txq->pending_frame atomic, or to do a flush before the unlock
>> I’m not convinced that there’s a race that involves txq->pending_frames.
>> There is no need to make the increment/decrement atomic, because that
>> variable is already protected by the txq lock.
>
> It and “stopped” are briefly unprotected along that code path.
Where?

- Felix
Updated by David Taht on Apr 16, 2014.
Should I have said “de-protected”? In

linux-3.14/drivers/net/wireless/ath/ath9k/xmit.c

ath_txq_lock(sc, txq);
if (txq == sc->tx.txq_map[q] &&
++txq->pending_frames > sc->tx.txq_max_pending[q] &&
!txq->stopped) {
ieee80211_stop_queue(sc->hw, q);
txq->stopped = true;
}

if (txctl->an && ieee80211_is_data_present(hdr->frame_control))
tid = ath_get_skb_tid(sc, txctl->an, skb);

if (info->flags & IEEE80211_TX_CTL_PS_RESPONSE) {
ath_txq_unlock(sc, txq);
txq = sc->tx.uapsdq;
^^
ath_txq_lock(sc, txq);
} else if (txctl->an &&

On Wed, Apr 16, 2014 at 9:55 AM, Felix Fietkau nbd@openwrt.org wrote:
> On 2014-04-16 17:34, Dave Taht wrote:
>> On Wed, Apr 16, 2014 at 6:11 AM, Felix Fietkau nbd@openwrt.org wrote:
>>> On 2014-04-15 21:00, Dave Taht wrote:
>>>> Thx felix!
>>>>
>>>> Given that there seems to be a potential race in the code
>>>> review I did at:
>>>>
>>>> http://www.bufferbloat.net/issues/442#note-22
>>>>
>>>> another thought is to make the increment and decrement of
>>>>
>>>> txq->pending_frame atomic, or to do a flush before the unlock
>>> I’m not convinced that there’s a race that involves txq->pending_frames.
>>> There is no need to make the increment/decrement atomic, because that
>>> variable is already protected by the txq lock.
>>
>> It and “stopped” are briefly unprotected along that code path.
> Where?
>
> - Felix

Updated by David Taht on Apr 18, 2014.
Could part of it be as simple as the wake check below testing
txq_max_pending with only ‘<’ rather than ‘<=’?

in ath_tx_start:

ath_txq_lock(sc, txq);
if (txq == sc->tx.txq_map[q] &&
++txq->pending_frames > sc->tx.txq_max_pending[q] &&
!txq->stopped) {
ieee80211_stop_queue(sc->hw, q);
txq->stopped = true;
}

in ath_txq_skb_done:

if (txq->stopped &&
txq->pending_frames < sc->tx.txq_max_pending[q]) {
ieee80211_wake_queue(sc->hw, q);
txq->stopped = false;
}
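
A toy model of the two checks quoted above may make the ‘<’ versus ‘<=’ question easier to reason about. This is only a sketch and not the driver code: the class name TxqModel is hypothetical, the limit of 12 is just an illustrative value, and it assumes the decrement of pending_frames happens immediately before the wake test.

```python
class TxqModel:
    """Toy model of the stop/wake checks; not the ath9k implementation."""

    def __init__(self, max_pending, wake_inclusive=False):
        self.max_pending = max_pending
        # True models the proposed '<=' wake test, False the quoted '<' test.
        self.wake_inclusive = wake_inclusive
        self.pending = 0
        self.stopped = False

    def tx_start(self):
        # Mirrors: ++pending_frames > txq_max_pending && !stopped
        self.pending += 1
        if self.pending > self.max_pending and not self.stopped:
            self.stopped = True

    def skb_done(self):
        # Assumption: the counter is decremented here, before the wake test.
        self.pending -= 1
        if self.wake_inclusive:
            below = self.pending <= self.max_pending
        else:
            below = self.pending < self.max_pending
        if self.stopped and below:
            self.stopped = False


def run(wake_inclusive, completions, max_pending=12):
    """Queue max_pending + 1 frames, complete some, return (pending, stopped)."""
    q = TxqModel(max_pending, wake_inclusive)
    for _ in range(max_pending + 1):
        q.tx_start()
    for _ in range(completions):
        q.skb_done()
    return q.pending, q.stopped
```

In this model the two variants differ only in whether the queue wakes after one completion or after two. Either way the wake depends on completions continuing to arrive, so frames that never complete stay counted as pending under both tests.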

Updated by Felix Fietkau on Apr 19, 2014.
On 2014-04-19 05:26, Dave Taht wrote:
> Could part of it be as simple as not checking for ‘<=’ but only < in
> txq_max_pending below?
I don’t see how that would make any meaningful difference in practice.
By the way, did you test my patch?

> in ath_tx_start:
>
> ath_txq_lock(sc, txq);
> if (txq == sc->tx.txq_map[q] &&
> ++txq->pending_frames > sc->tx.txq_max_pending[q] &&
> !txq->stopped) {
> ieee80211_stop_queue(sc->hw, q);
> txq->stopped = true;
> }
>
> in ath_txq_skb_done:
>
> if (txq->stopped &&
> txq->pending_frames < sc->tx.txq_max_pending[q]) {
> ieee80211_wake_queue(sc->hw, q);
> txq->stopped = false;
> }

Updated by David Taht on Apr 19, 2014.
On Sat, Apr 19, 2014 at 4:22 AM, Felix Fietkau nbd@openwrt.org wrote:
> On 2014-04-19 05:26, Dave Taht wrote:
>> Could part of it be as simple as not checking for ‘<=’ but only < in
>> txq_max_pending below?
> I don’t see how that would make any meaningful difference in practice.

Didn’t think it would, still thought <= was more correct.

> By the way, did you test my patch?

It is in the as yet untested 3.10.36-6 build, along with resetting qlen
down to 12 again to try to trigger the bug sooner.

http://snapon.lab.bufferbloat.net/~cero2/cerowrt/wndr/3.10.36-6/


> in ath_tx_start:
>
> ath_txq_lock(sc, txq);
> if (txq == sc->tx.txq_map[q] &&
> ++txq->pending_frames > sc->tx.txq_max_pending[q] &&
> !txq->stopped) {
> ieee80211_stop_queue(sc->hw, q);
> txq->stopped = true;
> }
>
> in ath_txq_skb_done:
>
> if (txq->stopped &&
> txq->pending_frames < sc->tx.txq_max_pending[q]) {
> ieee80211_wake_queue(sc->hw, q);
> txq->stopped = false;
> }
>
>

Updated by Jim Gettys on Apr 28, 2014.
running 3.10.38-1. 2.4ghz hung.

root@cerowrt:~# cat /sys/kernel/debug/ieee80211/phy*/ath9k/queues
(VO): qnum: 0 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(VI): qnum: 1 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BE): qnum: 2 qdepth: 0 ampdu-depth: 0 pending: 12 stopped: 1
(BK): qnum: 3 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(CAB): qnum: 8 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(VO): qnum: 0 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(VI): qnum: 1 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BE): qnum: 2 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BK): qnum: 3 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(CAB): qnum: 8 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0

Updated by David Taht on Apr 28, 2014.
———- Forwarded message ———-
From: Jim Gettys jg@freedesktop.org
Date: Mon, Apr 28, 2014 at 3:10 PM
Subject: [Cerowrt-devel] [bug #442] unfortunately, not fixed.
To: “cerowrt-devel@lists.bufferbloat.net” cerowrt-devel@lists.bufferbloat.net

running 3.10.38-1. 2.4ghz hung.
- Jim

root@cerowrt:~# cat /sys/kernel/debug/ieee80211/phy*/ath9k/queues
(VO): qnum: 0 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(VI): qnum: 1 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BE): qnum: 2 qdepth: 0 ampdu-depth: 0 pending: 12 stopped: 1
(BK): qnum: 3 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(CAB): qnum: 8 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(VO): qnum: 0 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(VI): qnum: 1 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BE): qnum: 2 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BK): qnum: 3 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(CAB): qnum: 8 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0


Updated by Dave Täht on May 14, 2014.
I have been trying to find more ways to trigger this bug faster.

Felix, the patch you’d given me to try - did that make it upstream? It stopped applying to my code and I’d dropped it; I can try to update it… but I’m seeing signs the problem is higher in the stack.

Last night I downloaded and installed openwrt head onto an archer C7 v2 platform, and in about 4 hours got the BK and VI queues to fail using the rrul test, on a WPA2 psk misc enabled system, with no fiddling with qlens. The BE queue is fine. So I’ve now pretty much ruled out cerowrt’s hardware and build as the cause of the problem, and it seems to be universal to ath9k and/or openwrt. Some of what I see here might mean it’s not an ath9k problem either!

DISTRIB_ID="OpenWrt"
DISTRIB_RELEASE="Bleeding Edge"
DISTRIB_REVISION="r40755"
DISTRIB_CODENAME="barrier_breaker"
DISTRIB_TARGET="ar71xx/generic"
DISTRIB_DESCRIPTION="OpenWrt Barrier Breaker r40755"
DISTRIB_TAINTS=""

I’ve finally got enough hardware up and the monitoring interface figured out enough to capture and decrypt packets in the air, but didn’t do that last night.

Anyway, this failure looks like this - BK queue is hosed, BE is not, netperf negotiates a connection, then netperf flips the tos bit and no data comes through:

d@ida:~/public_html/archer/overnight$ netperf -Y CS1,CS1 -H 172.21.0.1
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.21.0.1 () port 0 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00       0.00

To get here I ran the rrul test over and over (it exercises each queue using CS0, CS1, CS5, and EF markings). The data files are in http://snapon.lab.bufferbloat.net/~d/archer/overnight

http://snapon.lab.bufferbloat.net/~d/archer/overnight/normality.png # random sample from earlier in the night

http://snapon.lab.bufferbloat.net/~d/archer/overnight/normality2.png # shortly before it went boom

http://snapon.lab.bufferbloat.net/~d/archer/overnight/bye_vi_vo_queue.png # vi and vo go away

http://snapon.lab.bufferbloat.net/~d/archer/overnight/bye_bk_queue.png # bk queue goes away too

It is kind of interesting that the failures started happening just as people were waking up and getting on the internet (6am), so I will return to testing with more interference on the link….

There is nothing pending in queues:

root@OpenWrt:/sys/kernel/debug/ieee80211/phy1/ath9k# cat queues
(VO):  qnum: 0 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
(VI):  qnum: 1 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
(BE):  qnum: 2 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
(BK):  qnum: 3 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
(CAB): qnum: 8 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0

These errors showed up long before the failure:

[  593.440000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[  635.940000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[  648.130000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 1188.800000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 1626.470000] ath: phy1: Failed to stop TX DMA, queues=0x00e!
[ 1748.010000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 1766.240000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 2909.640000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 3104.710000] ath: phy1: Failed to stop TX DMA, queues=0x004!
[ 3431.860000] Failed to load ipt action
[ 3431.950000] netem: version 1.3
[ 3555.790000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 3561.930000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 3586.300000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 3671.600000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 3756.900000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 3817.930000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 4189.750000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 4201.940000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 4909.110000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 4933.480000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 4939.520000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 5037.110000] ath: phy1: Failed to stop TX DMA, queues=0x004!
[ 5915.000000] ath: phy1: Failed to stop TX DMA, queues=0x00d!
[ 6152.780000] ath: phy1: Failed to stop TX DMA, queues=0x00f!
[ 6644.290000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 6668.560000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 6729.280000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 6735.420000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 6923.430000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 7882.410000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 8908.970000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 8921.060000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 9036.560000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 9097.290000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[31969.060000] ath: phy1: Failed to stop TX DMA, queues=0x005!
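
To read those masks more easily, here is a small sketch that pulls the masks out of saved dmesg text and decodes them. It rests on an unverified assumption: that bit N of the mask refers to the queue reported as "qnum: N" in the queues debug file (0=VO, 1=VI, 2=BE, 3=BK, 8=CAB). The function names are hypothetical.

```python
import re

# Assumption (not confirmed in this thread): bit N corresponds to "qnum: N"
# as printed by the ath9k queues debug file.
QNUM_NAMES = {0: "VO", 1: "VI", 2: "BE", 3: "BK", 8: "CAB"}


def decode_queue_mask(mask):
    """Return the names of the queues whose bits are set in the mask."""
    names = []
    bit = 0
    while mask >> bit:
        if mask & (1 << bit):
            # Unknown bit positions are reported as q<N> rather than guessed.
            names.append(QNUM_NAMES.get(bit, "q%d" % bit))
        bit += 1
    return names


def masks_from_dmesg(text):
    """Extract every 'queues=0x...' value from 'Failed to stop TX DMA' lines."""
    pattern = r"Failed to stop TX DMA, queues=0x([0-9a-fA-F]+)"
    return [int(m, 16) for m in re.findall(pattern, text)]
```

Under that assumption the common 0x005 value would name VO and BE, and 0x00f would name all four data queues.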

This is while trying to send anything marked CS1. You’d think it would
at least be trying, but neither the aggregate nor the tx byte counters budge.

root@OpenWrt:/sys/kernel/debug/ieee80211/phy1/ath9k# cat xmit 
                            BE         BK        VI        VO

MPDUs Queued:                5          0      1042     24803
MPDUs Completed:          1541       2153      4316     40355
MPDUs XRetried:              1          2        33        60
Aggregates:            2309316      80520   1015730         0
AMPDUs Queued HW:            0          0         0         0
AMPDUs Queued SW:     53325311     637157  18719054     15612
AMPDUs Completed:     53323234     634851  18696242         0
AMPDUs Retried:         823483      14659    556819         0
AMPDUs XRetried:           524        151     19404         0
TXERR Filtered:            189         42       239         2
FIFO Underrun:               0          0         0         1
TXOP Exceeded:               0          0         0         0
TXTIMER Expiry:              0          0         0         0
DESC CFG Error:              0          0         0         0
DATA Underrun:               0          0         0         0
DELIM Underrun:              0          0         0         0
TX-Pkts-All:          53325300     637157  18719995     40415
TX-Bytes-All:        270739492  178571059 3602584059   6533793
HW-put-tx-buf:         3442803     225830   1447392     40415
HW-tx-start:                 0          0         0         0
HW-tx-proc-desc:       3441390     224351   1446571     40329
TX-Failed:                   0          0         0         0
root@OpenWrt:/sys/kernel/debug/ieee80211/phy1/ath9k# killall netserver
root@OpenWrt:/sys/kernel/debug/ieee80211/phy1/ath9k# netserver
Starting netserver with host 'IN(6)ADDR_ANY' port '12865' and family AF_UNSPEC
root@OpenWrt:/sys/kernel/debug/ieee80211/phy1/ath9k# cat xmit 
                            BE         BK        VI        VO

MPDUs Queued:                5          0      1042     24868
MPDUs Completed:          1541       2153      4316     40421
MPDUs XRetried:              1          2        33        60
Aggregates:            2309316      80520   1015730         0
AMPDUs Queued HW:            0          0         0         0
AMPDUs Queued SW:     53325413     637157  18719054     15613
AMPDUs Completed:     53323336     634851  18696242         0
AMPDUs Retried:         823483      14659    556819         0
AMPDUs XRetried:           524        151     19404         0
TXERR Filtered:            189         42       239         2
FIFO Underrun:               0          0         0         1
TXOP Exceeded:               0          0         0         0
TXTIMER Expiry:              0          0         0         0
DESC CFG Error:              0          0         0         0
DATA Underrun:               0          0         0         0
DELIM Underrun:              0          0         0         0
TX-Pkts-All:          53325402     637157  18719995     40481
TX-Bytes-All:        270762029  178571059 3602584059   6547198
HW-put-tx-buf:         3442905     225830   1447392     40481
HW-tx-start:                 0          0         0         0
HW-tx-proc-desc:       3441492     224351   1446571     40395
TX-Failed:                   0          0         0         0
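
Comparing the two `xmit` dumps by eye is error-prone, so here is a minimal Python sketch that diffs two snapshots per queue. The function names are hypothetical, and the sample rows are copied from the dumps above. As a simplification, rows that do not split into exactly four numbers (for example when two wide columns run together) are skipped rather than guessed at.

```python
def parse_xmit(text):
    """Map each counter name to its [BE, BK, VI, VO] values."""
    counters = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        # Skip the header, blank lines, and rows whose columns ran together.
        if sep and len(fields) == 4 and all(f.isdigit() for f in fields):
            counters[name.strip()] = [int(f) for f in fields]
    return counters

def xmit_delta(before, after):
    """Per-queue change (after minus before) for counters in both snapshots."""
    return {
        name: [b - a for a, b in zip(before[name], after[name])]
        for name in before
        if name in after
    }

BEFORE = """\
MPDUs Queued:                5          0      1042     24803
AMPDUs Queued SW:     53325311     637157  18719054     15612
TX-Pkts-All:          53325300     637157  18719995     40415
TX-Bytes-All:        270739492  1785710593602584059   6533793
"""
AFTER = """\
MPDUs Queued:                5          0      1042     24868
AMPDUs Queued SW:     53325413     637157  18719054     15613
TX-Pkts-All:          53325402     637157  18719995     40481
TX-Bytes-All:        270762029  1785710593602584059   6547198
"""

for name, delta in xmit_delta(parse_xmit(BEFORE), parse_xmit(AFTER)).items():
    print(f"{name:18s} BE={delta[0]} BK={delta[1]} VI={delta[2]} VO={delta[3]}")
```

On these rows only the BE and VO columns change between the two dumps. The BK and VI columns stay flat.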

And we don’t show any packets attempting to enter the bk queue (1:4)
either. (same test as above). Deleting and recreating the qdisc
doesn’t work either.

(I note that I am trying huge targets and intervals with some success with the
longer qlens….)

root@OpenWrt:/sys/kernel/debug/ieee80211/phy1/ath9k# tc -s qdisc show dev wlan1
qdisc mq 1: root 
 Sent 42609841871 bytes 63704518 pkt (dropped 123636, overlimits 0 requeues 291608) 
 backlog 0b 0p requeues 291608 
qdisc fq_codel 803d: parent 1:1 limit 1024p flows 1024 quantum 1514 target 30.0ms interval 300.0ms 
 Sent 1044525 bytes 15565 pkt (dropped 0, overlimits 0 requeues 408) 
 backlog 0b 0p requeues 408 
  maxpacket 256 drop_overlimit 0 new_flow_count 201 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc fq_codel 803e: parent 1:2 limit 1024p flows 1024 quantum 1514 target 30.0ms interval 300.0ms ecn 
 Sent 10914015954 bytes 17822230 pkt (dropped 15479, overlimits 0 requeues 77502) 
 backlog 0b 0p requeues 77502 
  maxpacket 1514 drop_overlimit 0 new_flow_count 223799 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc fq_codel 803f: parent 1:3 limit 1024p flows 1024 quantum 1514 target 30.0ms interval 300.0ms ecn 
 Sent 31540575052 bytes 45257237 pkt (dropped 108091, overlimits 0 requeues 213419) 
 backlog 0b 0p requeues 213419 
  maxpacket 1514 drop_overlimit 0 new_flow_count 1771734 ecn_mark 2325
  new_flows_len 0 old_flows_len 0
qdisc fq_codel 8040: parent 1:4 limit 1024p flows 1024 quantum 1514 target 30.0ms interval 300.0ms 
 Sent 154206340 bytes 609486 pkt (dropped 66, overlimits 0 requeues 279) 
 backlog 0b 0p requeues 279 
  maxpacket 1514 drop_overlimit 0 new_flow_count 255 ecn_mark 0
  new_flows_len 0 old_flows_len 0

So, like, I wipe out that qdisc… and try exercising the CS1 or CS5 (BK or VI) queues to no effect

d@ida:~/public_html$ netperf -Y CS1,CS1 -H 172.21.18.1 -t TCP_MAERTS
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.21.18.1 () port 0 AF_INET : demo
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 87380  16384  16384    10.00       0.00   
d@ida:~/public_html$ netperf -Y CS5,CS5 -H 172.21.18.1 -t TCP_MAERTS
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.21.18.1 () port 0 AF_INET : demo
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 87380  16384  16384    10.00       0.00   

root@OpenWrt:/sys/kernel/debug/ieee80211/phy1/ath9k# tc -s qdisc show dev wlan1
qdisc mq 1: root 
 Sent 33368 bytes 183 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 
qdisc sfq 8041: parent 1:1 limit 127p quantum 1514b depth 127 divisor 1024 
 Sent 145 bytes 1 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 
qdisc sfq 8042: parent 1:2 limit 127p quantum 1514b depth 127 divisor 1024 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 
qdisc sfq 8043: parent 1:3 limit 127p quantum 1514b depth 127 divisor 1024 
 Sent 33223 bytes 182 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 
qdisc sfq 8044: parent 1:4 limit 127p quantum 1514b depth 127 divisor 1024 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 
Updated by Dave Täht on May 14, 2014.
d@ida:~/public_html$ netperf -Y BE,BE -H 172.21.18.1 -t TCP_MAERTS
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.21.18.1 () port 0 AF_INET : demo
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 87380  16384  16384    10.00      55.96  

d@ida:~/public_html$ netperf -Y CS1,CS1 -H 172.21.18.1 -t TCP_MAERTS
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.21.18.1 () port 0 AF_INET : demo
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 87380  16384  16384    10.00       0.00   

root@OpenWrt:/sys/kernel/debug/ieee80211/phy1/ath9k# tc -s qdisc show dev wlan1
qdisc mq 1: root 
 Sent 73651767 bytes 48745 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 
qdisc fq_codel 8046: parent 1:1 limit 1024p flows 1024 quantum 1514 target 30.0ms interval 300.0ms 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 
  maxpacket 256 drop_overlimit 0 new_flow_count 0 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc fq_codel 8047: parent 1:2 limit 1024p flows 1024 quantum 1514 target 30.0ms interval 300.0ms ecn 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 
  maxpacket 256 drop_overlimit 0 new_flow_count 0 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc fq_codel 8048: parent 1:3 limit 1024p flows 1024 quantum 1514 target 30.0ms interval 300.0ms ecn 
 Sent 73651767 bytes 48745 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 
  maxpacket 256 drop_overlimit 0 new_flow_count 0 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc fq_codel 8049: parent 1:4 limit 1024p flows 1024 quantum 1514 target 30.0ms interval 300.0ms 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 
  maxpacket 256 drop_overlimit 0 new_flow_count 0 ecn_mark 0
  new_flows_len 0 old_flows_len 0
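
To see at a glance which mq classes carried traffic, the `tc -s qdisc show` output can be reduced to a packet count per parent class. This is a minimal Python sketch with a hypothetical function name. It assumes the two-line layout shown above, a `qdisc ...` header followed by a `Sent ... bytes ... pkt` line, and the sample is trimmed from the output above.

```python
import re

QDISC_RE = re.compile(r"^qdisc (\S+) (\S+) (?:root|parent (\S+))")
SENT_RE = re.compile(r"^\s*Sent (\d+) bytes (\d+) pkt")

def packets_per_class(text):
    """Map each qdisc's parent class (or "root") to its sent-packet count."""
    result = {}
    current = None
    for line in text.splitlines():
        header = QDISC_RE.match(line)
        if header:
            current = header.group(3) or "root"
            continue
        sent = SENT_RE.match(line)
        if sent and current is not None:
            result[current] = int(sent.group(2))
            current = None
    return result

SAMPLE = """\
qdisc mq 1: root
 Sent 73651767 bytes 48745 pkt (dropped 0, overlimits 0 requeues 0)
qdisc fq_codel 8046: parent 1:1 limit 1024p flows 1024 quantum 1514
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
qdisc fq_codel 8047: parent 1:2 limit 1024p flows 1024 quantum 1514
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
qdisc fq_codel 8048: parent 1:3 limit 1024p flows 1024 quantum 1514
 Sent 73651767 bytes 48745 pkt (dropped 0, overlimits 0 requeues 0)
qdisc fq_codel 8049: parent 1:4 limit 1024p flows 1024 quantum 1514
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
"""

print(packets_per_class(SAMPLE))
```

In this sample every packet went through class 1:3. Nothing was accounted to 1:4 even though a CS1 test had just been run.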

doing some captures now

Updated by Dave Täht on May 14, 2014.
ok, so we get to the 10th netperf packet, which is a syn attempt, marked dscp 0x022 (or 0x08 if you prefer to shift it right). It is transmitted to my local (laptop) driver successfully. But it does not arrive at the destination. It tries to retransmit that syn a couple times and fails to get through.
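
The "shift it right" remark refers to how the DSCP sits in the upper six bits of the IP TOS byte, with the two ECN bits below it. A minimal Python sketch of that split follows. Reading the 0x022 above as the full TOS byte is an assumption, and the function name is hypothetical.

```python
def split_tos(tos):
    """Split an IP TOS / traffic-class byte into (DSCP, ECN)."""
    return tos >> 2, tos & 0x03

dscp, ecn = split_tos(0x22)
print(f"dscp=0x{dscp:02x} ecn={ecn}")  # prints: dscp=0x08 ecn=2
```

DSCP 8 is CS1, which matches the `-Y CS1,CS1` marking used in the netperf runs. The leftover ECN value of 2 would be ECT(0).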

Now, syn attempts marked this way usually work (and I can get the same behavior with udp). The checksum appears correct, as well. And here I have a case where it’s my client blowing up, not necessarily the router. So I’m going to reboot the client….

Anyway the netperf transaction fails at the 27th packet in the local2 capture, and is not received.

Updated by Dave Täht on May 14, 2014.
ok, a reboot of this client (ubuntu 3.11.0-19-generic) clears this problem. That doesn’t mean a lot…

I will add another client with a different chipset and try to trigger the failure from it. I am resuming beating the archer up, this time with both ipv4 and ipv6, from this client.

03:00.0 Network controller: Intel Corporation PRO/Wireless 5100 AGN [Shiloh] Network Connection

I am chasing possibly 3 separate bugs here.

Updated by Dave Täht on May 14, 2014.
I got it to re-occur in 20 minutes this time. Associating and disassociating from the iwl cleared it.

booting up a couple more boxes now…

Updated by David Taht on Jun 5, 2014.
turning off crypto doesn’t help

---------- Forwarded message ----------
From: Jim Gettys jg@freedesktop.org
Date: Thu, Jun 5, 2014 at 9:19 AM
Subject: turning off crypto didn’t help.
To: Dave Taht dave.taht@gmail.com

The 2.4 ghz interface hung again last night….

  • Jim
Updated by Dave Täht on Jun 28, 2014.
after continuously beating on the thing for days, I saw this go by in the dmesg.

[210237.781250] ------------[ cut here ]------------
[210237.789062] WARNING: at /build/cero2/src/cerowrt-3.10/build_dir/target-mips_34kc_uClibc-0.9.33.2/linux-ar71xx_generic/compat-wireless-2014-05-22/net/mac80211/rx.c:3372 ieee80211_rx+0x13c/0x7f8 [mac80211]()
[210237.804687] Rate marked as an HT rate but passed status->rate_idx is not an MCS index [0-76]: 92 (0x5c)
[210237.816406] Modules linked in: ath9k ath9k_htc ath9k_common iptable_nat ath9k_hw ath pppoe nf_nat_ipv4 nf_conntrack_ipv4 mac80211 cfg80211 xt_u32 xt_time xt_tcpudp xt_tcpmss xt_string xt_statistic xt_state xt_recent xt_quota xt_pkttype xt_physdev xt_owner xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_hl xt_helper xt_hashlimit xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_addrtype xt_TCPMSS xt_REDIRECT xt_LOG xt_IPMARK xt_HL xt_DSCP xt_CT xt_CLASSIFY usbnet ts_kmp ts_fsm ts_bm pptp pppox ppp_async nf_nat_irc nf_nat_ftp nf_defrag_ipv4 nf_conntrack_netlink nf_conntrack_irc nf_conntrack_ftp iptable_raw iptable_mangle iptable_filter ipt_REJECT ipt_MASQUERADE ipt_ECN ip_tables crc_ccitt compat_xtables compat sch_teql sch_tbf sch_sfq sch_red sch_qfq sch_prio sch_pie sch_ns2_codel sch_nfq_codel sch_netem sch_htb sch_gred sch_efq_codel sch_dsmark sch_codel em_text em_nbyte em_meta em_cmp cls_basic act_police act_ipt act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw sch_hfsc sch_ingress leds_wndr3700_usb ledtrig_usbdev xt_set ip_set_list_set ip_set_hash_netport ip_set_hash_netiface ip_set_hash_net ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_hash_ipport ip_set_hash_ip ip_set_bitmap_port ip_set_bitmap_ipmac ip_set_bitmap_ip ip_set nfnetlink sr_mod cdrom ip6t_NPT ip6t_MASQUERADE ip6table_nat nf_nat_ipv6 nf_nat ip6t_REJECT ip6table_raw ip6table_mangle ip6table_filter ip6_tables x_tables nf_conntrack_ipv6 nf_conntrack nf_defrag_ipv6 pppoatm ppp_generic slhc ip_gre gre ifb nat46 sit ipip ip6_tunnel tunnel6 tunnel4 ip_tunnel tun vfat fat autofs4 br2684 atm nls_iso8859_2 nls_iso8859_15 nls_iso8859_13 nls_iso8859_1 nls_cp437 ipv6 authenc aead arc4 crypto_blkcipher usb_storage ohci_hcd ehci_platform ehci_hcd sd_mod scsi_mod gpio_button_hotplug ext4 crc16 jbd2 mbcache usbcore nls_base usb_common crypto_hash
[210237.980468] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 3.10.44 #1
[210237.988281] Stack : 00000000 00000000 00000000 00000000 803a2eba 00000036 87828a58 86a1dd70
[210237.988281] 802f19d8 803437bb 00000003 803a2664 87828a58 86a1dd70 00000065 85c96618
[210237.988281] 803c0000 80079704 00000003 80077184 868cb40c 86a1dd70 802f32ac 87841c64
[210237.988281] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[210237.988281] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 87841bf0
[210237.988281] …
[210238.023437] Call Trace:
[210238.027343] [<8006e52c>] show_stack+0x48/0x70
[210238.031250] [<80077280>] warn_slowpath_common+0x78/0xa8
[210238.035156] [<800772dc>] warn_slowpath_fmt+0x2c/0x38
[210238.042968] [<8689f278>] ieee80211_rx+0x13c/0x7f8 [mac80211]
[210238.046875] [<86966410>] ath_rx_tasklet+0x96c/0x9b8 [ath9k]
[210238.050781] [<86963cd0>] ath9k_tasklet+0x1ac/0x230 [ath9k]
[210238.058593] [<8007ea04>] tasklet_action+0x84/0xcc
[210238.062500] [<8007e200>] __do_softirq+0xd0/0x1bc
[210238.066406] [<8007e318>] run_ksoftirqd+0x2c/0x58
[210238.074218] [<8009a7c4>] smpboot_thread_fn+0x134/0x164
[210238.078125] [<800938a8>] kthread+0xb0/0xb8
[210238.082031] [<80060878>] ret_from_kernel_thread+0x14/0x1c
[210238.085937]
[210238.089843] ---[ end trace 57a62c568dfcbfb7 ]---
root@davedesk:~#

Updated by David Taht on Jul 6, 2014.
---------- Forwarded message ----------
From: “Philip”
Date: Jul 6, 2014 10:18 AM
Subject: BufferBloat Bug 442
To: davetaht
Cc:

Hello

I am a general user running the netgear router with cerowrt. I have the
latest build version of 3.10.44-6/
http://snapon.lab.bufferbloat.net/%7Ecero2/cerowrt/wndr/3.10.44-6/. The
wifi drops sometimes during the day and I have to manually turn off the
router and turn it on. I noticed when I do heavy video streaming the wifi
will crash. Is there a stable version that I can downgrade to? Please let
me know

thanks

Updated by Felix Fietkau on Jul 23, 2014.
Hey,

with a lot of debugging help from Antonio Quartulli, I believe I finally
found and fixed the cause of this bug.
When aggregation sessions are set up and torn down frequently, the
driver queue can end up with frames marked for A-MPDU while an
aggregation session is not active (and often cannot be established anymore).
I committed the fix to it in r41815 (also sent to linux-wireless@).
Please test.

  • Felix
Updated by David Taht on Jul 23, 2014.
So far as I have heard the latest build IS more stable than anything
prior under conditions of low signal strength or OSX. (how’s everyone
doing this week?)

I had long suspected we were actually seeing several bugs masquerading as one.

new hope:

https://dev.openwrt.org/changeset/41815

Updated by Sebastian Moeller on Jul 24, 2014.
Hi Dave,

On Jul 24, 2014, at 07:48 , Dave Taht dave.taht@gmail.com wrote:

So far as I have heard the latest build IS more stable than anything
prior under conditions of low signal strength or OSX. (how’s everyone
doing this week?)

Still all fine (but uptime is 5 days; the last time I needed ~20 days for the queue to get stuck). Also I note that for me the problem typically develops on the 2.4GHz radio, which only hosts an old Nexus 7 and two Nexus 4 (nexi?). The macbook and macbook pro on the 5GHz radio seem rather stable. I guess what I want to say is that the macs might be good at flushing out this issue, but it is not a mac-only issue ;) I only tested under IPv4, but this seems to work well already…


I had long suspected we were actually seeing several bugs masquerading as one.

new hope:

https://dev.openwrt.org/changeset/41815

So, I will wait for a fortnight of uptime, before switching to a potential newer cerowrt version, just to see whether I can break 3.10.48-2…

Best Regards
Sebastian

Updated by David Taht on Aug 16, 2014.
I am told that several sites that had severe problems with wifi
hanging have now been up,
for over 2 weeks, without problems, with the 3.10.50-1 release of cerowrt.

How is everyone else doing?

Are we allowed to feel joy and relief at finally having a reasonably
stable release yet?

Updated by Sebastian Moeller on Aug 16, 2014.
Hi R.

what is the output of:
cat /sys/kernel/debug/ieee80211/phy0/ath9k/queues

and
cat /sys/kernel/debug/ieee80211/phy1/ath9k/queues

when it gets stuck? I wonder whether you see the actual same bug as #442 or some other bug. (I think the current theory is that a number of bugs contributed to the symptoms we described as #442 and now we have to tease them apart one by one).
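
For anyone who wants to capture both files in one go when the hang occurs, here is a minimal Python sketch that snapshots the ath9k debug files for every phy. It assumes debugfs is mounted at /sys/kernel/debug and that Python is available on the box, which may not be true on a stock router image. The function name is hypothetical, and `xmit` is included only because it was dumped earlier in this thread.

```python
import glob
import os

def collect_ath9k_debug(root="/sys/kernel/debug/ieee80211", names=("queues", "xmit")):
    """Return {path: contents} for the ath9k debug files of every phy under root."""
    report = {}
    for name in names:
        for path in sorted(glob.glob(os.path.join(root, "phy*", "ath9k", name))):
            try:
                with open(path) as handle:
                    report[path] = handle.read()
            except OSError as err:
                # Record the failure instead of aborting the whole snapshot.
                report[path] = f"<unreadable: {err}>"
    return report

if __name__ == "__main__":
    for path, contents in collect_ath9k_debug().items():
        print(f"==> {path} <==")
        print(contents)
```

Running it once while things work and once while stuck gives two snapshots to compare.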

Best Regards
Sebastian

On Aug 16, 2014, at 21:07 , R. redag2@gmail.com wrote:

Had to move my client devices from WPA2 to open AP, as I was getting daily failures. I did not experience the stability that you talk of. :(

Manually rebooting at least once a week also happens as dhcp server fails.

On Aug 16, 2014 2:55 PM, “Daniel Ezell” dezell@stonescry.com wrote:
No drops here since installing. Looks great to me.
Daniel

On Aug 16, 2014 11:23 AM, “Dave Taht” dave.taht@gmail.com wrote:
I am told that several sites that had severe problems with wifi
hanging have now been up,
for over 2 weeks, without problems, with the 3.10.50-1 release of cerowrt.

How is everyone else doing?

Are we allowed to feel joy and relief at finally having a reasonably
stable release yet?

Updated by Rich Brown on Aug 16, 2014.

I will try to keep an eye and report back. You know what would be
really useful? A breakdown of all the useful logs that one would need
to provide when reporting on a bug.

Even more user-friendly would be a script that generates all relevant
information/logs to debugging. Perhaps one day? :)

Check the cerostats.sh script that’s in /usr/lib/CeroWrtScripts for recent CeroWrt builds. That collects a number of interesting stats and puts them in /tmp/cerostats_output.txt

Rich

Updated by guozheng qian on Aug 21, 2014.
Dear Felix, I applied your patch from https://dev.openwrt.org/changeset/41815 on top of openwrt svn revision 41808 and tested the wireless performance, but I still see dmesg messages like the ones below. Do these messages affect the stability of the firmware? By the way, I am using TP-Link tl-wr841n-v8 hardware.

root@YSWiFi:/# [ 437.230000] ath: phy0: Failed to stop TX DMA, queues=0x005!
[ 489.680000] ath: phy0: Failed to stop TX DMA, queues=0x005!
[ 495.910000] ath: phy0: Failed to stop TX DMA, queues=0x005!
[ 515.950000] ath: phy0: Failed to stop TX DMA, queues=0x005!
[ 529.190000] ath: phy0: Failed to stop TX DMA, queues=0x005!
[ 530.000000] ath: phy0: Failed to stop TX DMA, queues=0x004!
[ 639.230000] ath: phy0: Failed to stop TX DMA, queues=0x004!
[ 828.800000] ath: phy0: Failed to stop TX DMA, queues=0x005!
[ 849.440000] ath: phy0: Failed to stop TX DMA, queues=0x005!
[ 856.640000] ath: phy0: Failed to stop TX DMA, queues=0x004!

Updated by Jim Gettys on Oct 7, 2014.
ah, at last…. Ding, dong, the witch is dead!

This is a static export of the original bufferbloat.net issue database. As such, no further commenting is possible; the information is solely here for archival purposes.