We have thus far ruled out SYN flood protection and 6in4 encapsulation.
Some users never see the problem; others can get it to happen in a few hours.
———- Forwarded message ———-
From: Dave Taht dave.taht@gmail.com
Date: Sat, Apr 5, 2014 at 9:15 AM
Subject: Re: [Cerowrt-devel] cerowrt-3.10.34-4 dev build released
To: Neil Shepperd nshepperd@gmail.com
Cc: “cerowrt-devel@lists.bufferbloat.net”
cerowrt-devel@lists.bufferbloat.net
In_trying_to_sort_out_the_differences_between_the_people
working_wifi_for_long_periods,vs_those_without…
I_am_curious_if_your_country
code_is_set,and_what_it_is_set_to,and_your_wifi_channel_set
It_is_long_past_time_we_start_up_a_formal_bug_for_this,
but_I’ll_wait_for_my_spacebar.
In_a_known_pretty_good_case:
root@lorna-gw:~# cat /etc/openwrt_release
DISTRIB_ID="CeroWrt"
DISTRIB_RELEASE="3.10.32-9"
DISTRIB_REVISION="r39917"
DISTRIB_CODENAME="toronto"
DISTRIB_TARGET="ar71xx/generic"
DISTRIB_DESCRIPTION="CeroWrt Toronto 3.10.32-9"
DISTRIB_TAINTS="no-all busybox"
root@lorna-gw:~# uptime
16:07:37 up 21 days, 21:35, load average: 0.00, 0.01, 0.04
root@lorna-gw:~# egrep -i "country|channel|htmode" /etc/config/wireless
option channel 11
option htmode HT20
option channel '44'
option htmode HT40+
option country 'US'
On Sat, Apr 5, 2014 at 9:02 AM, Dave Taht dave.taht@gmail.com wrote:
> On Sat, Apr 5, 2014 at 5:49 AM, Neil Shepperd nshepperd@gmail.com wrote:
>>> Sounds like you are going to stick with -4 for a bit?
>>
>> Actually, this is the first time I’ve tried cerowrt on a router. But
>> yeah, I’ll stick with the current version unless you come out with a new
>> patch to try.
>
> Thx. I am hoping this is the last priority 1 bug cerowrt has.
>
> but_fixing_it_is_going_to_be_pita.
>
> I_confess_to_“embedded_fatigue”.
>
>>> what I’ve been doing is mounting a usb stick, and just running continuously
>>> on the stick
>>>
>>> tcpdump -s 128 -i ge00 -w ge00.cap &
>>> tcpdump -s 128 -i sw00 -w sw00.cap &
>>>
>>> This definitely hurts performance…
>>>
>>> And it’s probably time to do a tcpdump on the connected device as well.
>>>
>>
>> Update: I did this, and experienced the hang again. A first look at the
>> tcpdump output on sw00 shows a sudden reduction in traffic at 20:40:54,
>> so I assume that’s probably the time of the event. After that, I see
>> many DHCP and ARP requests arriving, but no responses leaving the interface.
>
>
It_would_be_nice_to_see_10sec_of_these_captures_before_and_after.
>
>>
>> In fact, I don’t see anything leaving except, oddly, some DNS responses
>> (which are indeed received by my laptop). I also see some EAPOL stuff on
>> both the router and laptop at roughly the same time, so I guess that’s
>> getting through, but I don’t know the direction.
>>
>> I think next time I’ll try with -Pin/-Pout to separate incoming and
>> outgoing packets properly…
>
> Tis easier_to_sort_in_wireshark_against_one_capture,IMHO.
>
>
I_have_been_looking_for_failed_syn_attempts_and_retries_as_a_key_indicator
> that_something_Bad_happened.
>
>>> Hmm. OK, this brings back the device driver into the equation… I
>>> WAS seeing dhcp and arp requests “getting through” from the captures,
>>> and it seemed like arp in particular was getting through…
>>
>> So I guess this is only half right? What I see in syslog is dnsmasq
>> saying it has sent a packet, but it doesn’t make it onto the interface.
>> Apart from DNS packets, so I don’t know what to make of that.
>
> It_is_possible_there_are_a_variety_of_failure_modes.
>
>
I_am_not_entirely_convinced_this_is_actually_a_wifi_specific_failure.
>
>
can_you_try_ssh_to_the_router_during_a_failure,and/or_accessing
> the_web_admin_interface?and/or_trying_to
>
>
if_you_are_not_using_babel_disable_it.It_makes_a_lot_of_updates
> to_the_routing_table.that_might_be_malfunctioning..
>
>
(I_really_need_a_keyboard_that_recovers_from_damp_weather.)
>
>> Neil
>
>
>
I have an OSX laptop on 5GHz, a Linux desktop and server via ethernet,
a Linux laptop via 5GHz, a Roku via 5GHz, a Nexus 7 via 5GHz, and misc other
devices… I didn’t get my total bandwidth on 3.10.32-12 or 3.10.34-1,
but I’ve done 3.3GB down and 0.9GB up since flashing 3.10.34-4. I’ve had
no problems on any of those builds. It’s been rock solid for me. I
work from home two days a week (Tues and Thurs), with a wireless connection
via my work OSX laptop. Since the 3.10.x series, I’ve noticed that
WiFi has been noticeably faster. If there is a roll-back of the
kernel, would it be possible to keep a fork with the latest
kernel too? Otherwise, how will we know when the issue is fixed?
Sorry to be a PitA.
On Fri, Apr 4, 2014 at 12:58 AM, Dave Taht dave.taht@gmail.com wrote:
>
> On Thu, Apr 3, 2014 at 3:57 PM, Aaron Wood woody77@gmail.com wrote:
> > On Fri, Apr 4, 2014 at 12:56 AM, Aaron Wood woody77@gmail.com wrote:
> >>
> >> Up for 10 days on 3.10.32-12 (WNDR3800). Only have 2 devices that run
> >> 2.4GHz, and it’s only seen 2GB of traffic on SW00 in that time… The 5GHz
> >> radio has had >5GB of traffic on it in the same time. No problems at all.
> >
> >
> > And I also have both 2.4 and 5GHz babel and guest SSIDs all turned off.
> >
> > -Aaron
>
> Your clients are?
>
> So far there seems to be a significant trend towards osx being an issue…
iOS 7 (a pair of iPhone 4’s). Everything that supports 5GHz is using 5GHz.
-Aaron
The last release without wifi issues was 3.8.something (I think it was
called Berlin). The whole 3.10.x branch seems to have broken wifi
(will see how 3.10.24-4 goes; it seems OK, but it’s been running for less
than 24 hours so far).
I’m using only 2.4GHz (5GHz was dead in the water: devices couldn’t
connect at all, so I disabled it). Guest and babel are disabled.
Regards,
Max
On Fri, Apr 4, 2014 at 11:36 AM, Dave Taht dave.taht@gmail.com wrote:
>
> Is there a recent version that people had that was seemingly stable for
> wifi that we could step back to and bisect from? Something where
> you had heavy wifi use for week(s) without a problem?
>
> (I know that until we got focused on this, and people focused on
> reporting it, that maybe it was happening in releases I’d otherwise
> considered to be “pretty good”… so please report in on your “best”
> releases this year…)
>
> Worst case we can step back to that kernel for a while and proceed forward
> on all the other stuff. I know I crave stability at this point, and I’m
> unhappy that everyone here is unhappy, too…
>
> Regrettably since losing my lab I have not been in a position to easily
> test wifi to any huge extent. I’m slowly building that up (but for example
> no longer have a mac to test with)
>
>
> On Thu, Apr 3, 2014 at 11:20 AM, Neil Shepperd nshepperd@gmail.com wrote:
> > I just flashed 3.10.34-4 to my new WNDR3800 and experienced the exact
> > wifi hang described by Toke Høiland-Jørgensen. But I’m on the 2.4GHz
> > network (with guest and babel disabled). Unfortunately I didn’t think to
> > try tracing anything from the router side before resetting the wireless.
>
> cool you disabled guest and babel. So far we’ve sort of ruled out
> 6in4 tunnelling, and syn flood protection.
>
> Sounds like you are going to stick with -4 for a bit?
>
> what I’ve been doing is mounting a usb stick, and just running continuously
> on the stick
>
> tcpdump -s 128 -i ge00 -w ge00.cap &
> tcpdump -s 128 -i sw00 -w sw00.cap &
>
> This definitely hurts performance…
>
> And it’s probably time to do a tcpdump on the connected device as well.
>
> In terms of other diags… (any suggestions?)
>
> > Syslog was filled with a lot of
> >
> > DHCPDISCOVER (sw00) [MAYBE IP] [MAC ADDRESS]
> > DHCPOFFER (sw00) [IP] [MAC ADDRESS]
>
> Hmm. OK, this brings back the device driver into the equation… I
> WAS seeing dhcp and arp requests “getting through” from the captures,
> and it seemed like arp in particular was getting through…
>
> >
> > but the offers aren’t being received at my laptop.
> >
> > Just another data point I guess.
>
> Well, I’d hoped it would be a confirming one rather than one opening
> up more questions.
>
> > Neil
>
>
>
On Wed, Apr 2, 2014 at 9:48 PM, Stephen Hemminger
stephen@networkplumber.org wrote:
>
> I am seeing wireless hang as well.
> Mostly when multiple macbooks are active on 2.4g
>
>
Also true in my house: both kids are on Macbooks.
But I’ve seen the problem with no-one but me (on Linux) around, so all
that says is that if the router is in use more, you see more failures.
So I’m not sure I can draw much from this experience.
3.10.34-3.10.36 does not seem to have any relevant patches, but I just
updated to 3.10.36 anyway.
In openwrt head, there has been a problem in dhcpv6 renews, which you
can see on the dhcpv6 web page after a day or so. That looks to be
fixed now.
So I just merged from openwrt head.
I try to be happy that most of our problems are now taking days to crop up.
I will probably produce a topic branch at this point which will have heavy
levels of debugging enabled. I’d like to be able to trace packets from
origin to (non) exit, somehow…
root@outpost:~# cat /etc/openwrt_release
DISTRIB_ID="CeroWrt"
DISTRIB_RELEASE="3.10.34-4"
DISTRIB_REVISION="r40361"
DISTRIB_CODENAME="toronto"
DISTRIB_TARGET="ar71xx/generic"
DISTRIB_DESCRIPTION="CeroWrt Toronto 3.10.34-4"
DISTRIB_TAINTS="no-all busybox"
root@outpost:~# uptime
22:52:44 up 2 days, 11:16, load average: 0.00, 0.01, 0.04
root@outpost:~# egrep -i "country|channel|htmode" /etc/config/wireless
option channel 11
option htmode HT40-
option channel 36
option htmode HT40+
root@cerowrt:~# cat /etc/openwrt_release
DISTRIB_ID="CeroWrt"
DISTRIB_RELEASE="3.10.34-4"
DISTRIB_REVISION="r40361"
DISTRIB_CODENAME="toronto"
DISTRIB_TARGET="ar71xx/generic"
DISTRIB_DESCRIPTION="CeroWrt Toronto 3.10.34-4"
DISTRIB_TAINTS="no-all busybox"
root@cerowrt:~# uptime
12:22:42 up 1 day, 18:37, load average: 0.04, 0.04, 0.05
root@cerowrt:~# egrep -i "country|channel|htmode" /etc/config/wireless
option htmode 'HT20'
option country 'AU'
option channel 'auto'
option htmode 'HT20'
option country 'AU'
option channel 'auto'
It_would_be_nice_to_see_10sec_of_these_captures_before_and_after.
Uploaded at http://zlkj.in/files/wireshark/. I filtered the captures in
wireshark for frame.time > "April 5, 2014 20:40:44", which is about 10
seconds before the bug. wlan0.cap is the capture from my laptop. ppp.cap
is from the pppoe connection on ge00.
It_is_possible_there_are_a_variety_of_failure_modes.
I_am_not_entirely_convinced_this_is_actually_a_wifi_specific_failure.
can_you_try_ssh_to_the_router_during_a_failure,and/or_accessing
the_web_admin_interface?and/or_trying_to
I can ssh in and access the admin interface if I connect my laptop by an
ethernet cable. But during the failure, I can’t access the admin
interface or the internet over sw00. After resetting sw00 via the admin
interface on se00, I can connect over the wireless again.
if_you_are_not_using_babel_disable_it.It_makes_a_lot_of_updates
to_the_routing_table.that_might_be_malfunctioning..
I thought I disabled babel, but I’m still seeing babel packets in the
capture, so I guess disabling the “babel” networks on both radios in the
wifi tab is not enough.
What I’m doing at the moment is capturing the mon0 interface with
wireshark while beating up the network as much as I can. (and trying
to come up with ways to parse the results sanely)
http://wiki.wireshark.org/CaptureSetup/WLAN
There are some instructions for BSD OSX in there too.
There isn’t a way to do this in windows, apparently, without a special device:
The background wifi queue (1:40) gets wedged.
This explains why this only seemed to happen on comcast (which
re-marks a LOT of traffic as background that it shouldn’t, and yes, we should
start mangling packets back to “be” in sqm as an option), and why local
traffic seemed to mostly work when stuff coming back from the internet didn’t.
As to why it happens, I don’t know. I’m sitting in the #bufferbloat channel
scratching my head as to how to explore the problem without
unwedging the interface.
It seems plausible we can MUCH more easily reproduce this now by flooding the
background queues with traffic (netperf can do this). It’s not clear whether
you can trigger it with just tcp, however, or whether multiple hops are
required, etc.
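For anyone without netperf at hand, the idea of flooding the background queue can be sketched in a few lines of Python. This is only a rough stand-in, not the test used in this thread: it assumes the usual convention that DSCP CS1 is treated as background and ends up in the BK wifi queue, and the function names, packet size, and UDP transport are all illustrative choices.

```python
import socket

# Assumption: CS1 (DSCP 8) is the codepoint treated as "background".
DSCP_CS1 = 8
DSCP_BE = 0


def dscp_to_tos(dscp):
    """Shift a 6-bit DSCP value into the upper six bits of the IPv4 TOS byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in 0..63")
    return dscp << 2


def make_marked_udp_socket(dscp):
    """Return a UDP socket whose outgoing packets carry the given DSCP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
    return sock


def flood(target, port, count, dscp=DSCP_CS1, size=1200):
    """Send `count` UDP datagrams marked with `dscp`; return total bytes sent."""
    payload = b"\0" * size
    sent = 0
    with make_marked_udp_socket(dscp) as sock:
        for _ in range(count):
            sent += sock.sendto(payload, (target, port))
    return sent
```

Run from a wired host towards a wireless client (for example `flood("172.30.42.100", 9, 100000)`, where the address is a made-up placeholder), this should push traffic through the background queue; whether it is enough to wedge it is an open question.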
root@cerowrt:/mnt/disk1# tc -s qdisc show dev sw00
qdisc mq 1: root
Sent 3926131082 bytes 2998293 pkt (dropped 91657, overlimits 0 requeues
70095)
backlog 77608b 1000p requeues 70095
qdisc fq_codel 10: parent 1:1 limit 800p flows 1024 quantum 500 target
10.0ms interval 100.0ms
Sent 110555 bytes 771 pkt (dropped 0, overlimits 0 requeues 5)
backlog 0b 0p requeues 5
maxpacket 256 drop_overlimit 0 new_flow_count 2 ecn_mark 0
new_flows_len 0 old_flows_len 0
qdisc fq_codel 20: parent 1:2 limit 800p flows 1024 quantum 300 target
5.0ms interval 100.0ms ecn
Sent 2526448 bytes 17982 pkt (dropped 1, overlimits 0 requeues 31)
backlog 0b 0p requeues 31
maxpacket 929 drop_overlimit 0 new_flow_count 71 ecn_mark 0
new_flows_len 0 old_flows_len 0
qdisc fq_codel 30: parent 1:3 limit 1000p flows 1024 quantum 300
target 5.0ms interval 100.0ms ecn
Sent 15145657 bytes 106290 pkt (dropped 0, overlimits 0 requeues 179)
backlog 0b 0p requeues 179
maxpacket 256 drop_overlimit 0 new_flow_count 0 ecn_mark 0
new_flows_len 0 old_flows_len 0
qdisc fq_codel 40: parent 1:4 limit 1000p flows 1024 quantum 300
target 5.0ms interval 100.0ms
Sent 3908348422 bytes 2873250 pkt (dropped 91656, overlimits 0 requeues
69880)
backlog 77608b 1000p requeues 69880
^^^^^^^^^^^^^^^^^^^ !!!
maxpacket 1514 drop_overlimit 72128 new_flow_count 85727 ecn_mark 0
new_flows_len 238 old_flows_len 1
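Spotting the full queue in that output by eye gets tedious, so here is a small Python sketch that flags any qdisc whose packet backlog has reached its configured limit. The function names are mine, and it only understands the fields shown above (`limit`, `backlog`); the sample is an abbreviated copy of the output above, wrapped lines included.

```python
import re


def parse_tc_qdiscs(text):
    """Parse `tc -s qdisc show` output into one record per qdisc."""
    # Collapse all whitespace first so wrapped lines do not matter.
    flat = " ".join(text.split())
    records = []
    for chunk in re.split(r"(?=\bqdisc \S+ \S+: )", flat):
        head = re.match(r"qdisc (\S+) (\S+): (?:root|parent (\S+))", chunk)
        if head is None:
            continue
        limit = re.search(r"\blimit (\d+)p\b", chunk)
        backlog = re.search(r"\bbacklog (\d+)b (\d+)p\b", chunk)
        records.append({
            "kind": head.group(1),
            "handle": head.group(2),
            "parent": head.group(3),  # None for the root qdisc
            "limit_pkts": int(limit.group(1)) if limit else None,
            "backlog_pkts": int(backlog.group(2)) if backlog else 0,
        })
    return records


def full_qdiscs(text):
    """Return the qdiscs whose backlog has reached their packet limit."""
    return [r for r in parse_tc_qdiscs(text)
            if r["limit_pkts"] is not None
            and r["backlog_pkts"] >= r["limit_pkts"]]


SAMPLE = """\
qdisc mq 1: root
Sent 3926131082 bytes 2998293 pkt (dropped 91657, overlimits 0 requeues
70095)
backlog 77608b 1000p requeues 70095
qdisc fq_codel 30: parent 1:3 limit 1000p flows 1024 quantum 300
target 5.0ms interval 100.0ms ecn
Sent 15145657 bytes 106290 pkt (dropped 0, overlimits 0 requeues 179)
backlog 0b 0p requeues 179
qdisc fq_codel 40: parent 1:4 limit 1000p flows 1024 quantum 300
target 5.0ms interval 100.0ms
Sent 3908348422 bytes 2873250 pkt (dropped 91656, overlimits 0 requeues
69880)
backlog 77608b 1000p requeues 69880
"""

for q in full_qdiscs(SAMPLE):
    print("full:", q["kind"], q["handle"], "parent", q["parent"])
```

A full qdisc on its own only says the queue is saturated right now; it is the backlog staying pinned at the limit across repeated runs that matches the wedge.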
I got the “wedged” interface to work again by re-marking all tcp traffic
as “best effort”:
iptables -A FORWARD -o sw00 -t mangle -p tcp -m tcp -j DSCP --set-dscp-class be
thus moving traffic into 1:3 above.
(can probably improve on this iptables thing, but it’s just a
workaround and for all I know we can also trigger this on the be
queue)
ICMP replies, however, seem to always want to go into the background
queue for some reason. (?)
We did have this happen earlier on this run
[31325.589843] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32380.960937] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32381.035156] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32381.140625] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32381.242187] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32381.343750] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32418.824218] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32445.863281] ath: phy0: Failed to stop TX DMA, queues=0x108!
[32445.960937] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32446.062500] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32446.164062] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32446.265625] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32446.367187] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32446.472656] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32446.574218] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32446.683593] ath: phy0: Failed to stop TX DMA, queues=0x00c!
[32446.777343] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32446.886718] ath: phy0: Failed to stop TX DMA, queues=0x009!
[34701.062500] ath: phy0: Failed to stop TX DMA, queues=0x008!
[34701.140625] ath: phy0: Failed to stop TX DMA, queues=0x008!
[34701.242187] ath: phy0: Failed to stop TX DMA, queues=0x008!
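The `queues=` value in those messages looks like a bitmask. As a rough aid, the Python sketch below decodes it under two assumptions that are not confirmed in this thread: that bit n corresponds to hardware queue number n, and that the qnum-to-name mapping is the one printed by the ath9k `queues` debug file (VO 0, VI 1, BE 2, BK 3, CAB 8).

```python
import re

# Assumed mapping, taken from the ath9k "queues" debugfs output.
QUEUE_NAMES = {0: "VO", 1: "VI", 2: "BE", 3: "BK", 8: "CAB"}

DMA_RE = re.compile(r"Failed to stop TX DMA, queues=(0x[0-9a-fA-F]+)")


def decode_queue_mask(mask):
    """Return the names of the queues whose bit is set in `mask`."""
    names = []
    bit = 0
    while mask >> bit:
        if mask & (1 << bit):
            names.append(QUEUE_NAMES.get(bit, "q%d" % bit))
        bit += 1
    return names


def masks_in_dmesg(text):
    """Extract every queues= mask from dmesg text, in order of appearance."""
    return [int(m, 16) for m in DMA_RE.findall(text)]


LOG = """\
[32445.863281] ath: phy0: Failed to stop TX DMA, queues=0x108!
[32446.683593] ath: phy0: Failed to stop TX DMA, queues=0x00c!
[34701.242187] ath: phy0: Failed to stop TX DMA, queues=0x008!
"""

for mask in masks_in_dmesg(LOG):
    print(hex(mask), decode_queue_mask(mask))
```

If the assumptions hold, nearly every message above points at the BK queue, which would fit the wedged background queue.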
root@cerowrt:/sys/kernel/debug/mips# cat unaligned_instructions
1154
and we are also using a very short qlen_be and qlen_bk = 12,
and the debloat script tosses stuff on mq’s queues 1:1, 1:2, 1:3, 1:4 rather than the default and invisible mq 0:1, etc.
While saturating the be queue with a couple netperfs, I get:
root@cerowrt:/sys/kernel/debug/ieee80211/phy0/netdev:sw00/stations/00:15:6d:84:b3:00# cat rc_stats
type rate throughput ewma prob this prob retry this succ/attempt success attempts
CCK/LP 1.0M 0.7 96.3 100.0 0 0( 0) 973 1003
CCK/SP 2.0M 1.5 100.0 100.0 0 0( 0) 1 1
CCK/SP 5.5M 3.8 100.0 100.0 0 0( 0) 1 1
CCK/SP 11.0M 6.1 100.0 100.0 0 0( 0) 1 1
HT20/LGI MCS0 5.7 100.0 100.0 3 0( 0) 1 1
HT20/LGI MCS1 11.5 95.7 100.0 0 0( 0) 12 13
HT20/LGI MCS2 16.7 100.0 100.0 0 0( 0) 1 1
HT20/LGI MCS3 21.9 100.0 100.0 0 0( 0) 1 1
HT20/LGI MCS4 31.5 100.0 100.0 0 0( 0) 1 1
HT20/LGI MCS5 40.1 100.0 100.0 0 0( 0) 1 1
HT20/LGI MCS6 44.0 96.2 100.0 0 0( 0) 18 20
HT20/LGI MCS7 48.8 100.0 100.0 0 0( 0) 1 1
HT20/LGI MCS8 11.5 100.0 100.0 0 0( 0) 1 1
HT20/LGI MCS9 21.9 100.0 100.0 0 0( 0) 1 1
HT20/LGI MCS10 31.5 100.0 100.0 0 0( 0) 1 1
HT20/LGI MCS11 40.1 100.0 100.0 0 0( 0) 1 1
HT20/LGI MCS12 56.1 100.0 100.0 0 0( 0) 1 1
HT20/LGI MCS13 68.0 95.6 100.0 0 0( 0) 16 18
HT20/LGI MCS14 74.9 100.0 100.0 0 0( 0) 1 1
HT20/LGI MCS15 80.2 100.0 100.0 6 0( 0) 1 1
HT40/LGI MCS0 11.9 100.0 100.0 0 0( 0) 1 1
HT40/LGI MCS1 22.8 100.0 100.0 0 0( 0) 1 1
HT40/LGI MCS2 32.5 100.0 100.0 0 0( 0) 1 1
HT40/LGI MCS3 41.6 100.0 100.0 0 0( 0) 1 1
HT40/LGI MCS4 57.6 100.0 100.0 0 0( 0) 1 1
HT40/LGI MCS5 70.1 100.0 100.0 0 0( 0) 1 1
HT40/LGI MCS6 77.5 100.0 100.0 0 0( 0) 1 1
HT40/LGI MCS7 83.3 96.0 100.0 5 0( 0) 301 319
HT40/LGI MCS8 22.8 100.0 100.0 0 0( 0) 1 1
HT40/LGI MCS9 41.6 100.0 100.0 0 0( 0) 1 1
HT40/LGI MCS10 57.6 100.0 100.0 0 0( 0) 1 1
HT40/LGI MCS11 70.1 100.0 100.0 0 0( 0) 1 1
HT40/LGI MCS12 93.6 99.7 100.0 6 0( 0) 45 46
HT40/LGI MCS13 107.1 95.6 100.0 5 0( 0) 2743 3151
HT40/LGI t MCS14 118.4 92.6 93.4 6 172(184) 53259 64221
HT40/LGI MCS15 94.1 67.8 100.0 6 2( 2) 29077 40909
HT40/SGI MCS0 13.2 100.0 100.0 0 0( 0) 1 1
HT40/SGI MCS1 25.1 100.0 100.0 0 0( 0) 1 1
HT40/SGI MCS2 35.5 100.0 100.0 0 0( 0) 1 1
HT40/SGI MCS3 45.1 100.0 100.0 0 0( 0) 1 1
HT40/SGI MCS4 62.1 100.0 100.0 0 0( 0) 1 1
HT40/SGI MCS5 75.2 100.0 100.0 0 0( 0) 1 1
HT40/SGI MCS6 82.7 100.0 100.0 5 0( 0) 9 9
HT40/SGI P MCS7 88.5 98.5 100.0 5 0( 0) 967 1145
HT40/SGI MCS8 25.1 100.0 100.0 0 0( 0) 1 1
HT40/SGI MCS9 45.1 100.0 100.0 0 0( 0) 1 1
HT40/SGI MCS10 62.1 100.0 100.0 0 0( 0) 1 1
HT40/SGI MCS11 75.2 100.0 100.0 0 0( 0) 1 1
HT40/SGI MCS12 99.0 96.6 100.0 5 0( 0) 821 876
HT40/SGI MCS13 112.4 95.7 100.0 6 1( 1) 88413 94617
HT40/SGI MCS14 86.6 63.1 0.0 6 0( 1) 715161 805305
HT40/SGI T MCS15 122.6 84.8 100.0 6 1( 1) 98281 127685
qdisc mq 1: root
Sent 6376179748 bytes 5429375 pkt (dropped 92662, overlimits 0 requeues
98880)
backlog 0b 0p requeues 98880
qdisc fq_codel 10: parent 1:1 limit 800p flows 1024 quantum 500 target
10.0ms interval 100.0ms
Sent 115759 bytes 807 pkt (dropped 0, overlimits 0 requeues 5)
backlog 0b 0p requeues 5
maxpacket 256 drop_overlimit 0 new_flow_count 2 ecn_mark 0
new_flows_len 0 old_flows_len 0
qdisc fq_codel 20: parent 1:2 limit 800p flows 1024 quantum 300 target
5.0ms interval 100.0ms ecn
Sent 3053074 bytes 25673 pkt (dropped 1, overlimits 0 requeues 38)
backlog 0b 0p requeues 38
maxpacket 929 drop_overlimit 0 new_flow_count 73 ecn_mark 0
new_flows_len 0 old_flows_len 0
qdisc fq_codel 30: parent 1:3 limit 1000p flows 1024 quantum 300 target
5.0ms interval 100.0ms ecn
Sent 2464586793 bytes 2528666 pkt (dropped 947, overlimits 0 requeues
28957)
backlog 0b 0p requeues 28957
maxpacket 1514 drop_overlimit 0 new_flow_count 82547 ecn_mark 1
new_flows_len 0 old_flows_len 1
qdisc fq_codel 40: parent 1:4 limit 1000p flows 1024 quantum 300 target
5.0ms interval 100.0ms
Sent 3908424122 bytes 2874229 pkt (dropped 91714, overlimits 0 requeues
69880)
backlog 0b 0p requeues 69880
maxpacket 1514 drop_overlimit 72166 new_flow_count 85740 ecn_mark 0
new_flows_len 1 old_flows_len 251
MPDUs Queued: 50 2314 223 652400
MPDUs Completed: 40281 85321 8731 616194
MPDUs XRetried: 242 2644 350 36901
Aggregates: 783024 486368 202 0
AMPDUs Queued HW: 0 0 0 0
AMPDUs Queued SW: 4456600 2875990 60600 695
AMPDUs Completed: 4416029 2786190 51561 0
AMPDUs Retried: 253185 110105 1007 0
AMPDUs XRetried: 96 3990 181 0
TXERR Filtered: 70 5281 169 77
FIFO Underrun: 0 0 0 0
TXOP Exceeded: 0 0 0 0
TXTIMER Expiry: 0 0 0 0
DESC CFG Error: 0 0 0 0
DATA Underrun: 0 0 0 0
DELIM Underrun: 0 0 0 0
TX-Pkts-All: 4456648 2878145 60823 653095
TX-Bytes-All: 457325109 4000133941 7212777 113292607
HW-put-tx-buf: 334 188 128 330
HW-tx-start: 1385587 1596997 61534 653095
HW-tx-proc-desc: 1385547 1612853 61531 653067
TX-Failed: 0 0 0 0
root@cerowrt:/sys/kernel/debug/ieee80211/phy0/ath9k# cat xmit
BE BK VI VO
MPDUs Queued: 50 2314 223 652400
MPDUs Completed: 40286 85321 8731 616194
MPDUs XRetried: 242 2644 350 36901
Aggregates: 784312 486368 202 0
AMPDUs Queued HW: 0 0 0 0
AMPDUs Queued SW: 4464165 2875990 60600 695
AMPDUs Completed: 4423589 2786190 51561 0
AMPDUs Retried: 253680 110105 1007 0
AMPDUs XRetried: 96 3990 181 0
TXERR Filtered: 70 5281 169 77
FIFO Underrun: 0 0 0 0
TXOP Exceeded: 0 0 0 0
TXTIMER Expiry: 0 0 0 0
DESC CFG Error: 0 0 0 0
DATA Underrun: 0 0 0 0
DELIM Underrun: 0 0 0 0
TX-Pkts-All: 4464213 2878145 60823 653095
TX-Bytes-All: 468998281 4000133941 7212777 113292607
HW-put-tx-buf: 334 188 128 330
HW-tx-start: 1388889 1596997 61534 653095
HW-tx-proc-desc: 1388849 1612853 61531 653067
TX-Failed: 0 0 0 0
root@cerowrt:~# tc -s qdisc show dev sw10
qdisc mq 1: root
Sent 3852715919 bytes 2982888 pkt (dropped 5360, overlimits 0 requeues
55107)
backlog 99468b 1000p requeues 55107
qdisc fq_codel 10: parent 1:1 limit 800p flows 1024 quantum 500 target
10.0ms interval 100.0ms
Sent 41188 bytes 292 pkt (dropped 0, overlimits 0 requeues 1)
backlog 0b 0p requeues 1
maxpacket 256 drop_overlimit 0 new_flow_count 0 ecn_mark 0
new_flows_len 0 old_flows_len 0
qdisc fq_codel 20: parent 1:2 limit 800p flows 1024 quantum 300 target
5.0ms interval 100.0ms ecn
Sent 1792325 bytes 8919 pkt (dropped 0, overlimits 0 requeues 22)
backlog 0b 0p requeues 22
maxpacket 1514 drop_overlimit 0 new_flow_count 19 ecn_mark 0
new_flows_len 0 old_flows_len 0
qdisc fq_codel 30: parent 1:3 limit 1000p flows 1024 quantum 300 target
5.0ms interval 100.0ms ecn
Sent 1537736330 bytes 1266113 pkt (dropped 2479, overlimits 0 requeues
19919)
backlog 99468b 1000p requeues 19919
maxpacket 1514 drop_overlimit 710 new_flow_count 16535 ecn_mark 14
new_flows_len 71 old_flows_len 1
qdisc fq_codel 40: parent 1:4 limit 1000p flows 1024 quantum 300 target
5.0ms interval 100.0ms
Sent 2313146076 bytes 1707564 pkt (dropped 2881, overlimits 0 requeues
35165)
backlog 0b 0p requeues 35165
maxpacket 1514 drop_overlimit 0 new_flow_count 22111 ecn_mark 0
new_flows_len 0 old_flows_len 0
root@cerowrt:/sys/kernel/debug/ieee80211/phy1/ath9k# cat queues
(VO): qnum: 0 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(VI): qnum: 1 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BE): qnum: 2 qdepth: 0 ampdu-depth: 0 pending: 12 stopped: 1
(BK): qnum: 3 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(CAB): qnum: 8 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
1) It’s still uncertain that we have only been dealing with one wireless bug…
…but we can narrow down the one jg was seeing: if, after a failure
happens, you can log in on another radio or via ethernet and you
see frames “pending” that stay pending in
the “queues” debug file:
root@comcast-gw:~# cat /sys/kernel/debug/ieee80211/phy*/ath9k/queues
(VO): qnum: 0 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(VI): qnum: 1 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BE): qnum: 2 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BK): qnum: 3 qdepth: 0 ampdu-depth: 0 pending: 151 stopped: 0
(CAB): qnum: 8 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(VO): qnum: 0 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(VI): qnum: 1 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BE): qnum: 2 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BK): qnum: 3 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(CAB): qnum: 8 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
you’ve hit the bug.
Nothing short of a reboot will clear it, presently. Felix is looking into it.
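To make the "stays pending" check less error-prone, the Python sketch below compares two copies of the `queues` file taken some time apart. The criterion it uses (pending > 0 in both snapshots while the hardware qdepth is 0 in both) is my reading of the description above, not a confirmed signature of the bug, and the function names are illustrative.

```python
import re

QUEUE_LINE = re.compile(
    r"\((\w+)\):\s*qnum:\s*(\d+)\s+qdepth:\s*(\d+)\s+"
    r"ampdu-depth:\s*(\d+)\s+pending:\s*(\d+)\s+stopped:\s*(\d+)"
)


def parse_queues(text):
    """Return one dict per queue line, in file order (first phy first)."""
    rows = []
    for m in QUEUE_LINE.finditer(text):
        name, qnum, qdepth, ampdu, pending, stopped = m.groups()
        rows.append({
            "name": name,
            "qnum": int(qnum),
            "qdepth": int(qdepth),
            "ampdu_depth": int(ampdu),
            "pending": int(pending),
            "stopped": bool(int(stopped)),
        })
    return rows


def stuck_queues(earlier, later):
    """List (index, name, pending) for queues that look stuck in both snapshots."""
    stuck = []
    pairs = zip(parse_queues(earlier), parse_queues(later))
    for index, (a, b) in enumerate(pairs):
        if (a["pending"] > 0 and b["pending"] > 0
                and a["qdepth"] == 0 and b["qdepth"] == 0):
            stuck.append((index, b["name"], b["pending"]))
    return stuck


SNAPSHOT = """\
(VO): qnum: 0 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(VI): qnum: 1 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BE): qnum: 2 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BK): qnum: 3 qdepth: 0 ampdu-depth: 0 pending: 151 stopped: 0
(CAB): qnum: 8 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
"""

print(stuck_queues(SNAPSHOT, SNAPSHOT))
```

In practice the two arguments would be the output of the `cat` command above, captured a minute or so apart.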
In the interim there are two things you can do to make hitting it a
LOT more difficult,
at least so far, in testing 20+ hours we haven’t hit it again
A) Stop reducing qlen_be, qlen_bk, qlen_vi, & qlen_vo.
comment out line 1977 of /usr/sbin/debloat
…
local function wireless(model)
   print(model)
   if WCALLBACKS[model] ~= nil then
      -- wireless_qlen() -- comment out this call
      return WCALLBACKS[model]()
   else
      usage("AQM model not found")
   end
   return nil
end
…
and reboot.
This will return the qlen’s to very large values that are nearly
impossible to hit.
While this will have a negative effect on latency, it will improve
single station bandwidth somewhat, and make it much harder to hang the
queue. (I think/hope)
I will argue - at this point - it is better to have a slower box that
stays up for weeks than one that has core functionality crash after a
few hours or days.
Those of you that have been experiencing the wifi hangs, please make
this change,
and check in daily?
If anyone has a hang, please post the ath9k queues status as per above,
and tc -s qdisc output, to bug 442.
B) Mash incoming diffserv traffic down to BE only.
I have some patches almost ready for sqm-scripts for this, partially tested.
I’ve pushed them to the ceropackages github repository for review and testing.
see commit log message here.
https://github.com/dtaht/ceropackages-3.10/commit/27eed160a67700caae85a4c8b3fff0eaa990cd27
I am pretty sure only fix “A” is needed to work around the bug here.
See also: http://www.bufferbloat.net/issues/442#note-16
I also note that I thought I’d squashed dscp to BE in the 3.10.36-4
SQM simplest.qos AND simple.qos code, but was very tired that day and
probably missed something. Not that that helps - we managed to lock up
the BE queue last time too.
I don’t know if the number of stations matter or the number of macs
matter, or not. I will start even longer generation tests with more
stations as soon as I can, but I’m kind of wiped out right now.
———- Forwarded message ———-
From: Jim Gettys jg@freedesktop.org
Date: Fri, Apr 11, 2014 at 11:20 AM
Subject: Re: [Cerowrt-devel] cerowrt-3.10.36-4 released
To: Dave Taht dave.taht@gmail.com
Unfortunately, the bug has recurred after a day and a half.
root@cerowrt:/sys/kernel/debug/ieee80211/phy0/ath9k# cat queues
(VO): qnum: 0 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(VI): qnum: 1 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BE): qnum: 2 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BK): qnum: 3 qdepth: 0 ampdu-depth: 0 pending: 278 stopped: 1
(CAB): qnum: 8 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
// my comments in //
/* Upon failure caller should free skb */
int ath_tx_start(struct ieee80211_hw *hw, struct sk_buff *skb,
struct ath_tx_control *txctl)
{
struct ieee80211_hdr *hdr;
struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
struct ieee80211_sta *sta = txctl->sta;
struct ieee80211_vif *vif = info->control.vif;
struct ath_softc *sc = hw->priv;
struct ath_txq *txq = txctl->txq;
struct ath_atx_tid *tid = NULL;
struct ath_buf *bf;
int q;
int ret;
ret = ath_tx_prepare(hw, skb, txctl);
if (ret)
return ret;
hdr = (struct ieee80211_hdr *) skb->data;
/*
* At this point, the vif, hw_key and sta pointers in the tx control
* info are no longer valid (overwritten by the ath_frame_info data.
*/
// I haven’t looked at what skb_get_queue_mapping can return yet
q = skb_get_queue_mapping(skb);
ath_txq_lock(sc, txq);
if (txq == sc->tx.txq_map[q] &&
++txq->pending_frames > sc->tx.txq_max_pending[q] &&
!txq->stopped) {
ieee80211_stop_queue(sc->hw, q);
txq->stopped = true;
// is there a difference between stopped and sleeping?
}
// So if the queue is not mapped properly we don’t increment pending
// frames. Also we are dependent on C processing the if left to right,
// which is a good assumption, but it leaves the ++txq as a side effect
if (txctl->an &&
ieee80211_is_data_present(hdr->frame_control))
tid = ath_get_skb_tid(sc, txctl->an, skb);
if (info->flags & IEEE80211_TX_CTL_PS_RESPONSE) {
ath_txq_unlock(sc, txq);
txq = sc->tx.uapsdq;
// So here we have a bit of state that changes after we’ve got some
// pending state above that’s been changed. I imagine this lock could
// stay unlocked for a while and lead to races elsewhere.
// haven’t a clue what tx.uapsdq is
ath_txq_lock(sc, txq);
} else if (txctl->an &&
ieee80211_is_data_present(hdr->frame_control)) {
WARN_ON(tid->ac->txq != txctl->txq);
if (info->flags & IEEE80211_TX_CTL_CLEAR_PS_FILT)
tid->ac->clear_ps_filter = true;
/*
* Add this frame to software queue for scheduling later
* for aggregation.
*/
TX_STAT_INC(txq->axq_qnum, a_queued_sw);
__skb_queue_tail(&tid->buf_q, skb);
if (!txctl->an->sleeping)
ath_tx_queue_tid(txq, tid);
// so if we’re not sleeping, queue it up
// and regardless if we’re sleeping or not, schedule it
ath_txq_schedule(sc, txq);
goto out;
}
// So if data is not present OR txctl->an is invalid OR
// IEEE80211_TX_CTL_PS_RESPONSE is set in flags
// we fall through to here.
bf = ath_tx_setup_buffer(sc, txq, tid, skb);
// if we fell through to here, tid can be null unless data was present
if (!bf) {
ath_txq_skb_done(sc, txq, skb);
if (txctl->paprd)
dev_kfree_skb_any(skb);
else
ieee80211_free_txskb(sc->hw, skb);
goto out;
}
// Well, I note that we incremented the frames earlier in some cases
// should they be decremented above?
bf->bf_state.bfs_paprd = txctl->paprd;
if (txctl->paprd)
bf->bf_state.bfs_paprd_timestamp = jiffies;
ath_set_rates(vif, sta, bf);
ath_tx_send_normal(sc, txq, tid, skb);
// Not clear as to why you set_rates here, and I assume tx_send_normal
// sends a non-aggregate
out:
ath_txq_unlock(sc, txq);
return 0;
}
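To reason about the pending_frames / stopped bookkeeping questioned in the comments above, here is a toy Python model. Only `tx_start` mirrors the quoted code; the completion side (`tx_done`, which decrements and wakes the queue once the count drops below the limit) is a hypothetical counterpart that I am assuming, and the model ignores the fact that a stopped queue normally receives no new frames. It is an illustration of one way the counter could stick, not a diagnosis.

```python
class TxqModel:
    """Toy model of the per-queue pending counter and stop/wake flag."""

    def __init__(self, max_pending):
        self.max_pending = max_pending
        self.pending_frames = 0
        self.stopped = False

    def tx_start(self):
        # Mirrors: ++pending_frames > txq_max_pending && !stopped
        self.pending_frames += 1
        if self.pending_frames > self.max_pending and not self.stopped:
            self.stopped = True

    def tx_done(self):
        # Hypothetical completion path (an assumption, not quoted code).
        self.pending_frames = max(self.pending_frames - 1, 0)
        if self.stopped and self.pending_frames < self.max_pending:
            self.stopped = False


def run(frames, max_pending, lost_completions):
    """Queue `frames` frames, complete all but `lost_completions` of them."""
    q = TxqModel(max_pending)
    for _ in range(frames):
        q.tx_start()
    for _ in range(frames - lost_completions):
        q.tx_done()
    return q.pending_frames, q.stopped


# With a limit of 12 (the short qlen mentioned earlier in the thread):
print(run(20, 12, 0))   # every frame completes
print(run(20, 12, 12))  # twelve completions never arrive
```

In this model, once the number of frames that never complete reaches the limit, the counter can no longer fall below it and the queue stays stopped. That would also be consistent with a larger limit making the hang harder to hit, though the model says nothing about why completions would go missing.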
On 04/14/2014 07:16 PM, Dave Taht wrote:
>
> We have been trying to replicate a bug in seeing wifi connections hanging
> in strange ways after tons of data is transferred… for several months now.
>
> The symptoms varied, anything from multicast failing to background or best
> effort traffic failing - from local access working with remote access
> not working…
>
> Last week, we finally got a situation where we had enough debugging on to see
> something that matches the symptoms we saw, in that one of the wifi queues
> would hang and leave the overlying qdisc full of packets that didn’t drain.
Sounds familiar… I had a relatively clean patch in the 3.9 days, but had some
issues merging along the way and haven’t bothered to rebase it, so the patch is
not as clean as it used to be:
http://dmz2.candelatech.com/git/?p=linux-3.14.dev.y/.git;a=commitdiff;h=a34e34f46fbffc627dfc2d93c508f580fbaf29e2;hp=cce0d841338348c69ae6f7ef1b2bc8a6abea3fc4
http://dmz2.candelatech.com/git/?p=linux-3.14.dev.y/.git;a=commitdiff;h=3ecefa9c9f7eed21002dad7a6540d6d250297466;hp=134543c6fec7e28bf91272ce995b550b1bf73c62
I posted the patch to the mailing lists some time back, maybe a year or two ago.
If I recall, we could reproduce our problem fairly reliably by stepping an
attenuator in 10 dB steps while under load.
I’d be curious to know if you try it out and it works for you…
Thanks,
Ben
On Tue, Apr 15, 2014 at 11:47 AM, cerowrt@lists.bufferbloat.net
wrote:
>
> Issue #422 has been updated by Felix Fietkau.
>
>
> On 2014-04-15 06:06, Dave Taht wrote:
>> regrettably I am too wiped to look this over further right now, but the patchset
>> seems very promising.
>>
>> I will review on a fresh brain in the morning. Other eyeballs desired
>> - this will have to get patched on top of 3.14 and then backported to
>> the 3.10 backport….
> The patch is a rather crude workaround which unfortunately will not
> help with narrowing down the cause. Also, doing a chip reset because a
> software queue is stuck is overkill.
>
> Please test if this patch helps. The tid->paused flag is no longer
> necessary since my rework of the tx path.
> ---
> --- a/drivers/net/wireless/ath/ath9k/ath9k.h
> +++ b/drivers/net/wireless/ath/ath9k/ath9k.h
> @@ -254,7 +254,6 @@ struct ath_atx_tid {
>
> s8 bar_index;
> bool sched;
> - bool paused;
> bool active;
> };
>
> --- a/drivers/net/wireless/ath/ath9k/xmit.c
> +++ b/drivers/net/wireless/ath/ath9k/xmit.c
> @@ -107,9 +107,6 @@ static void ath_tx_queue_tid(struct ath_
> {
> struct ath_atx_ac *ac = tid->ac;
>
> - if (tid->paused)
> - return;
> -
> if (tid->sched)
> return;
>
> @@ -1407,7 +1404,6 @@ int ath_tx_aggr_start(struct ath_softc *
> ath_tx_tid_change_state(sc, txtid);
>
> txtid->active = true;
> - txtid->paused = true;
> *ssn = txtid->seq_start = txtid->seq_next;
> txtid->bar_index = -1;
>
> @@ -1427,7 +1423,6 @@ void ath_tx_aggr_stop(struct ath_softc *
>
> ath_txq_lock(sc, txq);
> txtid->active = false;
> - txtid->paused = false;
> ath_tx_flush_tid(sc, txtid);
> ath_tx_tid_change_state(sc, txtid);
> ath_txq_unlock_complete(sc, txq);
> @@ -1487,7 +1482,7 @@ void ath_tx_aggr_wakeup(struct ath_softc
> ath_txq_lock(sc, txq);
> ac->clear_ps_filter = true;
>
> - if (!tid->paused && ath_tid_has_buffered(tid)) {
> + if (ath_tid_has_buffered(tid)) {
> ath_tx_queue_tid(txq, tid);
> ath_txq_schedule(sc, txq);
> }
> @@ -1510,7 +1505,6 @@ void ath_tx_aggr_resume(struct ath_softc
> ath_txq_lock(sc, txq);
>
> tid->baw_size = IEEE80211_MIN_AMPDU_BUF << sta->ht_cap.ampdu_factor;
> - tid->paused = false;
>
> if (ath_tid_has_buffered(tid)) {
> ath_tx_queue_tid(txq, tid);
> @@ -1544,8 +1538,6 @@ void ath9k_release_buffered_frames(struc
> continue;
>
> tid = ATH_AN_2_TID(an, i);
> - if (tid->paused)
> - continue;
>
> ath_txq_lock(sc, tid->ac->txq);
> while (nframes > 0) {
> @@ -1844,9 +1836,6 @@ void ath_txq_schedule(struct ath_softc *
> list_del(&tid->list);
> tid->sched = false;
>
> - if (tid->paused)
> - continue;
> -
> if (ath_tx_sched_aggr(sc, txq, tid, &stop))
> sent = true;
>
> @@ -2698,7 +2687,6 @@ void ath_tx_node_init(struct ath_softc *
> tid->baw_size = WME_MAX_BA;
> tid->baw_head = tid->baw_tail = 0;
> tid->sched = false;
> - tid->paused = false;
> tid->active = false;
> __skb_queue_head_init(&tid->buf_q);
> __skb_queue_head_init(&tid->retry_q);
> —————————————-
> Bug #422: some dhcpv6 debugging
> https://www.bufferbloat.net/issues/422
>
> Author: David Taht
> Status: Closed
> Priority: Normal
> Assignee:
> Category:
> Target version:
>
>
> 1) dhcpv6 stuff
>
>
> The failing command is this one.
>
>
> ubus call network.interface. notify_proto '{ "action": 0, "link-up": true,
> "keep": false, "ip6prefix": [ "2001:db8:0:f00::\/56,375,600" ], "dns": [
> "fec0:0:0:1::1" ], "dns_search": [ "domain.example" ] }'
>
>
> When it should be like this
>
>
> ubus call network.interface.ge00 notify_proto '{ "action": 0, "link-up":
> true, "keep": false, "ip6prefix": [ "2001:db8:0:f00::\/56,375,600" ],
> "dns": [ "fec0:0:0:1::1" ], "dns_search": [ "domain.example" ] }'
>
>
>
> So it appears that you try to call $INTERFACE where in setup_interface,
> it’s actually “$device”…
>
>
> except that when I made that change, I still had nothing right
>
>
> however, when I called this with two args rather than one…
>
>
> odhcp6c -N try -P 60 -s /lib/netifd/dhcpv6.script ge00 ge00 &
>
>
> it did find ge00… and did the automagic prefix assignment to the other
> interfaces…
>
>
> so there’s an off-by-one error somewhere… (and odhcp6c doesn’t start,
> regardless)
>
> it also fails on exit, again lacking that interface param
>
> + ubus call network.interface. notify_proto { "action": 0, "link-up":
> false, "keep": false }
>
> Elsewhere /lib/netifd/proto/dhcpv6.sh $INTERFACE and $config seem to be
> confused
>
> proto_export "INTERFACE=$config"
>
> and that STILL didn’t fix it.
>
> hope this helps.
>
> My files
>
> 6relayd:
>
> config server default
> option master ge01 # tried ge00 too
> list network lan # tried the alias for the firewall as well as
> the actual devices and/or this not at all
> list network se00
> list network sw00
> list network sw10
> list network gw00
> list network guest # same crazy idea
> option rd server
> option dhcpv6 server
> option fallback_relay ‘rd dhcpv6 ndp’
>
> network
>
> config interface se00
> option ‘ifname’ ‘se00’
> option ‘proto’ ‘static’
> option ‘ipaddr’ ‘172.26.34.1’
> option ‘netmask’ ‘255.255.255.224’
> option ‘ip6assign’ ‘64’
>
> config interface ge00
> option ‘ifname’ ‘ge00’
> option ‘proto’ ‘dhcp’
>
> config interface ge01
> option ifname @ge00
> option proto dhcpv6
> option ‘broadcast’ ‘1’
> option ‘metric’ ‘2048’
> option ‘reqprefix’ ‘60’
>
> (the reason for the metric is that I let babel assign default gws)
>
> 2) in going through the env variables trying to figure out the “next prefix
> available” in the /etc/odhcp6c.user there’s no rollup list somewhere of the
> prefixes actually assigned to the pool of interfaces. Am trying to come up
> with the “right” way to integrate ahcp’s /128 concept
>
> 3) there doesn’t seem to be anything stopping you from running multiple
> copies of odhcpd
>
> 4) No ntp server support. My other assumption is that things like wins are
> common too, and I also use wpad…
>
>
Given that there seems to be a potential race in the code review I did at:
http://www.bufferbloat.net/issues/442#note-22
another thought is to make the increment and decrement of
txq->pending_frames atomic, or to do a flush before the unlock.
What tree is this patch against?
It and “stopped” are briefly unprotected along that code path.
> What tree is this patch against?
mac80211 from OpenWrt trunk.
Thx, will try your patch today.
- Felix
linux-3.14/drivers/net/wireless/ath/ath9k/xmit.c
ath_txq_lock(sc, txq);
if (txq == sc->tx.txq_map[q] &&
++txq->pending_frames > sc->tx.txq_max_pending[q] &&
!txq->stopped) {
ieee80211_stop_queue(sc->hw, q);
txq->stopped = true;
}
if (txctl->an &&
ieee80211_is_data_present(hdr->frame_control))
tid = ath_get_skb_tid(sc, txctl->an, skb);
if (info->flags & IEEE80211_TX_CTL_PS_RESPONSE) {
ath_txq_unlock(sc, txq);
txq = sc->tx.uapsdq;
^^
ath_txq_lock(sc, txq);
} else if (txctl->an &&
On Wed, Apr 16, 2014 at 9:55 AM, Felix Fietkau nbd@openwrt.org wrote:
> On 2014-04-16 17:34, Dave Taht wrote:
>> On Wed, Apr 16, 2014 at 6:11 AM, Felix Fietkau nbd@openwrt.org wrote:
>>> On 2014-04-15 21:00, Dave Taht wrote:
>>>> Thx felix!
>>>>
>>>> Given that there seems to be a potential race in the code
>>>> review I did at:
>>>>
>>>> http://www.bufferbloat.net/issues/442#note-22
>>>>
>>>> another thought is to make the increment and decrement of
>>>> txq->pending_frames atomic, or to do a flush before the unlock
>>> I’m not convinced that there’s a race that involves txq->pending_frames.
>>> There is no need to make the increment/decrement atomic, because that
>>> variable is already protected by the txq lock.
>>
>> It and “stopped” are briefly unprotected along that code path.
> Where?
>
> - Felix
in ath_tx_start:
ath_txq_lock(sc, txq);
if (txq == sc->tx.txq_map[q] &&
++txq->pending_frames > sc->tx.txq_max_pending[q] &&
!txq->stopped) {
ieee80211_stop_queue(sc->hw, q);
txq->stopped = true;
}
in ath_txq_skb_done:
if (txq->stopped &&
txq->pending_frames < sc->tx.txq_max_pending[q]) {
ieee80211_wake_queue(sc->hw, q);
txq->stopped = false;
}
Didn’t think it would, still thought <= was more correct.
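To make the `<` versus `<=` question concrete, here is a toy Python model of the stop/wake bookkeeping quoted above. It is only a sketch: the class and function names are made up for illustration, and it assumes the counter is decremented just before the wake check. It says nothing about locking or about what actually happened on the router.

```python
class TxQueueModel:
    """Toy model of the pending_frames / stopped bookkeeping (not driver code)."""

    def __init__(self, max_pending, wake_inclusive=False):
        self.max_pending = max_pending
        self.wake_inclusive = wake_inclusive  # True models the "<=" variant
        self.pending = 0
        self.stopped = False

    def tx_start(self):
        # Mirrors: ++pending_frames > max_pending && !stopped
        self.pending += 1
        if self.pending > self.max_pending and not self.stopped:
            self.stopped = True

    def skb_done(self):
        # Assumption: the counter is decremented before the wake check.
        self.pending -= 1
        if self.wake_inclusive:
            below = self.pending <= self.max_pending
        else:
            below = self.pending < self.max_pending
        if self.stopped and below:
            self.stopped = False


def state_after_one_completion(max_pending, wake_inclusive):
    """Overfill the queue by one frame, complete one frame, report the state."""
    q = TxQueueModel(max_pending, wake_inclusive)
    for _ in range(max_pending + 1):
        q.tx_start()   # the last call pushes pending past the limit
    q.skb_done()       # one frame completes, so pending == max_pending
    return q.pending, q.stopped


print(state_after_one_completion(12, wake_inclusive=False))
print(state_after_one_completion(12, wake_inclusive=True))
```

In this model a queue sitting exactly at the limit stays stopped with `<` and is woken with `<=`. That only illustrates the difference between the two comparisons; it does not show that the comparison is the cause of the hang.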
By the way, did you test my patch?
It is in the as yet untested 3.10.36-6 build, along with resetting qlen
down to 12 again to try to trigger the bug sooner.
http://snapon.lab.bufferbloat.net/~cero2/cerowrt/wndr/3.10.36-6/
> in ath_tx_start:
>
> ath_txq_lock(sc, txq);
> if (txq == sc->tx.txq_map[q] &&
> ++txq->pending_frames > sc->tx.txq_max_pending[q] &&
> !txq->stopped) {
> ieee80211_stop_queue(sc->hw, q);
> txq->stopped = true;
> }
>
> in ath_txq_skb_done:
>
> if (txq->stopped &&
> txq->pending_frames < sc->tx.txq_max_pending[q]) {
> ieee80211_wake_queue(sc->hw, q);
> txq->stopped = false;
> }
>
>
root@cerowrt:~# cat /sys/kernel/debug/ieee80211/phy*/ath9k/queues
(VO): qnum: 0 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(VI): qnum: 1 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BE): qnum: 2 qdepth: 0 ampdu-depth: 0 pending: 12 stopped: 1
(BK): qnum: 3 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(CAB): qnum: 8 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(VO): qnum: 0 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(VI): qnum: 1 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BE): qnum: 2 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BK): qnum: 3 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(CAB): qnum: 8 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
running 3.10.38-1. 2.4ghz hung.
- Jim
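For anyone collecting these dumps, a small Python sketch that picks the suspicious line out of the `queues` output. The rule used here (stopped, frames pending, nothing queued in hardware) is an assumption drawn from the dump above, not something the driver defines, and the function name is made up.

```python
import re

# One line of the debugfs "queues" file, in the format shown in this thread.
LINE_RE = re.compile(
    r"\((\w+)\):\s+qnum:\s+(\d+)\s+qdepth:\s+(\d+)\s+ampdu-depth:\s+(\d+)"
    r"\s+pending:\s+(\d+)\s+stopped:\s+(\d+)"
)


def find_stuck_queues(text):
    """Return (name, pending) for queues that are stopped with nothing in hardware."""
    stuck = []
    for m in LINE_RE.finditer(text):
        name = m.group(1)
        qdepth, ampdu, pending, stopped = (int(m.group(i)) for i in (3, 4, 5, 6))
        # Heuristic (assumption): stopped, frames accounted as pending,
        # but no frames sitting in the hardware queue.
        if stopped and pending > 0 and qdepth == 0 and ampdu == 0:
            stuck.append((name, pending))
    return stuck


SAMPLE = """\
(VO): qnum: 0 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(VI): qnum: 1 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BE): qnum: 2 qdepth: 0 ampdu-depth: 0 pending: 12 stopped: 1
(BK): qnum: 3 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(CAB): qnum: 8 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
"""

print(find_stuck_queues(SAMPLE))
```

On the first radio's dump this flags only the BE queue, which matches the hung 2.4ghz interface reported above.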
Felix, the patch you’d given me to try - did that make it upstream? It stopped applying to my code and I’d dropped it; I can try to update it… but I’m seeing signs the problem is higher in the stack.
Last night I downloaded and installed openwrt head onto an archer C7 v2 platform, and in about 4 hours got the BK and VI queues to fail using the rrul test, on a WPA2 psk misc enabled system, no fiddling with qlens. The BE queue is fine. So, now I’ve pretty much ruled out cerowrt’s hardware, and build, as the cause of the problem, and it seems like it is universal to the ath9k and/or openwrt. Some of what I see here might mean it’s not an ath9k problem either!
DISTRIB_ID=“OpenWrt”
DISTRIB_RELEASE=“Bleeding Edge”
DISTRIB_REVISION=“r40755”
DISTRIB_CODENAME=“barrier_breaker”
DISTRIB_TARGET=“ar71xx/generic”
DISTRIB_DESCRIPTION=“OpenWrt Barrier Breaker r40755”
DISTRIB_TAINTS=“”
I’ve finally got enough hardware up and the monitoring interface figured out enough to capture and decrypt packets in the air, but didn’t do that last night.
Anyway, this failure looks like this - BK queue is hosed, BE is not, netperf negotiates a connection, then netperf flips the tos bit and no data comes through:
d@ida:~/public_html/archer/overnight$ netperf -Y CS1,CS1 -H 172.21.0.1
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.21.0.1 () port 0 AF_INET : demo
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 16384 16384 10.00 0.00
To get here I ran the rrul test over and over (which exercises each queue using CS0, CS1, CS5, and EF markings). The data files are in http://snapon.lab.bufferbloat.net/~d/archer/overnight
http://snapon.lab.bufferbloat.net/~d/archer/overnight/normality.png # random sample from earlier in the night
http://snapon.lab.bufferbloat.net/~d/archer/overnight/normality2.png # shortly before it went boom
http://snapon.lab.bufferbloat.net/~d/archer/overnight/bye_vi_vo_queue.png # vi and vo go away
http://snapon.lab.bufferbloat.net/~d/archer/overnight/bye_bk_queue.png # bk queue goes away too
It is kind of interesting that the failures started happening just as people were waking up and getting on the internet (6am), so I will return to testing with more interference on the link….
There is no info in queues
root@OpenWrt:/sys/kernel/debug/ieee80211/phy1/ath9k# cat queues
(VO): qnum: 0 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(VI): qnum: 1 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BE): qnum: 2 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BK): qnum: 3 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(CAB): qnum: 8 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
These errors were logged long before the failure:
[ 593.440000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 635.940000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 648.130000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 1188.800000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 1626.470000] ath: phy1: Failed to stop TX DMA, queues=0x00e!
[ 1748.010000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 1766.240000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 2909.640000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 3104.710000] ath: phy1: Failed to stop TX DMA, queues=0x004!
[ 3431.860000] Failed to load ipt action
[ 3431.950000] netem: version 1.3
[ 3555.790000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 3561.930000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 3586.300000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 3671.600000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 3756.900000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 3817.930000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 4189.750000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 4201.940000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 4909.110000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 4933.480000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 4939.520000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 5037.110000] ath: phy1: Failed to stop TX DMA, queues=0x004!
[ 5915.000000] ath: phy1: Failed to stop TX DMA, queues=0x00d!
[ 6152.780000] ath: phy1: Failed to stop TX DMA, queues=0x00f!
[ 6644.290000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 6668.560000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 6729.280000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 6735.420000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 6923.430000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 7882.410000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 8908.970000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 8921.060000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 9036.560000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[ 9097.290000] ath: phy1: Failed to stop TX DMA, queues=0x005!
[31969.060000] ath: phy1: Failed to stop TX DMA, queues=0x005!
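A small Python sketch for reading the `queues=` values in these messages. It assumes each set bit N stands for hardware queue number N, and it takes the number-to-name mapping from the `qnum` column of the debugfs `queues` dump earlier in this thread. Both points should be checked against the driver source before relying on them; the function name is made up.

```python
# Queue numbers as printed by the debugfs "queues" file in this thread.
QNUM_TO_NAME = {0: "VO", 1: "VI", 2: "BE", 3: "BK", 8: "CAB"}


def decode_queue_mask(mask):
    """Return names of queues whose bit is set (assumption: bit N = qnum N)."""
    names = []
    bit = 0
    while mask >> bit:
        if mask & (1 << bit):
            # Queue numbers without a known name are shown as q<N>.
            names.append(QNUM_TO_NAME.get(bit, "q%d" % bit))
        bit += 1
    return names


# The distinct masks that appear in the log above.
for mask in (0x004, 0x005, 0x00d, 0x00e, 0x00f):
    print("0x%03x -> %s" % (mask, ", ".join(decode_queue_mask(mask))))
```

Under that assumption the common `0x005` value would mean the VO and BE queues failed to stop, and `0x00f` would mean all four data queues did.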
Trying to send anything marked CS1. You’d think it would be trying, but
aggregates or tx bytes don’t budge.
root@OpenWrt:/sys/kernel/debug/ieee80211/phy1/ath9k# cat xmit
BE BK VI VO
MPDUs Queued: 5 0 1042 24803
MPDUs Completed: 1541 2153 4316 40355
MPDUs XRetried: 1 2 33 60
Aggregates: 2309316 80520 1015730 0
AMPDUs Queued HW: 0 0 0 0
AMPDUs Queued SW: 53325311 637157 18719054 15612
AMPDUs Completed: 53323234 634851 18696242 0
AMPDUs Retried: 823483 14659 556819 0
AMPDUs XRetried: 524 151 19404 0
TXERR Filtered: 189 42 239 2
FIFO Underrun: 0 0 0 1
TXOP Exceeded: 0 0 0 0
TXTIMER Expiry: 0 0 0 0
DESC CFG Error: 0 0 0 0
DATA Underrun: 0 0 0 0
DELIM Underrun: 0 0 0 0
TX-Pkts-All: 53325300 637157 18719995 40415
TX-Bytes-All: 270739492 1785710593602584059 6533793
HW-put-tx-buf: 3442803 225830 1447392 40415
HW-tx-start: 0 0 0 0
HW-tx-proc-desc: 3441390 224351 1446571 40329
TX-Failed: 0 0 0 0
root@OpenWrt:/sys/kernel/debug/ieee80211/phy1/ath9k# killall netserver
root@OpenWrt:/sys/kernel/debug/ieee80211/phy1/ath9k# netserver
Starting netserver with host 'IN(6)ADDR_ANY' port '12865' and family AF_UNSPEC
root@OpenWrt:/sys/kernel/debug/ieee80211/phy1/ath9k# cat xmit
BE BK VI VO
MPDUs Queued: 5 0 1042 24868
MPDUs Completed: 1541 2153 4316 40421
MPDUs XRetried: 1 2 33 60
Aggregates: 2309316 80520 1015730 0
AMPDUs Queued HW: 0 0 0 0
AMPDUs Queued SW: 53325413 637157 18719054 15613
AMPDUs Completed: 53323336 634851 18696242 0
AMPDUs Retried: 823483 14659 556819 0
AMPDUs XRetried: 524 151 19404 0
TXERR Filtered: 189 42 239 2
FIFO Underrun: 0 0 0 1
TXOP Exceeded: 0 0 0 0
TXTIMER Expiry: 0 0 0 0
DESC CFG Error: 0 0 0 0
DATA Underrun: 0 0 0 0
DELIM Underrun: 0 0 0 0
TX-Pkts-All: 53325402 637157 18719995 40481
TX-Bytes-All: 270762029 1785710593602584059 6547198
HW-put-tx-buf: 3442905 225830 1447392 40481
HW-tx-start: 0 0 0 0
HW-tx-proc-desc: 3441492 224351 1446571 40395
TX-Failed: 0 0 0 0
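Comparing two `xmit` dumps by eye is error-prone, so here is a hedged Python sketch that reports which counters moved between snapshots. The function names are invented for this example, and only a few rows from the two dumps above are included as sample data. Rows that do not have exactly four numeric columns, such as the `TX-Bytes-All` row where two columns have run together, are skipped rather than guessed at.

```python
QUEUES = ("BE", "BK", "VI", "VO")


def parse_xmit(text):
    """Map counter name -> {queue: value}; rows without four numeric columns are skipped."""
    rows = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and len(fields) == len(QUEUES) and all(f.isdigit() for f in fields):
            rows[name.strip()] = dict(zip(QUEUES, map(int, fields)))
    return rows


def changed_counters(before, after):
    """Return {queue: {counter: delta}} for counters that moved between snapshots."""
    a, b = parse_xmit(before), parse_xmit(after)
    moved = {q: {} for q in QUEUES}
    for name in a:
        if name not in b:
            continue
        for q in QUEUES:
            delta = b[name][q] - a[name][q]
            if delta:
                moved[q][name] = delta
    return moved


BEFORE = """\
BE BK VI VO
MPDUs Queued: 5 0 1042 24803
AMPDUs Queued SW: 53325311 637157 18719054 15612
TX-Pkts-All: 53325300 637157 18719995 40415
TX-Bytes-All: 270739492 1785710593602584059 6533793
"""

AFTER = """\
BE BK VI VO
MPDUs Queued: 5 0 1042 24868
AMPDUs Queued SW: 53325413 637157 18719054 15613
TX-Pkts-All: 53325402 637157 18719995 40481
TX-Bytes-All: 270762029 1785710593602584059 6547198
"""

moved = changed_counters(BEFORE, AFTER)
for queue in QUEUES:
    print(queue, moved[queue])
```

For the rows included here, BE and VO counters advance while BK and VI show no change at all, which is the "don’t budge" observation above.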
And we don’t show any packets attempting to enter the bk queue (1:4)
either. (same test as above). Deleting and recreating the qdisc
doesn’t work either.
(I note that I am trying huge targets and intervals with some success with the
longer qlens….)
root@OpenWrt:/sys/kernel/debug/ieee80211/phy1/ath9k# tc -s qdisc show dev wlan1
qdisc mq 1: root
Sent 42609841871 bytes 63704518 pkt (dropped 123636, overlimits 0 requeues 291608)
backlog 0b 0p requeues 291608
qdisc fq_codel 803d: parent 1:1 limit 1024p flows 1024 quantum 1514 target 30.0ms interval 300.0ms
Sent 1044525 bytes 15565 pkt (dropped 0, overlimits 0 requeues 408)
backlog 0b 0p requeues 408
maxpacket 256 drop_overlimit 0 new_flow_count 201 ecn_mark 0
new_flows_len 0 old_flows_len 0
qdisc fq_codel 803e: parent 1:2 limit 1024p flows 1024 quantum 1514 target 30.0ms interval 300.0ms ecn
Sent 10914015954 bytes 17822230 pkt (dropped 15479, overlimits 0 requeues 77502)
backlog 0b 0p requeues 77502
maxpacket 1514 drop_overlimit 0 new_flow_count 223799 ecn_mark 0
new_flows_len 0 old_flows_len 0
qdisc fq_codel 803f: parent 1:3 limit 1024p flows 1024 quantum 1514 target 30.0ms interval 300.0ms ecn
Sent 31540575052 bytes 45257237 pkt (dropped 108091, overlimits 0 requeues 213419)
backlog 0b 0p requeues 213419
maxpacket 1514 drop_overlimit 0 new_flow_count 1771734 ecn_mark 2325
new_flows_len 0 old_flows_len 0
qdisc fq_codel 8040: parent 1:4 limit 1024p flows 1024 quantum 1514 target 30.0ms interval 300.0ms
Sent 154206340 bytes 609486 pkt (dropped 66, overlimits 0 requeues 279)
backlog 0b 0p requeues 279
maxpacket 1514 drop_overlimit 0 new_flow_count 255 ecn_mark 0
new_flows_len 0 old_flows_len 0
So, like, I wipe out that qdisc… and try exercising the CS1 or CS5 (BK or VI) queues to no effect
d@ida:~/public_html$ netperf -Y CS1,CS1 -H 172.21.18.1 -t TCP_MAERTS
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.21.18.1 () port 0 AF_INET : demo
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 16384 16384 10.00 0.00
d@ida:~/public_html$ netperf -Y CS5,CS5 -H 172.21.18.1 -t TCP_MAERTS
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.21.18.1 () port 0 AF_INET : demo
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 16384 16384 10.00 0.00
root@OpenWrt:/sys/kernel/debug/ieee80211/phy1/ath9k# tc -s qdisc show dev wlan1
qdisc mq 1: root
Sent 33368 bytes 183 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc sfq 8041: parent 1:1 limit 127p quantum 1514b depth 127 divisor 1024
Sent 145 bytes 1 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc sfq 8042: parent 1:2 limit 127p quantum 1514b depth 127 divisor 1024
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc sfq 8043: parent 1:3 limit 127p quantum 1514b depth 127 divisor 1024
Sent 33223 bytes 182 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc sfq 8044: parent 1:4 limit 127p quantum 1514b depth 127 divisor 1024
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
d@ida:~/public_html$ netperf -Y BE,BE -H 172.21.18.1 -t TCP_MAERTS
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.21.18.1 () port 0 AF_INET : demo
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 16384 16384 10.00 55.96
d@ida:~/public_html$ netperf -Y CS1,CS1 -H 172.21.18.1 -t TCP_MAERTS
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.21.18.1 () port 0 AF_INET : demo
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 16384 16384 10.00 0.00
root@OpenWrt:/sys/kernel/debug/ieee80211/phy1/ath9k# tc -s qdisc show dev wlan1
qdisc mq 1: root
Sent 73651767 bytes 48745 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc fq_codel 8046: parent 1:1 limit 1024p flows 1024 quantum 1514 target 30.0ms interval 300.0ms
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
maxpacket 256 drop_overlimit 0 new_flow_count 0 ecn_mark 0
new_flows_len 0 old_flows_len 0
qdisc fq_codel 8047: parent 1:2 limit 1024p flows 1024 quantum 1514 target 30.0ms interval 300.0ms ecn
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
maxpacket 256 drop_overlimit 0 new_flow_count 0 ecn_mark 0
new_flows_len 0 old_flows_len 0
qdisc fq_codel 8048: parent 1:3 limit 1024p flows 1024 quantum 1514 target 30.0ms interval 300.0ms ecn
Sent 73651767 bytes 48745 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
maxpacket 256 drop_overlimit 0 new_flow_count 0 ecn_mark 0
new_flows_len 0 old_flows_len 0
qdisc fq_codel 8049: parent 1:4 limit 1024p flows 1024 quantum 1514 target 30.0ms interval 300.0ms
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
maxpacket 256 drop_overlimit 0 new_flow_count 0 ecn_mark 0
new_flows_len 0 old_flows_len 0
doing some captures now
Now, syn attempts marked this way usually work (and I can get the same behavior with udp). The checksum appears correct, as well. And here I have a case where it’s my client blowing up, not necessarily the router. So I’m going to reboot the client….
Anyway the netperf transaction fails at the 27th packet in the local2 capture, and is not received.
I will add another client with a different chipset to try to blow that up from that. I am resuming beating the archer up, this time with both ipv4 and ipv6, from this client.
03:00.0 Network controller: Intel Corporation PRO/Wireless 5100 AGN [Shiloh] Network Connection
I am chasing possibly 3 separate bugs here.
booting up a couple more boxes now…
———- Forwarded message ———-
From: Jim Gettys jg@freedesktop.org
Date: Thu, Jun 5, 2014 at 9:19 AM
Subject: turning off crypto didn’t help.
To: Dave Taht dave.taht@gmail.com
The 2.4 ghz interface hung again last night….
[210237.781250] ------------[ cut here ]------------
[210237.789062] WARNING: at /build/cero2/src/cerowrt-3.10/build_dir/target-mips_34kc_uClibc-0.9.33.2/linux-ar71xx_generic/compat-wireless-2014-05-22/net/mac80211/rx.c:3372 ieee80211_rx+0x13c/0x7f8 [mac80211]()
[210237.804687] Rate marked as an HT rate but passed status->rate_idx is not an MCS index [0-76]: 92 (0x5c)
[210237.816406] Modules linked in: ath9k ath9k_htc ath9k_common
iptable_nat ath9k_hw ath pppoe nf_nat_ipv4 nf_conntrack_ipv4
mac80211 cfg80211 xt_u32 xt_time xt_tcpudp xt_tcpmss xt_string
xt_statistic xt_state xt_recent xt_quota xt_pkttype xt_physdev
xt_owner xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length
xt_hl xt_helper xt_hashlimit xt_ecn xt_dscp xt_conntrack
xt_connmark xt_connlimit xt_connbytes xt_comment xt_addrtype
xt_TCPMSS xt_REDIRECT xt_LOG xt_IPMARK xt_HL xt_DSCP xt_CT
xt_CLASSIFY usbnet ts_kmp ts_fsm ts_bm pptp pppox ppp_async
nf_nat_irc nf_nat_ftp nf_defrag_ipv4 nf_conntrack_netlink
nf_conntrack_irc nf_conntrack_ftp iptable_raw iptable_mangle
iptable_filter ipt_REJECT ipt_MASQUERADE ipt_ECN ip_tables
crc_ccitt compat_xtables compat sch_teql sch_tbf sch_sfq sch_red
sch_qfq sch_prio sch_pie sch_ns2_codel sch_nfq_codel sch_netem
sch_htb sch_gred sch_efq_codel sch_dsmark sch_codel em_text
em_nbyte em_meta em_cmp cls_basic act_police act_ipt act_skbedit
act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw
sch_hfsc sch_ingress leds_wndr3700_usb ledtrig_usbdev xt_set
ip_set_list_set ip_set_hash_netport ip_set_hash_netiface
ip_set_hash_net ip_set_hash_ipportnet ip_set_hash_ipportip
ip_set_hash_ipport ip_set_hash_ip ip_set_bitmap_port
ip_set_bitmap_ipmac ip_set_bitmap_ip ip_set nfnetlink sr_mod
cdrom ip6t_NPT ip6t_MASQUERADE ip6table_nat nf_nat_ipv6 nf_nat
ip6t_REJECT ip6table_raw ip6table_mangle ip6table_filter ip6_tables
x_tables nf_conntrack_ipv6 nf_conntrack nf_defrag_ipv6 pppoatm
ppp_generic slhc ip_gre gre ifb nat46 sit ipip ip6_tunnel tunnel6
tunnel4 ip_tunnel tun vfat fat autofs4 br2684 atm nls_iso8859_2
nls_iso8859_15 nls_iso8859_13 nls_iso8859_1 nls_cp437 ipv6
authenc aead arc4 crypto_blkcipher usb_storage ohci_hcd
ehci_platform ehci_hcd sd_mod scsi_mod gpio_button_hotplug ext4
crc16 jbd2 mbcache usbcore nls_base usb_common crypto_hash
[210237.980468] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 3.10.44 #1
[210237.988281] Stack : 00000000 00000000 00000000 00000000 803a2eba 00000036 87828a58 86a1dd70
[210237.988281] 802f19d8 803437bb 00000003 803a2664 87828a58 86a1dd70 00000065 85c96618
[210237.988281] 803c0000 80079704 00000003 80077184 868cb40c 86a1dd70 802f32ac 87841c64
[210237.988281] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[210237.988281] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 87841bf0
[210237.988281] …
[210238.023437] Call Trace:
[210238.027343] [<8006e52c>] show_stack+0x48/0x70
[210238.031250] [<80077280>] warn_slowpath_common+0x78/0xa8
[210238.035156] [<800772dc>] warn_slowpath_fmt+0x2c/0x38
[210238.042968] [<8689f278>] ieee80211_rx+0x13c/0x7f8 [mac80211]
[210238.046875] [<86966410>] ath_rx_tasklet+0x96c/0x9b8 [ath9k]
[210238.050781] [<86963cd0>] ath9k_tasklet+0x1ac/0x230 [ath9k]
[210238.058593] [<8007ea04>] tasklet_action+0x84/0xcc
[210238.062500] [<8007e200>] __do_softirq+0xd0/0x1bc
[210238.066406] [<8007e318>] run_ksoftirqd+0x2c/0x58
[210238.074218] [<8009a7c4>] smpboot_thread_fn+0x134/0x164
[210238.078125] [<800938a8>] kthread+0xb0/0xb8
[210238.082031] [<80060878>] ret_from_kernel_thread+0x14/0x1c
[210238.085937]
[210238.089843] ---[ end trace 57a62c568dfcbfb7 ]---
root@davedesk:~#
Hello
I am a general user running the netgear router with cerowrt. I have the
latest build version, 3.10.44-6
(http://snapon.lab.bufferbloat.net/%7Ecero2/cerowrt/wndr/3.10.44-6/).
The wifi drops sometimes during the day and I have to manually turn off the
router and turn it on again. I noticed that when I do heavy video streaming the
wifi will crash. Is there a stable version that I can downgrade to? Please let
me know.
thanks
With a lot of debugging help from Antonio Quartulli, I believe I finally
found and fixed the cause of this bug.
When aggregation sessions are set up and torn down frequently, the
driver queue can end up with frames marked for A-MPDU while an
aggregation session is not active (and often cannot be established
anymore).
I committed the fix to it in r41815 (also sent to linux-wireless@).
Please test.
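The condition described above can be pictured with a toy model. This is only an illustrative sketch, not the ath9k code and not the fix in r41815: the `Frame` class, the `stale_ampdu_frames` helper, and the idea of tracking active sessions as a set of TIDs are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    """Hypothetical stand-in for a queued frame (not a driver structure)."""
    tid: int      # traffic identifier the frame belongs to
    ampdu: bool   # True if the frame was marked for A-MPDU when it was queued


def stale_ampdu_frames(queue, active_tids):
    """Return frames still marked for A-MPDU whose TID has no active session."""
    return [f for f in queue if f.ampdu and f.tid not in active_tids]


# Toy scenario: a session for TID 0 exists while frames are queued...
active = {0}
queue = [Frame(tid=0, ampdu=True), Frame(tid=0, ampdu=True), Frame(tid=1, ampdu=False)]

# ...and is torn down before the queue has drained.
active.discard(0)

stale = stale_ampdu_frames(queue, active)
print(f"{len(stale)} frame(s) marked for A-MPDU without an active session")
```

In this model the two TID 0 frames are left in exactly the state the message describes: flagged for aggregation with no session to aggregate them under.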
I had long suspected we were actually seeing several bugs masquerading as one.
new hope:
On Jul 24, 2014, at 07:48 , Dave Taht dave.taht@gmail.com wrote:
So far as I have heard the latest build IS more stable than anything
prior under conditions of low signal strength or OSX. (how’s everyone
doing this week?)
Still all fine (but uptime is 5 days; the last time I needed ~20 days for the queue to get stuck). Also I note that for me the problem typically develops on the 2.4GHz radio, which only hosts an old nexus7 and two nexus 4 (nexi?). The macbook and macbook pro on the 5GHz radio seem rather stable. I guess what I want to say is that the macs might be good at flushing out this issue, but it is not a mac-only issue ;) I only tested under IPv4, but this seems to work well already…
I had long suspected we were actually seeing several bugs masquerading as one.
new hope:
https://dev.openwrt.org/changeset/41815
So, I will wait for a fortnight of uptime, before switching to a potential newer cerowrt version, just to see whether I can break 3.10.48-2…
Best Regards
Sebastian
How is everyone else doing?
Are we allowed to feel joy and relief at finally having a reasonably
stable release yet?
What is the output of:
cat /sys/kernel/debug/ieee80211/phy0/ath9k/queues
and
cat /sys/kernel/debug/ieee80211/phy1/ath9k/queues
when it gets stuck? I wonder whether you are actually seeing the same bug as #442 or some other bug. (I think the current theory is that a number of bugs contributed to the symptoms we described as #442, and now we have to tease them apart one by one.)
Best Regards
Sebastian
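For anyone who wants to grab both debugfs files in one go when the hang occurs, a minimal sketch follows. It assumes Python is available on the machine where it runs (a stock CeroWrt image may not ship it) and that debugfs is mounted at /sys/kernel/debug; the function name `collect_ath9k_queues` is made up for this example.

```python
from pathlib import Path

# Assumed debugfs location; matches the two paths quoted in the message above.
DEBUGFS_BASE = Path("/sys/kernel/debug/ieee80211")


def collect_ath9k_queues(base=DEBUGFS_BASE, phys=("phy0", "phy1")):
    """Read <base>/<phy>/ath9k/queues for each radio and return {phy: text}."""
    report = {}
    for phy in phys:
        path = Path(base) / phy / "ath9k" / "queues"
        try:
            report[phy] = path.read_text()
        except OSError as exc:
            # Keep going so that one missing radio does not hide the other.
            report[phy] = f"<unreadable: {exc}>"
    return report


if __name__ == "__main__":
    for phy, text in collect_ath9k_queues().items():
        print(f"== {phy} ==")
        print(text)
```

Pasting the printed output for both radios into a report gives the same information as the two `cat` commands.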
On Aug 16, 2014, at 21:07 , R. redag2@gmail.com wrote:
Had to move my client devices from WPA2 to an open AP, as I was getting daily failures. I did not experience the stability that you talk of. :(
I also have to reboot manually at least once a week because the dhcp server fails.
On Aug 16, 2014 2:55 PM, “Daniel Ezell” dezell@stonescry.com wrote:
No drops here since installing. Looks great to me.
Daniel
On Aug 16, 2014 11:23 AM, “Dave Taht” dave.taht@gmail.com wrote:
I am told that several sites that had severe problems with wifi
hanging have now been up for over 2 weeks, without problems,
with the 3.10.50-1 release of cerowrt.
How is everyone else doing?
Are we allowed to feel joy and relief at finally having a reasonably
stable release yet?
I will try to keep an eye on it and report back. You know what would be
really useful? A breakdown of all the useful logs that one would need
to provide when reporting a bug.
Even more user-friendly would be a script that generates all relevant
information/logs for debugging. Perhaps one day? :)
Check the cerostats.sh script that’s in /usr/lib/CeroWrtScripts for recent CeroWrt builds. That collects a number of interesting stats and puts them in /tmp/cerostats_output.txt
Rich
root@YSWiFi:/# [ 437.230000] ath: phy0: Failed to stop TX DMA, queues=0x005!
[ 489.680000] ath: phy0: Failed to stop TX DMA, queues=0x005!
[ 495.910000] ath: phy0: Failed to stop TX DMA, queues=0x005!
[ 515.950000] ath: phy0: Failed to stop TX DMA, queues=0x005!
[ 529.190000] ath: phy0: Failed to stop TX DMA, queues=0x005!
[ 530.000000] ath: phy0: Failed to stop TX DMA, queues=0x004!
[ 639.230000] ath: phy0: Failed to stop TX DMA, queues=0x004!
[ 828.800000] ath: phy0: Failed to stop TX DMA, queues=0x005!
[ 849.440000] ath: phy0: Failed to stop TX DMA, queues=0x005!
[ 856.640000] ath: phy0: Failed to stop TX DMA, queues=0x004!
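A log like the one above can be tallied with a short script on any machine with Python. This is a rough sketch under one assumption that the log itself does not confirm: that the `queues=` value is a bitmask with one bit per hardware TX queue. The names `TX_DMA_RE`, `decode_queue_mask`, and `summarize` are invented for the example, and the sample lines are copied from the log above.

```python
import re
from collections import Counter

# Matches lines such as:
# [ 437.230000] ath: phy0: Failed to stop TX DMA, queues=0x005!
TX_DMA_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+ath: (?P<phy>phy\d+): "
    r"Failed to stop TX DMA, queues=0x(?P<mask>[0-9a-fA-F]+)!"
)


def decode_queue_mask(mask):
    """Return the bit positions set in mask, lowest first (assumed queue numbers)."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def summarize(log_text):
    """Count occurrences per (phy, mask) pair in a dmesg excerpt."""
    counts = Counter()
    for m in TX_DMA_RE.finditer(log_text):
        counts[(m["phy"], int(m["mask"], 16))] += 1
    return counts


SAMPLE = """\
[ 437.230000] ath: phy0: Failed to stop TX DMA, queues=0x005!
[ 530.000000] ath: phy0: Failed to stop TX DMA, queues=0x004!
[ 639.230000] ath: phy0: Failed to stop TX DMA, queues=0x004!
"""

for (phy, mask), n in sorted(summarize(SAMPLE).items()):
    print(phy, hex(mask), "bits set:", decode_queue_mask(mask), "count:", n)
```

Under the bitmask assumption, 0x005 would mean bits 0 and 2 and 0x004 bit 2 only. Which traffic class each bit corresponds to is not stated in this thread.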
From: Dave Taht dave.taht@gmail.com
Date: Wed, Apr 2, 2014 at 7:43 PM
Subject: Re: [Cerowrt-devel] cerowrt-3.10.34-4 dev build released
To: Stephen Hemminger stephen@networkplumber.org
Cc: “cerowrt-devel@lists.bufferbloat.net” cerowrt-devel@lists.bufferbloat.net
I am far from convinced it is actually a wifi bug. It could
be something going wrong with routing, firewalling, nat, or something
else entirely. I have several captures of sw00 and ge00 taken after
the event occurs, and local udp, arp, icmp and icmpv6 traffic is
working correctly. As is multicast.
The other device (sw10) stays running…
What I see in the captures I have is that syn attempts from the sw00
interface do make it to the internet, and syn/ack replies do return
through ge00, but do not make it through sw00. However I don’t see ANY
local syn attempts in the capture I have: jg or someone needs to try a
local tcp connection to a local device, or through the local router to
a local ethernet device, after having it hang… (I will keep trying to
reproduce here)
tcp.flags == 0x0002
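The local TCP test asked for above could be run from a client with something like the following sketch. It only attempts a handshake and reports what happened; the function name `tcp_probe` is made up here, and the address and port in the usage comment are placeholders, not values taken from this thread.

```python
import socket


def tcp_probe(host, port, timeout=3.0):
    """Attempt a TCP handshake and report 'connected', 'refused', 'timeout' or 'error'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"   # full handshake completed
    except ConnectionRefusedError:
        return "refused"         # a RST came back, so packets flow both ways
    except socket.timeout:
        return "timeout"         # no answer to the SYN within the timeout
    except OSError:
        return "error"           # e.g. no route to host


# Example use (placeholder address and port; substitute a local device):
# print(tcp_probe("192.0.2.10", 22))
```

A "timeout" result for a local device during the hang, together with "connected" or "refused" when things are healthy, would suggest that local TCP is affected as well. The SYN sent by the probe should also show up in a capture under the display filter quoted above, which selects packets with only the SYN flag set.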
On Wed, Apr 2, 2014 at 6:48 PM, Stephen Hemminger
stephen@networkplumber.org wrote:
> I am seeing wireless hang as well.
> Mostly when multiple macbooks are active on 2.4g