After about 200 seconds of heavy load, the router resets. Also, in the
general case it becomes hard to get a connection to the webserver for
(for example) streaming audio, ssh won’t start due to the loadavg
(have to fix that in xinetd), dns starts acting up, and babeld
stops transmitting routes.
Now, this is truly abnormal use: 250+Mbit going through the router
full time rather than 20Mbit or so.
So far this is ipv4 only. Before now I was mostly testing ipv6, with
fairly short (60 second) durations, and mostly over ethernet.
Things tried:
0) reducing the watchdog time to reset to 1 second rather than 5. This
seemed to help somewhat, but I’m going to rule it out.
1) turning off the qdiscs. This was mildly amusing as I was perturbed
by seeing ping times go from .2sec to 20 or 30ms on GigE with
pfifo_fast. It’s been a long time since these puppies were configured
as drop tail systems.
Anyway, still crashes. So I think we can regard the aqm system as
solid. Finally.
I do want to turn hoq-sfq off for further tests, tho.
3) bumping up BQL. Performance improves slightly, still crashes.
4) Not using the wireless at all. So far it’s survived 2000 seconds of
abuse, pounding 250+Mbit of ipv4 through it from 5 streams from two boxes.
So things point at some interaction with wireless.
theories are:
hostapd can’t get enough cpu time to run
the rngd daemon can’t get enough cpu time to run
we have a memory leak somewhere in the wireless stack
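One rough way to test the memory leak theory is to sample /proc/meminfo while the load runs and see whether MemFree trends steadily downward. The Python below is only a sketch of that idea: the helper names `parse_meminfo` and `slope_kb_per_sec` are made up for illustration, and it assumes the samples are collected somewhere Python is available (for instance by copying periodic `cat /proc/meminfo` output off the router).

```python
import re


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most fields are reported in kB. Lines that do not look like
    "Name:   12345 kB" are skipped.
    """
    fields = {}
    for line in text.splitlines():
        match = re.match(r"(\w+):\s+(\d+)", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def slope_kb_per_sec(samples):
    """Least-squares slope of (seconds, kB) samples.

    A steadily negative slope for MemFree under constant load is
    consistent with a leak, but it is not proof of one.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("all samples share the same timestamp")
    return num / den
```

Feeding it one (elapsed seconds, MemFree) pair every ten seconds or so for the wireless-only and ethernet-only runs would show whether free memory drops only when wireless is in use.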
For the longest time I’ve had this thing running at HZ_256. Anyway I’m
going to leave ethernet loaded up overnight, then try wireless all by
itself for a few hours.
I have other fish to fry right now.
As for this problem and patches, there is patch review on the openwrt list going on…
See thread:
http://www.mail-archive.com/openwrt-devel@lists.openwrt.org/msg13520.html
If it’s interrupt-related, I’m going to guess that the lack of napi_poll support in the ath9k driver doesn’t help. (The Ethernet driver has it, so you should be able to route LAN->WAN with no problems.)
Can napi even work on wireless devices?
On Tue, May 1, 2012 at 10:05 AM, cerowrt@lists.bufferbloat.net wrote:
>
> Issue #379 has been updated by Robert Bradley.
>
>
> Yes, the real question was, “How related is this to #360 and #371?” Apart from those two hiding it, I’m not sure we can say it is related.
>
> —————————————-
> Bug #379: 3.3.4-3 router crashes under heavy load
> https://www.bufferbloat.net/issues/379
>
> Author: David Taht
> Status: New
> Priority: Normal
> Assignee: Dave Täht
> Category: Linux Kernel
> Target version: 1st Public Cerowrt release
>
>
> So I started doing long duration, heavy access tests, driving things
> with my fastest three boxes, crypted and unencrypted wireless, and two
> machines driving through the internal ethernet at gigE speeds.
>
On Tue, May 1, 2012 at 11:13 AM, cerowrt@lists.bufferbloat.net wrote:
>
> Issue #379 has been updated by Robert Bradley.
>
>
> The ieee80211_ops struct seems to think it can. I haven’t found a driver that uses it yet - rtl8180 had it (http://www.spinics.net/lists/linux-wireless/msg53741.html) but that got reverted (http://git.itanic.dy.fi/?p=linux-stable;a=patch;h=a6d27d2ac89359f84c1a559b5530967ff671d269).
Everything resets.
Easily hit with a hammer on a patch, which I’ll do if I get enough energy. Shouldn’t do that tho
I also note that I’ve been unable to crash it with the 3.3.5-3 build + codel, which cheers me up, relatively. I haven’t tried sfqred.
I’ve been watching your progress here for a while. I’ve recently started some work using OpenWRT (I know, not the same project), and I’m seeing crashes there as well, also under heavy load. It runs about 400 seconds under very heavy load and then just reboots. Watching the serial console shows no output when the reload occurs, the box simply starts rebooting. Specifics:
I’m using an alix 2D13 board. I have several and the problem is seen on all of them.
I’m using very vanilla OpenWRT builds with a near to default configuration. In fact, I’ve turned off dnsmasq and the firewall. I have no QoS configured. This is literally just port to port routing, nothing else.
I have two of the three ports connected to a Smartbits test device. I’m sending 64 byte packets between eth0 and eth1. eth2 is not connected. The tests start at 10% interface load and run for 90 seconds. If the test passes (<0.01% packet loss), it increases the load. I see the following behavior:
10% load passes
55% load fails
32.5% load fails
21.5% load fails
60 seconds into the next test the device reboots. This is repeatable every time.
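The load steps above look like a bisection between the highest passing load and the lowest failing load. The Python sketch below reproduces that pattern under the assumption that the tester simply halves the interval each time; the function name `search_loads` and the 100% upper bound are illustrative guesses, not taken from the Smartbits configuration. With a plain bisection the fourth step comes out at 21.25% rather than the 21.5% reported, so the real tester may round or step slightly differently.

```python
def search_loads(outcomes, start=10.0, maximum=100.0):
    """Replay a bisection over interface load (in percent).

    outcomes is a sequence of booleans, one per test run:
    True means the run passed, False means it failed.
    Returns (loads_tested, next_load_to_try).
    """
    low, high = 0.0, maximum  # highest known pass, lowest known fail
    load = start
    tested = []
    for passed in outcomes:
        tested.append(load)
        if passed:
            low = load
        else:
            high = load
        load = (low + high) / 2.0
    return tested, load


# Pass at 10%, then three failures, as in the report.
tested, upcoming = search_loads([True, False, False, False])
print(tested, upcoming)
```

Under these assumptions the run that triggers the reboot would be at roughly 15.6% load, which is well below the loads that had already been attempted without a crash, so elapsed time under load rather than the instantaneous rate may be what matters.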
I guess what I’m getting at is I think perhaps this is an OpenWRT / kernel problem and not specific to CeroWRT. In fact, there’s a recently opened ticket on OpenWRT that describes something similar, though details there are sparse:
https://dev.openwrt.org/ticket/11882
I’m going to build a few images of OpenWRT to see if I can narrow down where it started crashing. Previous builds used to pass my entire test suite (64 bytes - 1500 byte packets). Now I can’t get past the 4th iteration of the first frame size. I suspect the move to kernel 3.3 in 31753 for the alix2 target is the cause. If you’re interested I’ll report back my findings.
I expected trouble with wireless, but didn’t have any until today… because I was doing the engineer thing and testing wireless by itself (driving it at 90+Mbit), and ethernet by itself, not both together. Sigh.
I can’t quite rule out iptables, and I should probably also test ipv6 under this scenario.