Over the past week, snare failed to serve about 85 of some 18 million ICAP
requests. Tonight, I began investigation into the source of those 85
failures. At 0420 this morning, I struck what may well be gold.
swps2 and swps1 use a repper at 192.168.61.30:
[ux-pol02-hst-02](0) $ xml sel -t -v //repperhost /usr/local/etc/snare/snare.conf
192.168.61.30
[ux-pol02-hst-02](0) $
swps-triage properly verifies that there's a route, and a full-duplex link
along the route:
[ux-pol02-hst-02](0) $ swps-triage | grep -A2 repper
Checking for a repper server...192.168.61.30
Checking for 192.168.61.30's interface...bce0
Checking 192.168.61.30 link... media: Ethernet autoselect (1000baseTX <full-duplex>)
[ux-pol02-hst-02](0) $
All seems well. The vast majority of librep requests are processed quickly
(there's been one retransmit among swps[12]'s ~1.5M repper lookups today)
and without problems.
Now, I've asked several times that we start using jumbo frames on the
backend. This is especially important given the questionable (to this day,
Sven, I call it questionable!) decision to use UDP for librep lookups; UDP
is a datagram-oriented transport, and large datagrams are prone to
fragmentation even within a Ethernet broadcast network. Indeed, without
jumbo frames we expect to see fragmentation at 1500 octets; it's 1994 and
we're running NFS on hubs all over again.
We see that the hardware, host-side, is jumbo frame-capable:
[ux-pol02-hst-02](0) $ ifconfig bce0 | grep options=
options=3b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_H
[ux-pol02-hst-02](0) $
This doesn't mean that the switch is necessarily capable, but it doesn't
matter because:
[ux-pol02-hst-02](0) $ ifconfig bce0 | grep mtu
bce0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,M
[ux-pol02-hst-02](0) $
we're not using jumbo frames anyway, at least not on this host. Now, all the
following tcpdumps are generated on swps1 via:
sudo tcpdump -i bce0 -s0 -v -n host 192.168.61.30 anto \\icmp
Observe what happens when we emit a datagram within the MTU (remember, 20
bytes for a default IPv4 header and 8 for an ICAP Echo Request, so we use
"sudo ping -s1472 -c1 192.168.61.30"):
08:51:14.881036 IP (tos 0x0, ttl 64, id 21040, offset 0, flags [none], proto: ICMP (1), length: 1500) 192.168.61.51 > 192.168.61.30: ICMP echo request, id 26337, seq 0, length 1480
08:51:14.881315 IP (tos 0x0, ttl 64, id 65506, offset 0, flags [none], proto: ICMP (1), length: 1500) 192.168.61.30 > 192.168.61.51: ICMP echo reply, id 26337, seq 0, length 1480
it is exactly as we expected. Now, we exceed the MTU (a body of 1473 octets):
08:56:10.819118 IP (tos 0x0, ttl 64, id 23037, offset 0, flags [+], proto: ICMP (1), length: 1500) 192.168.61.51 > 192.168.61.30: ICMP echo request, id 22754, seq 0, length 1480
08:56:10.819132 IP (tos 0x0, ttl 64, id 23037, offset 1480, flags [none], proto: ICMP (1), length: 21) 192.168.61.51 > 192.168.61.30: icmp
08:56:10.819462 IP (tos 0x0, ttl 64, id 2622, offset 0, flags [+], proto: ICMP (1), length: 1500) 192.168.61.30 > 192.168.61.51: ICMP echo reply, id 22754, seq 0, length 1480
08:56:10.819489 IP (tos 0x0, ttl 64, id 2622, offset 1480, flags [none], proto: ICMP (1), length: 21) 192.168.61.30 > 192.168.61.51: icmp
See now how 2 datagrams become 4 packets via fragmentation, and again all is
as we expect. Now, set the Don't Fragment bit via the -D option:
[ux-pol01-hst-02](2) $ sudo ping -s1472 -v -c1 -D 192.168.61.30
PING 192.168.61.30 (192.168.61.30): 1472 data bytes
1480 bytes from 192.168.61.30: icmp_seq=0 ttl=64 time=0.395 ms
--- 192.168.61.30 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.395/0.395/0.395/0.000 ms
[ux-pol01-hst-02](0) $ sudo ping -s1473 -v -c1 -D 192.168.61.30
PING 192.168.61.30 (192.168.61.30): 1473 data bytes
ping: sendto: Message too long
--- 192.168.61.30 ping statistics ---
1 packets transmitted, 0 packets received, 100% packet loss
[ux-pol01-hst-02](2) $
The second ping(8) operation fails immediately, generating an EMSGSIZE
error -- no packed is emitted. Had the fragmentation been necessary later
in the path, an ICMP Needed Fragmentation but DF Was Set reply packet would
be generated instead. ICMP errors like this are why a blanket "deny ICMP"
policy on a firewall are deadly kisses; DF, for instance, is used in PMTU
discovery. Let's allow fragmentation once more, and go larger. 8164 octets
of ICMP body work just fine:
[ux-pol01-hst-02](0) $ sudo ping -s8164 -v -c1 192.168.61.30
PING 192.168.61.30 (192.168.61.30): 8164 data bytes
8172 bytes from 192.168.61.30: icmp_seq=0 ttl=64 time=0.944 ms
--- 192.168.61.30 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.944/0.944/0.944/0.000 ms
[ux-pol01-hst-02](0) $
Note no fewer than 12 packets' generation as a result of these two
datagrams, remembering that any fragment's drop requires timeout and
retransmission of the entire packet:
09:08:42.262511 IP (tos 0x0, ttl 64, id 27975, offset 0, flags [+], proto: ICMP (1), length: 1500) 192.168.61.51 > 192.168.61.30: ICMP echo request, id 32996, seq 0, length 1480
09:08:42.262526 IP (tos 0x0, ttl 64, id 27975, offset 1480, flags [+], proto: ICMP (1), length: 1500) 192.168.61.51 > 192.168.61.30: icmp
09:08:42.262536 IP (tos 0x0, ttl 64, id 27975, offset 2960, flags [+], proto: ICMP (1), length: 1500) 192.168.61.51 > 192.168.61.30: icmp
09:08:42.262548 IP (tos 0x0, ttl 64, id 27975, offset 4440, flags [+], proto: ICMP (1), length: 1500) 192.168.61.51 > 192.168.61.30: icmp
09:08:42.262558 IP (tos 0x0, ttl 64, id 27975, offset 5920, flags [+], proto: ICMP (1), length: 1500) 192.168.61.51 > 192.168.61.30: icmp
09:08:42.262569 IP (tos 0x0, ttl 64, id 27975, offset 7400, flags [none], proto: ICMP (1), length: 792) 192.168.61.51 > 192.168.61.30: icmp
09:08:42.263106 IP (tos 0x0, ttl 64, id 9334, offset 0, flags [+], proto: ICMP (1), length: 1500) 192.168.61.30 > 192.168.61.51: ICMP echo reply, id 32996, seq 0, length 1480
09:08:42.263132 IP (tos 0x0, ttl 64, id 9334, offset 1480, flags [+], proto: ICMP (1), length: 1500) 192.168.61.30 > 192.168.61.51: icmp
09:08:42.263156 IP (tos 0x0, ttl 64, id 9334, offset 2960, flags [+], proto: ICMP (1), length: 1500) 192.168.61.30 > 192.168.61.51: icmp
09:08:42.263183 IP (tos 0x0, ttl 64, id 9334, offset 4440, flags [+], proto: ICMP (1), length: 1500) 192.168.61.30 > 192.168.61.51: icmp
09:08:42.263232 IP (tos 0x0, ttl 64, id 9334, offset 5920, flags [+], proto: ICMP (1), length: 1500) 192.168.61.30 > 192.168.61.51: icmp
09:08:42.263250 IP (tos 0x0, ttl 64, id 9334, offset 7400, flags [none], proto: ICMP (1), length: 792) 192.168.61.30 > 192.168.61.51: icmp
Now let's try 8165 octets...it fails, after a timeout! Was is das?
[ux-pol01-hst-02](0) $ sudo ping -s8165 -v -c1 192.168.61.30
PING 192.168.61.30 (192.168.61.30): 8165 data bytes
--- 192.168.61.30 ping statistics ---
1 packets transmitted, 0 packets received, 100% packet loss
[ux-pol01-hst-02](2) $
The tcpdump is full of naught but dashed hopes and idle dreams:
09:12:35.946734 IP (tos 0x0, ttl 64, id 29250, offset 0, flags [+], proto: ICMP (1), length: 1500) 192.168.61.51 > 192.168.61.30: ICMP echo request, id 17381, seq 0, length 1480
09:12:35.946750 IP (tos 0x0, ttl 64, id 29250, offset 1480, flags [+], proto: ICMP (1), length: 1500) 192.168.61.51 > 192.168.61.30: icmp
09:12:35.946760 IP (tos 0x0, ttl 64, id 29250, offset 2960, flags [+], proto: ICMP (1), length: 1500) 192.168.61.51 > 192.168.61.30: icmp
09:12:35.946772 IP (tos 0x0, ttl 64, id 29250, offset 4440, flags [+], proto: ICMP (1), length: 1500) 192.168.61.51 > 192.168.61.30: icmp
09:12:35.946788 IP (tos 0x0, ttl 64, id 29250, offset 5920, flags [+], proto: ICMP (1), length: 1500) 192.168.61.51 > 192.168.61.30: icmp
09:12:35.946813 IP (tos 0x0, ttl 64, id 29250, offset 7400, flags [none], proto: ICMP (1), length: 793) 192.168.61.51 > 192.168.61.30: icmp
No response whatsoever, at least that we can see! This is completely
repeatable. The magic number 8165 breaks things...why? Well, it's 1 greater
than 8164, which is 8 bytes of ICMP header and 20 bytes of IPv4 header less
than your friend and mine, 2 raised to the fighting 13th power, 8192. Is
this a FreeBSD thing? Let's try devsnare:
[ux-pol01-dhst-01](0) $ sudo ping -s20000 -d -c1 192.168.77.30
PING 192.168.77.30 (192.168.77.30): 20000 data bytes
20008 bytes from 192.168.77.30: icmp_seq=0 ttl=64 time=1.066 ms
--- 192.168.77.30 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max/stddev = 1.066/1.066/1.066/0.000 ms
[ux-pol01-dhst-01](0) $
At 20000 octets, it's doing just fine! Indeed, it will up until 25152 octets
of ICMP body (at this point I think it hits the kernel sysctl
net.inet.ip.maxfragsperpacket, with its default of 16, but the 1152-octet
discrepancy confounds me). At that point, we get a response on FreeBSD to Linux:
05:31:09.182027 IP prometheus.research.sys > aho.research.sys: ICMP echo request, id 39876, seq 0, length 1480
05:31:09.182032 IP prometheus.research.sys > aho.research.sys: icmp
05:31:09.182036 IP prometheus.research.sys > aho.research.sys: icmp
05:31:09.182045 IP prometheus.research.sys > aho.research.sys: icmp
05:31:09.182054 IP prometheus.research.sys > aho.research.sys: icmp
05:31:09.182059 IP prometheus.research.sys > aho.research.sys: icmp
05:31:09.182063 IP prometheus.research.sys > aho.research.sys: icmp
05:31:09.182070 IP prometheus.research.sys > aho.research.sys: icmp
05:31:09.182084 IP prometheus.research.sys > aho.research.sys: icmp
05:31:09.182090 IP prometheus.research.sys > aho.research.sys: icmp
05:31:09.182093 IP prometheus.research.sys > aho.research.sys: icmp
05:31:09.182097 IP prometheus.research.sys > aho.research.sys: icmp
05:31:09.182101 IP prometheus.research.sys > aho.research.sys: icmp
05:31:09.182108 IP prometheus.research.sys > aho.research.sys: icmp
05:31:09.182111 IP prometheus.research.sys > aho.research.sys: icmp
05:31:09.182114 IP prometheus.research.sys > aho.research.sys: icmp
05:31:09.182117 IP prometheus.research.sys > aho.research.sys: icmp
05:31:09.182120 IP prometheus.research.sys > aho.research.sys: icmp
05:31:09.184942 IP aho.research.sys > prometheus.research.sys: ICMP echo reply, id 39876, seq 0, length 1480
05:31:09.185054 IP aho.research.sys > prometheus.research.sys: icmp
05:31:09.185177 IP aho.research.sys > prometheus.research.sys: icmp
05:31:09.185300 IP aho.research.sys > prometheus.research.sys: icmp
05:31:09.185422 IP aho.research.sys > prometheus.research.sys: icmp
05:31:09.185546 IP aho.research.sys > prometheus.research.sys: icmp
05:31:09.185669 IP aho.research.sys > prometheus.research.sys: icmp
05:31:09.185791 IP aho.research.sys > prometheus.research.sys: icmp
05:31:09.185915 IP aho.research.sys > prometheus.research.sys: icmp
05:31:09.186037 IP aho.research.sys > prometheus.research.sys: icmp
05:31:09.186160 IP aho.research.sys > prometheus.research.sys: icmp
05:31:09.186284 IP aho.research.sys > prometheus.research.sys: icmp
05:31:09.186406 IP aho.research.sys > prometheus.research.sys: icmp
05:31:09.186530 IP aho.research.sys > prometheus.research.sys: icmp
05:31:09.186658 IP aho.research.sys > prometheus.research.sys: icmp
05:31:09.186775 IP aho.research.sys > prometheus.research.sys: icmp
05:31:09.186899 IP aho.research.sys > prometheus.research.sys: icmp
05:31:09.186903 IP aho.research.sys > prometheus.research.sys: icmp
but do not see it on the FreeBSD client:
[prometheus](0) $ sudo ping -s25153 -d -c1 svn.research.sys
PING svn.research.sys (10.14.1.9): 25153 data bytes
--- svn.research.sys ping statistics ---
1 packets transmitted, 0 packets received, 100% packet loss
[prometheus](2) $
while FreeBSD to FreeBSD receives no response:
09:39:13.277193 IP (tos 0x0, ttl 64, id 12478, offset 0, flags [+], proto ICMP (1), length 1500) 192.168.77.10 > 192.168.77.30: ICMP echo request, id 50698, seq 0, length 1480
09:39:13.277215 IP (tos 0x0, ttl 64, id 12478, offset 1480, flags [+], proto ICMP (1), length 1500) 192.168.77.10 > 192.168.77.30: icmp
09:39:13.277227 IP (tos 0x0, ttl 64, id 12478, offset 2960, flags [+], proto ICMP (1), length 1500) 192.168.77.10 > 192.168.77.30: icmp
09:39:13.277238 IP (tos 0x0, ttl 64, id 12478, offset 4440, flags [+], proto ICMP (1), length 1500) 192.168.77.10 > 192.168.77.30: icmp
09:39:13.277248 IP (tos 0x0, ttl 64, id 12478, offset 5920, flags [+], proto ICMP (1), length 1500) 192.168.77.10 > 192.168.77.30: icmp
09:39:13.277259 IP (tos 0x0, ttl 64, id 12478, offset 7400, flags [+], proto ICMP (1), length 1500) 192.168.77.10 > 192.168.77.30: icmp
09:39:13.277271 IP (tos 0x0, ttl 64, id 12478, offset 8880, flags [+], proto ICMP (1), length 1500) 192.168.77.10 > 192.168.77.30: icmp
09:39:13.277282 IP (tos 0x0, ttl 64, id 12478, offset 10360, flags [+], proto ICMP (1), length 1500) 192.168.77.10 > 192.168.77.30: icmp
09:39:13.277294 IP (tos 0x0, ttl 64, id 12478, offset 11840, flags [+], proto ICMP (1), length 1500) 192.168.77.10 > 192.168.77.30: icmp
09:39:13.277372 IP (tos 0x0, ttl 64, id 12478, offset 13320, flags [+], proto ICMP (1), length 1500) 192.168.77.10 > 192.168.77.30: icmp
09:39:13.277399 IP (tos 0x0, ttl 64, id 12478, offset 14800, flags [+], proto ICMP (1), length 1500) 192.168.77.10 > 192.168.77.30: icmp
09:39:13.277438 IP (tos 0x0, ttl 64, id 12478, offset 16280, flags [+], proto ICMP (1), length 1500) 192.168.77.10 > 192.168.77.30: icmp
09:39:13.277468 IP (tos 0x0, ttl 64, id 12478, offset 17760, flags [+], proto ICMP (1), length 1500) 192.168.77.10 > 192.168.77.30: icmp
09:39:13.277481 IP (tos 0x0, ttl 64, id 12478, offset 19240, flags [+], proto ICMP (1), length 1500) 192.168.77.10 > 192.168.77.30: icmp
09:39:13.277492 IP (tos 0x0, ttl 64, id 12478, offset 20720, flags [+], proto ICMP (1), length 1500) 192.168.77.10 > 192.168.77.30: icmp
09:39:13.277507 IP (tos 0x0, ttl 64, id 12478, offset 22200, flags [+], proto ICMP (1), length 1500) 192.168.77.10 > 192.168.77.30: icmp
09:39:13.277520 IP (tos 0x0, ttl 64, id 12478, offset 23680, flags [+], proto ICMP (1), length 1500) 192.168.77.10 > 192.168.77.30: icmp
09:39:13.277531 IP (tos 0x0, ttl 64, id 12478, offset 25160, flags [none], proto ICMP (1), length 21) 192.168.77.10 > 192.168.77.30: icmp
and mighty Linux to Linux has no problems all the way up, because it's great:
[aho](2) $ ping -s65507 -d -c1 knuth.research.sys
PING knuth.research.sys (10.14.120.39) 65507(65535) bytes of data.
65515 bytes from knuth.research.sys (10.14.120.39): icmp_seq=1 ttl=64 time=4.62ms
--- knuth.research.sys ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 4.627/4.627/4.627/0.000 ms
[aho](0) $
So, observations to this point:
- FreeBSD always seems to successfully return an error to the application on failure to emit a packet. If it can't be emitted, according to the host'sbest idea of the medium, EMSGSIZE is properly delivered.
- FreeBSD seems to by default drop some incoming packets without generating any kind of error response, despite the desirability of doing so.
- librep generates, on average, packets well within the default MTU. It can, however, coalesce several requests into a much larger packet. QUERY_LEN is defined within librep.h to be 10000 (huh? Goddamnit, Jeremy).
- Snare to repper in production is always FreeBSD to FreeBSD.
- Something janky's afoot for swps[12] on large ICMP and, most likely, UDP, affecting packets of 8192 octets or more.
- colosnare has shown noticably less repper retransmits and ERRORs, even when all ALP traffic was diverted through it.
- swps[12] have shown max transaction-per-second counts and open file descriptor counts exceeding 200. This could lead to large librep packets.
My conjectures, hypotheses and conclusions:
- There is a setting on the swps repper, or on their common switch, causing large or heavily fragmented IPv4 packets to be dropped. This ought be rectified. It is likely responsible for at least some of our (very rare) failures.
- Snare/Repper clusters ought employ jumbo frames, making use only of that hardware which supports it. If WebWasher can be configured to use jumbo frames (Erik, this is your chance to shine!), we ought definitely do that as well -- otherwise, we'll want to set up a VLAN (or host route hacks).
- Librep stupidly emits UDP packets exceeding the PMTU; its original author was raised by wolves, and didn't know better. This "optimization" leads to fragmentation, wherein coalescing of requests is undone -- in any process, energy is lost, and so here we lose throughput (fragmentation overhead), and independence of datagrams (fragments operate as a Layer 3 group); if a firewall or host along the way silently drops the packet, we lose the game entirely, retransmitting unto oblivion and finally cold failure. This ought be fixed; in the absence of PMTU, use the Internet MTU of 576.
- There's likely at least one thing I'm missing here. Thoughts?
--rigorously, nick
--
Nick Black <dankamongmen@acm.org>
"NP: The class of dashed hopes and idle dreams."
Principal Engineer -- Secure Computing Applied Research
boom, motherfuckers!