Hunting a Linux kernel bug

Earlier last year, we identified a firewall misconfiguration that accidentally dropped most network traffic. We expected resetting the firewall configuration to fix the issue, but doing so exposed a kernel bug!

Fully resetting the firewall gave us the default CentOS configuration, which blocks all traffic except for ssh. Restoring our previously good firewall configuration didn't fix this, because the default sysctl settings ran into a corner-case kernel bug. Let's take a look at what this means and what happened.

To understand what happened, we'll need some background about a few different parts of our stack:

  • Linux reverse path filter
  • Mesos network isolation
  • Linux local packet routing

Reverse path filter

The reverse path filter is a mitigation against the IP spoofing used in DDoS attacks. For each incoming IP packet, it does a reverse routing table lookup on the arrival interface, using the packet's source IP address as the lookup key.

If the lookup succeeds and the next hop device returned by the routing table matches the arrival interface, the packet is accepted. If not, the packet is considered a "martian" and dropped before it enters the IP stack. For example, loopback IP addresses in the subnet 127.0.0.0/8 should never appear on any hardware interface attached to the wire and will therefore be identified as martians and dropped.

The reverse path filter is enabled by default on many Linux distros, which means that the sysctl net.ipv4.conf.all.rp_filter is set to 1 ("strict" mode). Setting it to 2 selects "loose" mode, which relaxes the interface matching rule: a packet is accepted as long as its source address is reachable through some interface, even if that isn't the arrival interface. Setting it to 0 disables the filter entirely.
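
As a rough illustration of the three modes, here is a toy model in Python. It is a sketch, not kernel code: the names RpFilter, effective_mode, and rp_filter_accepts are made up for this example, and the routing lookup is reduced to "which device does the reverse lookup return, if any". The rule that the larger of the "all" and per-interface values wins is taken from the kernel's ip-sysctl documentation.

```python
from enum import IntEnum
from typing import Optional


class RpFilter(IntEnum):
    OFF = 0     # no source validation
    STRICT = 1  # reverse route must point back at the arrival interface
    LOOSE = 2   # source only has to be reachable via some interface


def effective_mode(conf_all: int, conf_iface: int) -> RpFilter:
    """Per the kernel docs, the max of conf/all and conf/<iface> is used."""
    return RpFilter(max(conf_all, conf_iface))


def rp_filter_accepts(mode: RpFilter,
                      route_dev: Optional[str],
                      arrival_dev: str) -> bool:
    """Toy decision function (hypothetical, for illustration only).

    route_dev is the device returned by the reverse lookup on the
    packet's source address, or None if no route was found.
    """
    if mode == RpFilter.OFF:
        return True
    if route_dev is None:
        return False          # unreachable source: martian in both modes
    if mode == RpFilter.LOOSE:
        return True           # any interface will do
    return route_dev == arrival_dev  # strict: interfaces must match
```

In this model, a packet that arrives on lo but whose reverse lookup returns eth0 is accepted in loose mode and rejected in strict mode.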

Mesos network isolation

We have too many datacenter machines to assign an IPv4 address per container. Instead, each host gets an IPv4 address and each container gets a port range. Our container configuration isolates the entire TCP/IP stack with network namespaces, including routing tables, IP addresses, as well as the loopback interface, which is used for all local network communications.

To communicate between containers on the same host, we create a virtual ethernet pair for each container. Packets go into one end of the pair and come out of the other. Conceptually, the port-range-based isolation sits at the transport layer, while IP-layer communication works as a whole stack, as normal. In particular, after isolation we have to explicitly restore local network communication between different containers on the same host.

Mesos uses Linux Traffic Control (TC) filters and actions to move packets between containers and the host (per above, this is based on the port). This turns out to be somewhat complicated and, as we'll see, the Linux routing table code doesn't have high test coverage for doing this when dealing with the loopback interface and virtual ethernet pairs.

Another piece of Mesos background is that when we introduced the port range based isolation, we were using an old kernel which had a few bugs that required us to turn off the reverse path filter. Those bugs have been fixed, but we've continued to set rp_filter to 0 because there wasn't a strong reason to change it back.

Routing local network packets

At the IP layer, we have two types of local network packets flying around within a host:

1) Loopback packets whose destination address is a loopback address (i.e., in the 127.0.0.0/8 IP subnet). Those packets are routed to the virtual loopback interface per the local routing table.

2) Packets whose destination IP address is owned by the host. The Linux kernel optimizes this case by rerouting them to the loopback interface.

The virtual loopback interface is automatically created by the kernel, along with the loopback IP address, 127.0.0.1, and the routing entry for the 127.0.0.0/8 subnet. These are in the local routing table, which is not shown by default:

$ ip route show table local
broadcast 10.53.180.0 dev eth0 proto kernel scope link src 10.53.180.130
local 10.53.180.130 dev eth0 proto kernel scope host src 10.53.180.130
broadcast 10.53.180.255 dev eth0 proto kernel scope link src 10.53.180.130
broadcast 127.0.0.0 dev lo proto kernel scope link src 127.0.0.1
local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1
local 127.0.0.1 dev lo proto kernel scope host src 127.0.0.1
broadcast 127.255.255.255 dev lo proto kernel scope link src 127.0.0.1

Another optimization is that every packet that goes through the loopback interface saves its routing information to avoid another routing table lookup when looping back to the same IP stack.

In the virtual ethernet pair case, that information is discarded for the sake of network isolation, even though a routing table lookup could be saved for the same reason as in the case above.

One more complication is that, for all packets of type (1), packets with a loopback destination, the 127.0.0.0/8 subnet isn't considered routable unless the net.ipv4.conf.all.route_localnet sysctl is set to 1. In addition, neither type (1) nor type (2) packets are accepted by the routing table unless the net.ipv4.conf.all.accept_local sysctl is set to 1.

From this, we see that the optimization of saving routing information for loopback packets is actually built into the Linux networking stack.
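
To check how a given host is configured, the sysctls mentioned so far can be read from /proc/sys, where each dot in the name becomes a directory separator. The following is a minimal sketch; sysctl_path and read_sysctl are helper names invented here, and the simple split on dots would not work for interface names that themselves contain a dot.

```python
from pathlib import Path

# The sysctls discussed above.
SYSCTLS = [
    "net.ipv4.conf.all.rp_filter",
    "net.ipv4.conf.all.route_localnet",
    "net.ipv4.conf.all.accept_local",
]


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its file under /proc/sys."""
    return Path(root).joinpath(*name.split("."))


def read_sysctl(name, root="/proc/sys"):
    """Return the sysctl value as a string, or None if it can't be read."""
    try:
        return sysctl_path(name, root).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    for name in SYSCTLS:
        print(name, "=", read_sysctl(name))
```

Inside a network namespace the values can differ from the host's, so a check like this has to run in the namespace of interest.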

Debugging

Now that we have the necessary background, let's look at the bug.

A firewall change caused the initial outage. The first thing we did was to reset the configuration and bring in CentOS defaults, which blocked all incoming connections except for ssh. We then restored a known good configuration. However, doing this and restarting the firewall services reset our sysctl settings too, particularly resetting net.ipv4.conf.all.rp_filter to 1.

Curiously, setting rp_filter to 1 prevented us from routing local network packets properly! On a debug host with net.ipv4.conf.all.log_martians turned on (which logs all martian packets), we saw:

[7949958.398209] IPv4: martian source 10.53.180.130 from 10.53.180.130, on dev lo
[7949958.405429] ll header: 00000000: 98 03 9b 7f ff d0 98 03 9b 7f ff d0 08 00  ..............
[7949959.447357] IPv4: martian source 10.53.180.130 from 10.53.180.130, on dev lo
[7949959.454577] ll header: 00000000: 98 03 9b 7f ff d0 98 03 9b 7f ff d0 08 00  ..............
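
When there are many such lines, it can help to pull the fields out programmatically. The sketch below parses the martian message and decodes the 14-byte link-layer header (destination MAC, source MAC, EtherType). The function and type names are invented for this example. The field order in the regex (destination address first, then source) reflects our reading of the kernel's log format and should be treated as an assumption; here both addresses are identical, so it makes no difference.

```python
import re
from typing import NamedTuple, Optional, Tuple

MARTIAN_RE = re.compile(
    r"martian source (?P<daddr>\S+) from (?P<saddr>\S+), on dev (?P<dev>\S+)"
)


class Martian(NamedTuple):
    daddr: str  # assumed: first address printed is the destination
    saddr: str  # assumed: second address printed is the source
    dev: str    # interface the packet arrived on


def parse_martian(line: str) -> Optional[Martian]:
    """Extract addresses and device from a 'martian source' log line."""
    m = MARTIAN_RE.search(line)
    if m is None:
        return None
    return Martian(m["daddr"], m["saddr"], m["dev"])


def parse_ll_header(hex_bytes: str) -> Tuple[str, str, int]:
    """Decode an Ethernet header dump into (dst MAC, src MAC, EtherType)."""
    raw = bytes.fromhex(hex_bytes)  # fromhex skips the spaces between bytes
    if len(raw) < 14:
        raise ValueError("an Ethernet header is 14 bytes long")
    dst = ":".join(f"{b:02x}" for b in raw[0:6])
    src = ":".join(f"{b:02x}" for b in raw[6:12])
    ethertype = int.from_bytes(raw[12:14], "big")
    return dst, src, ethertype
```

For the header above, this yields the same MAC address for both source and destination and an EtherType of 0x0800 (IPv4), which fits a packet the host sent to itself.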

Recall from the section on routing local network packets that these are type (2) packets: their source and destination IP addresses both belong to the same host. They are being routed through the loopback interface, as expected, so rp_filter shouldn't drop them in either strict or loose mode. Yet they're being dropped in strict mode.

There are a number of places where the packet could be dropped, too many to inspect every possible line of Linux kernel code. To find the offending code, we turned on all routing trace points via /sys/kernel/debug/tracing/events/fib/enable. These highly suspicious lines showed up in /sys/kernel/debug/tracing/trace:

<...>-311708 [040] ..s1 7951180.957825: fib_table_lookup: table 254 oif 0 iif 1 src 10.53.180.130 dst 10.53.180.130 tos 0 scope 0 flags 0 
<...>-311708 [040] ..s1 7951180.957826: fib_table_lookup_nh: nexthop dev eth0 oif 4 src 10.53.180.130 

fib_table_lookup is the trace point for routing table lookup and fib_table_lookup_nh is the trace point right after routing table lookup for matching the next hop device.

We can see that the routing table returns eth0 as the next hop device instead of the loopback interface, lo, even though the lookup itself is done on lo (whose index is 1). This causes rp_filter to consider the packet a martian.
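
The same comparison can be automated over a longer trace. The sketch below pairs each fib_table_lookup line with the fib_table_lookup_nh line that follows it and reports lookups whose next hop device differs from the arrival interface. The regexes are written against the two trace lines shown above only, and the helper name find_mismatches is made up. Pairing adjacent lines is a simplification: on a busy machine, events from different CPUs can interleave. The index-to-name mapping has to be supplied by the caller (for example from `ip link`); the values used in the usage note are read off the trace itself.

```python
import re

LOOKUP_RE = re.compile(
    r"fib_table_lookup: table (?P<table>\d+) oif (?P<oif>\d+) "
    r"iif (?P<iif>\d+) src (?P<src>\S+) dst (?P<dst>\S+)"
)
NEXTHOP_RE = re.compile(
    r"fib_table_lookup_nh: nexthop dev (?P<dev>\S+) oif (?P<oif>\d+)"
)


def find_mismatches(trace_lines, ifindex_to_name):
    """Return (src, arrival_dev, nexthop_dev) for mismatching lookups."""
    mismatches = []
    pending = None  # the most recent fib_table_lookup without a next hop yet
    for line in trace_lines:
        lookup = LOOKUP_RE.search(line)
        if lookup:
            pending = lookup
            continue
        nexthop = NEXTHOP_RE.search(line)
        if nexthop and pending is not None:
            arrival = ifindex_to_name.get(int(pending["iif"]))
            if arrival != nexthop["dev"]:
                mismatches.append((pending["src"], arrival, nexthop["dev"]))
            pending = None
    return mismatches
```

Called with the two trace lines above and the mapping {1: "lo", 4: "eth0"}, it reports a single mismatch: source 10.53.180.130 arrived on lo, but the lookup returned eth0.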

But it doesn't make sense that the next hop device of the local network packet is eth0 and not lo. If that were correct, the loopback interface wouldn't have been selected for egress!

This indicates that we should look more closely at the Linux routing table.

Each time an IP address is added to a network interface, the Linux kernel automatically creates two routing entries for it: one for the IP subnet, which represents the entire network, and one for the specific IP address, which goes in the local routing table. The latter is the entry of interest here. It has a 32-bit prefix so that it matches only the address itself, its routing type is marked as local, and its next hop device is the interface for which it was created.

In each routing table, a trie is used to speed up lookups, which are usually a form of longest prefix match. The routing entry for the specific IP address is added to the trie as a leaf, since its 32-bit prefix is the longest possible. Linux exposes the trie in /proc/net/fib_trie:

$ cat /proc/net/fib_trie
…
Local:
  +-- 0.0.0.0/1 2 0 2
     +-- 0.0.0.0/4 2 0 2
        |-- 0.0.0.0
           /0 universe UNICAST
        +-- 10.53.180.0/24 2 0 1
           |-- 10.53.180.0
              /32 link BROADCAST
              /24 link UNICAST
           |-- 10.53.180.130
              /32 host LOCAL
           |-- 10.53.180.255
              /32 link BROADCAST
     +-- 127.0.0.0/8 2 0 2
        +-- 127.0.0.0/31 1 0 0
           |-- 127.0.0.0
              /32 link BROADCAST
              /8 host LOCAL
           |-- 127.0.0.1
              /32 host LOCAL
        |-- 127.255.255.255
           /32 link BROADCAST

When the kernel does a lookup in the local routing table for an outgoing packet with destination address 10.53.180.130, its most specific routing entry matches and it returns eth0 as its next hop device.
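
To make the lookup concrete, here is a toy longest prefix match over the local table shown earlier. It uses a plain linear scan rather than the kernel's trie, so it illustrates only the result of the lookup, not how the kernel computes it. The Route type and the lookup helper are invented for this sketch; the entries are copied from the `ip route show table local` output above.

```python
import ipaddress
from typing import NamedTuple, Optional, Sequence


class Route(NamedTuple):
    prefix: ipaddress.IPv4Network
    rtype: str  # "local" or "broadcast"
    dev: str    # next hop device recorded in the entry


def route(prefix: str, rtype: str, dev: str) -> Route:
    return Route(ipaddress.ip_network(prefix), rtype, dev)


# Entries from the local routing table shown earlier.
LOCAL_TABLE = [
    route("10.53.180.0/32", "broadcast", "eth0"),
    route("10.53.180.130/32", "local", "eth0"),
    route("10.53.180.255/32", "broadcast", "eth0"),
    route("127.0.0.0/32", "broadcast", "lo"),
    route("127.0.0.0/8", "local", "lo"),
    route("127.0.0.1/32", "local", "lo"),
    route("127.255.255.255/32", "broadcast", "lo"),
]


def lookup(table: Sequence[Route], addr: str) -> Optional[Route]:
    """Return the matching route with the longest prefix, or None."""
    ip = ipaddress.ip_address(addr)
    matches = [r for r in table if ip in r.prefix]
    return max(matches, key=lambda r: r.prefix.prefixlen, default=None)
```

Looking up 10.53.180.130 in this table returns the /32 local entry whose device is eth0, which is exactly the value the trace points reported.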

Following the code path for the egress routing table lookup, we see that the Linux kernel immediately replaces the next hop device with the loopback interface once it knows this is a local route. The reverse routing table lookup has no such amendment: eth0 is returned, it clearly doesn't match the incoming interface lo, and that is why we fail the strict rp_filter check!

Bug fix

Our final fix is a small (5 line) patch that amends the reverse routing table lookup in a way similar to egress. In addition to the existing matching logic, if the route is local, we also allow the packet to pass the filter when the arrival interface is the loopback interface.

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 317339cd7f03..e8bc939b56dd 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -388,6 +388,11 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
 	fib_combine_itag(itag, &res);
 
 	dev_match = fib_info_nh_uses_dev(res.fi, dev);
+	/* This is not common, loopback packets retain skb_dst so normally they
+	 * would not even hit this slow path.
+	 */
+	dev_match = dev_match || (res.type == RTN_LOCAL &&
+				  dev == net->loopback_dev);
 	if (dev_match) {
 		ret = FIB_RES_NHC(res)->nhc_scope >= RT_SCOPE_HOST;
 		return ret;

We also added a self-contained test case to make sure there isn't a regression in the future.
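
The effect of the patch on the device match can be summarized in a few lines of Python. This is a simplified model, not a translation of the kernel function: the route is reduced to a single next hop device and a route type string, and the function names are hypothetical.

```python
def dev_match_before(nexthop_dev: str, arrival_dev: str) -> bool:
    """Original check: the route's next hop device must be the arrival device."""
    return nexthop_dev == arrival_dev


def dev_match_after(nexthop_dev: str, arrival_dev: str, route_type: str,
                    loopback_dev: str = "lo") -> bool:
    """Patched check: a local route also matches when arriving on loopback."""
    return dev_match_before(nexthop_dev, arrival_dev) or (
        route_type == "local" and arrival_dev == loopback_dev
    )
```

For the packet from the martian log (local route, next hop eth0, arrival on lo), the original check fails and the patched check passes. A local route whose packet arrives on some other interface is still rejected.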

Conclusion

We learned how Linux routes some kinds of packets and about one corner case that Linux wasn't handling correctly. Thanks to trace events, we saw how to get internal Linux kernel state in user-space without modifying the source code. This kind of thing often comes in handy when you want to debug a kernel bug or do performance tuning.
