As the line blurs between nodes that used to just do things, such as client and server applications, and nodes that would just do networking functions, such as Layer 3 routers and Layer 2 switches, I find that I have to remind myself, and others, about a few things. These are things I know, but need to be reminded of in the heat of the moment. When you are working on a router or switch, being “multi-homed”, that is, having multiple interfaces in the node, is a given. On a “host”, such as a node running a web server, one typically receives HTTP requests and replies to them. You might assume the reply will be sent via the interface the request arrived on. But this isn’t always the case; in fact, as I’ll describe here, it often isn’t.
As the networking world becomes more and more virtualized, that is, as networking functions are implemented on hosts that have traditionally been servers, it’s important to be reminded of how the “networking” world works.
These concepts apply anywhere TCP/IP is used, but I describe them here using Unix in general and Linux specifically. Below I’ll describe what I believe are common misconceptions about IP addresses and network interfaces. The main topics covered are:
- IPv4 and IPv6 are UNI-directional not bi-directional
- IP addresses are NOT configured on interfaces but on the node
- RFC 1918 IPv4 addresses are indeed routable
Misconceptions about IP Addresses and Interfaces
On a node with one interface, say eth0, packet flow is obvious. Any packets sent out will egress eth0 and anything inbound will ingress eth0. The same holds if wlan0 is used instead (and eth0 is disabled): the wlan0 interface is used for all traffic. But what happens if both eth0 and wlan0 are enabled at the same time? Or, more likely, you are on a Linux server running Apache with more than one interface to the Internet? How about a bare-metal node that is “dual-homed”? Things are now not as simple as before.
In this post I’ll describe a few misconceptions I’ve noticed over the years related to how IP addresses and interfaces relate to each other.
IPv4 and IPv6 are UNI-directional
The IP protocols (both IPv4 and IPv6) are uni-directional. There is no guarantee that the link used to send a packet to nodeZ will be the same link used when nodeZ sends a packet back to you. When more than one interface exists in a node you can (and often do) send packets on ethX and receive packets on ethY, even for the same TCP connection. This is a feature, not a bug: being uni-directional allows the network to scale globally (i.e. the Internet). In this context, “interface” is logical. You may have just one physical NIC, but that NIC may represent more than one interface (e.g. a Linux netdev device): multiple VLANs on the same NIC, macvlan devices, SR-IOV virtual functions, etc.
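VLAN sub-interfaces are one way a single NIC becomes several logical interfaces. A minimal sketch on Linux (the device name and VLAN IDs are made up for illustration; requires root):

```shell
# Create two VLAN sub-interfaces on one physical NIC (eth0 assumed).
# Each one is a separate netdev with its own addresses and routes.
ip link add link eth0 name eth0.100 type vlan id 100
ip link add link eth0 name eth0.200 type vlan id 200
ip link set eth0.100 up
ip link set eth0.200 up
```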
On Linux you may have already experienced this. The sysctl “net.ipv4.conf.all.rp_filter” used to be a simple on/off setting. A third, “loose” mode (value 2) was added to account for the possibility that a packet isn’t always received from a remote node over the same link the local node uses to send to it. When there is more than one interface, packets received by the local node do not always arrive via the interface that would be used to send back to the packet’s source.
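You can inspect and change the setting directly. A sketch (the values are per the kernel’s ip-sysctl documentation: 0 = no source validation, 1 = strict, 2 = loose; writing it requires root):

```shell
# Show the current reverse-path filtering mode.
sysctl net.ipv4.conf.all.rp_filter
# Strict mode (1) drops a packet if the reply would not leave via the
# interface it arrived on; loose mode (2) only requires that the source
# be reachable via some interface -- the right choice for multi-homed hosts.
sudo sysctl -w net.ipv4.conf.all.rp_filter=2
```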
When a node (nodeA) is connected to the outside world via multiple links, it can only control how a packet will egress the node. The local routing table indicates which interface, ethX or ethY, to use to send the packet to the next node in the path toward the packet’s destination (say nodeZ). This decision is made on each node in the IP network until the packet reaches the node that holds the packet’s destination address.
The same is true in the opposite direction. When the destination of the packet described above, nodeZ, wants to send back to nodeA, the process repeats in the reverse direction. Packets in the return path may not even traverse the same nodes, let alone the same links, from nodeZ back to nodeA as those used by nodeA to send to nodeZ.
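You can ask the kernel directly which egress interface it would choose for a destination. A sketch using this post’s example addresses (run inside nodeA; your output will vary):

```shell
# Which interface and source address would the kernel pick for this
# destination? Only the destination is looked up in the routing table.
ip route get 10.200.10.4
# Forcing a different source address does not change the egress choice:
ip route get 10.200.10.4 from 10.200.11.5
```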
If you drive in a large city like I do, with one-way streets, the concept will look familiar. The route you drive (or Uber) from your house to the pub is probably not the same route used to get back home.
Next I’ll show a simplified example of this on Linux using podman. We start two containers, each with two interfaces, eth0 and eth1. Each eth0 connects to net10 (10.200.10.0/24) and each eth1 connects to net11 (10.200.11.0/24).
First let’s setup nodeA. We will start a container with two interfaces in it. The interface eth0 will connect to net10, which uses the 10.200.10.0/24 subnet. The eth1 interface will connect to net11, which uses the 10.200.11.0/24 subnet.
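The net10 and net11 networks have to exist before containers can join them. A hedged sketch of how they might be created (subnet values from this example; podman’s default bridge driver is assumed):

```shell
# Create the two podman networks used below (requires the podman CLI).
sudo podman network create --subnet 10.200.10.0/24 net10
sudo podman network create --subnet 10.200.11.0/24 net11
```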
[mcc@snark]$ sudo podman run --detach --name=nodeA --network=net10,net11 f34tools
b8abb498dc3f00253308140b4936a9d45bbf8358af6e4e4de726b5872dea6e73
[mcc@snark]$ sudo podman exec -it nodeA /bin/bash
[root@b8abb498dc3f /]#
[root@b8abb498dc3f /]# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet 10.67.78.1/32 scope global lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0@if17: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 56:0d:f7:7e:e0:b0 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.200.10.5/24 brd 10.200.10.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::540d:f7ff:fe7e:e0b0/64 scope link
       valid_lft forever preferred_lft forever
3: eth1@if18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether ee:17:4b:31:a1:b9 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.200.11.5/24 brd 10.200.11.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::ec17:4bff:fe31:a1b9/64 scope link
       valid_lft forever preferred_lft forever
[root@b8abb498dc3f /]# ip route show
default via 10.200.11.1 dev eth1
default via 10.200.10.1 dev eth0
10.200.10.0/24 dev eth0 proto kernel scope link src 10.200.10.5
10.200.11.0/24 dev eth1 proto kernel scope link src 10.200.11.5
[root@b8abb498dc3f /]#
Now we setup node nodeZ:
[mcc@snark]$ sudo podman run --detach --name=nodeZ --network=net10,net11 f34tools
b8abb498dc3f00253308140b4936a9d45bbf8358af6e4e4de726b5872dea6e73
[mcc@snark]$ sudo podman exec -it nodeZ /bin/bash
[root@974d4f1c1a3b /]# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet 10.99.99.99/32 scope global lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0@if15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 9e:a8:27:1d:c6:ca brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.200.10.4/24 brd 10.200.10.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::9ca8:27ff:fe1d:c6ca/64 scope link
       valid_lft forever preferred_lft forever
3: eth1@if16: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether fe:48:22:61:f9:1e brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.200.11.4/24 brd 10.200.11.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::fc48:22ff:fe61:f91e/64 scope link
       valid_lft forever preferred_lft forever
[root@974d4f1c1a3b /]# ip route show
default via 10.200.11.1 dev eth1
default via 10.200.10.1 dev eth0
10.67.78.1 via 10.200.11.5 dev eth1
10.200.10.0/24 dev eth0 proto kernel scope link src 10.200.10.4
10.200.11.0/24 dev eth1 proto kernel scope link src 10.200.11.4
[root@974d4f1c1a3b /]#
Ensure that we can ping from nodeA to nodeZ:
[root@b8abb498dc3f /]# ping 10.200.10.4
PING 10.200.10.4 (10.200.10.4) from 10.200.11.5 : 56(84) bytes of data.
64 bytes from 10.200.10.4: icmp_seq=1 ttl=64 time=0.278 ms
64 bytes from 10.200.10.4: icmp_seq=2 ttl=64 time=0.128 ms
64 bytes from 10.200.10.4: icmp_seq=3 ttl=64 time=0.158 ms

--- 10.200.10.4 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 0.128/0.188/0.278/0.064 ms
[root@b8abb498dc3f /]#
So far so good; nothing new here. Now for the not-so-typical stuff. What happens when you tell ping to use the address of a different interface (that of eth1, using -I 10.200.11.5) to reach an address (10.200.10.4) which is “configured for” eth0?
On nodeA we do:
[root@b8abb498dc3f /]# ping -c 3 -I 10.200.11.5 10.200.10.4
PING 10.200.10.4 (10.200.10.4) from 10.200.11.5 : 56(84) bytes of data.
64 bytes from 10.200.10.4: icmp_seq=1 ttl=64 time=0.278 ms
64 bytes from 10.200.10.4: icmp_seq=2 ttl=64 time=0.128 ms
64 bytes from 10.200.10.4: icmp_seq=3 ttl=64 time=0.158 ms

--- 10.200.10.4 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 0.128/0.188/0.278/0.064 ms
[root@b8abb498dc3f /]#
We see ping was successful. To see what is going on, let’s look at what was put “on the wire”, and which wire was used, when. In separate windows tcpdump is run to capture packets on each interface. First eth0:
[root@b8abb498dc3f /]# tcpdump -nn -vvv -i eth0
dropped privs to tcpdump
tcpdump: listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
13:59:35.562675 IP (tos 0x0, ttl 64, id 29952, offset 0, flags [DF], proto ICMP (1), length 84)
    10.200.11.5 > 10.200.10.4: ICMP echo request, id 22, seq 1, length 64
13:59:36.563667 IP (tos 0x0, ttl 64, id 30096, offset 0, flags [DF], proto ICMP (1), length 84)
    10.200.11.5 > 10.200.10.4: ICMP echo request, id 22, seq 2, length 64
13:59:37.565785 IP (tos 0x0, ttl 64, id 30154, offset 0, flags [DF], proto ICMP (1), length 84)
    10.200.11.5 > 10.200.10.4: ICMP echo request, id 22, seq 3, length 64
13:59:41.021645 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.200.10.4 tell 10.200.10.5, length 28
13:59:41.021743 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.200.10.4 tell 10.200.10.5, length 28
13:59:41.021826 ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.200.10.4 is-at 9e:a8:27:1d:c6:ca, length 28
Notice that the ICMP Echo Request is seen, but not the ICMP Echo Reply. The Echo Request is sent on eth0, but the Echo Reply will arrive via eth1:
[root@b8abb498dc3f /]# tcpdump -nn -vvv -i eth1
dropped privs to tcpdump
tcpdump: listening on eth1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
13:59:35.562789 IP (tos 0x0, ttl 64, id 913, offset 0, flags [none], proto ICMP (1), length 84)
    10.200.10.4 > 10.200.11.5: ICMP echo reply, id 22, seq 1, length 64
13:59:36.563751 IP (tos 0x0, ttl 64, id 1193, offset 0, flags [none], proto ICMP (1), length 84)
    10.200.10.4 > 10.200.11.5: ICMP echo reply, id 22, seq 2, length 64
13:59:37.565884 IP (tos 0x0, ttl 64, id 2040, offset 0, flags [none], proto ICMP (1), length 84)
    10.200.10.4 > 10.200.11.5: ICMP echo reply, id 22, seq 3, length 64
13:59:41.021734 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.200.11.5 tell 10.200.11.4, length 28
13:59:41.021765 ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.200.11.5 is-at ee:17:4b:31:a1:b9, length 28
On node nodeZ we also run tcpdump on both eth0 and eth1. You will see that the ICMP Echo Request arrives on the eth0 interface, and the ICMP Echo Reply is sent on the other, eth1. That is the interface the route table on nodeZ (shown above) points to for any packets it sends.
In the first window tcpdump is run on eth0:
[root@974d4f1c1a3b /]# tcpdump -nn -vvv -i eth0
dropped privs to tcpdump
tcpdump: listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
13:59:35.562723 IP (tos 0x0, ttl 64, id 29952, offset 0, flags [DF], proto ICMP (1), length 84)
    10.200.11.5 > 10.200.10.4: ICMP echo request, id 22, seq 1, length 64
13:59:36.563696 IP (tos 0x0, ttl 64, id 30096, offset 0, flags [DF], proto ICMP (1), length 84)
    10.200.11.5 > 10.200.10.4: ICMP echo request, id 22, seq 2, length 64
13:59:37.565819 IP (tos 0x0, ttl 64, id 30154, offset 0, flags [DF], proto ICMP (1), length 84)
    10.200.11.5 > 10.200.10.4: ICMP echo request, id 22, seq 3, length 64
13:59:41.021746 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.200.10.4 tell 10.200.10.5, length 28
13:59:41.021785 ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.200.10.4 is-at 9e:a8:27:1d:c6:ca, length 28
The source address is what was “configured” on eth1 of nodeA, but the packet was received on eth0. Since the destination of the packet is 10.200.10.4, the route table on nodeA is used to figure out which interface, eth0, to use to send the packet. The source address isn’t used.
At the same time, in another window, tcpdump is also being run on eth1:
[root@974d4f1c1a3b /]# tcpdump -nn -vvv -i eth1
dropped privs to tcpdump
tcpdump: listening on eth1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
13:59:35.562768 IP (tos 0x0, ttl 64, id 913, offset 0, flags [none], proto ICMP (1), length 84)
    10.200.10.4 > 10.200.11.5: ICMP echo reply, id 22, seq 1, length 64
13:59:36.563733 IP (tos 0x0, ttl 64, id 1193, offset 0, flags [none], proto ICMP (1), length 84)
    10.200.10.4 > 10.200.11.5: ICMP echo reply, id 22, seq 2, length 64
13:59:37.565866 IP (tos 0x0, ttl 64, id 2040, offset 0, flags [none], proto ICMP (1), length 84)
    10.200.10.4 > 10.200.11.5: ICMP echo reply, id 22, seq 3, length 64
13:59:41.021607 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.200.11.5 tell 10.200.11.4, length 28
13:59:41.021736 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.200.11.5 tell 10.200.11.4, length 28
13:59:41.021817 ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.200.11.5 is-at ee:17:4b:31:a1:b9, length 28
As you see, nodeA sent the ping on eth0. Using -I tells ping to use the address 10.200.11.5 “configured” for eth1. This was done just for this explanation. In the real world, the route tables used between source and destination are typically NOT symmetric as they are in this simple example. That’s the point of this post.
I had to learn this the hard way when I installed my first IPsec offload board in a node. This device handled all IPsec encryption. Packets were sent to the device from the Unix server and were put on the wire encrypted using IPsec. However, packets sent back from the remote end arrived at a different interface, thus bypassing decryption. We had a problem. If you ever wonder why Juniper routers have a dedicated IPsec blade in their chassis, accessible via any port in the chassis, it is to solve this problem.
You might have noticed I place “configured” in double quotes when talking about addresses and interfaces. This is another misconception: the interfaces themselves do not really have addresses, the node does. That’s the topic of the next section.
IP addresses are NOT configured on interfaces
You already know this intuitively, but you are so used to the simple case that it is understandable when you forget. Syntax like this example configuration can also mislead you:
ip addr add 192.168.0.1/24 dev eth0
ip addr add 192.168.1.1/24 dev eth1
ip addr add 192.168.2.1/24 dev eth2
ip addr add 172.17.0.1/16 dev docker0
ip addr add 10.200.10.1/24 dev net10
ip addr add 10.200.11.1/24 dev net11
One can get the impression from this example configuration syntax that the link has the address, but that’s not accurate. When you only have one interface on your laptop it’s understandable to think this is the case; likewise if your cloud VM just has an eth0 interface. What you are actually doing is configuring an address on the node itself. In the case of interfaces which connect to broadcast media (aka Ethernet), this familiar command is doing multiple things at once. You are adding an address to the node itself, as mentioned, but also a static route to the subnet reachable via the interface. The assumption is you will want to route to anything on that subnet via this interface. When the interface that has an address assigned to it using ip addr add isn’t on broadcast media (i.e. a point-to-point link), no route will be added for you.
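You can watch the kernel add that connected route for you. A sketch (addresses invented; requires root):

```shell
# Adding an address on a broadcast interface...
ip addr add 192.168.0.1/24 dev eth0
# ...also installs a connected route to the whole subnet, something like:
#   192.168.0.0/24 proto kernel scope link src 192.168.0.1
ip route show dev eth0
```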
If you think about IP being uni-directional as described above, it becomes obvious that you are in fact just adding an address to the node. This is exactly how nodeA was able to use any address configured on nodeA (10.200.11.5) to ping nodeZ’s address 10.200.10.4. The source address didn’t matter; the destination address is looked up in the routing table to choose which interface leads to the next-hop node in the path to that destination. As expected, to reach anything on 10.200.10.0/24 nodeA used eth0. Upon receipt, nodeZ needed to respond to 10.200.11.5; its routing table said to send the packet via eth1. When nodeA receives a packet on eth1 destined for ANY address local to the node, the packet is accepted. The interface the packet arrived on doesn’t matter. NodeA was able to process a packet destined to it that arrived on an interface configured for a different subnet just fine.
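This behavior is easy to reproduce with two network namespaces. A sketch (all names and addresses invented; requires root and iproute2): an address that lives only on A’s loopback is still reachable through A’s veth interface, because the address belongs to the node, not to any particular interface.

```shell
# Two namespaces, A and B, joined by a veth pair.
ip netns add A; ip netns add B
ip link add vA type veth peer name vB
ip link set vA netns A; ip link set vB netns B
ip -n A addr add 192.0.2.1/24 dev vA; ip -n A link set vA up
ip -n B addr add 192.0.2.2/24 dev vB; ip -n B link set vB up
# Put an extra address on A's loopback only -- NOT on the veth.
ip -n A addr add 198.51.100.1/32 dev lo
ip -n A link set lo up
# Tell B how to reach it, then ping: A answers anyway.
ip -n B route add 198.51.100.1/32 via 192.0.2.1
ip netns exec B ping -c 1 198.51.100.1
```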
The takeaway here is that ALL IP addresses configured on the node can be reached via any interface. Think of it this way: if your house has a front door and a back door, regardless of which door I use, I reach you. On broadcast (e.g. Ethernet) networks, if a node receives an ARP request for 10.200.11.4 via eth0, it WILL respond with the Ethernet address of eth0, and vice versa, for any address that is local to the node.
Those who care about security leverage this. IP addresses are placed only on the loopback interface, or on a virtual interface used by the node such as a local bridge (br0, virbr0, podman0, docker0, etc.). A “link local” (LL) address is used for all the others. With IPv6, link-local addresses are common, but they exist for IPv4 too. Using LL addresses in IPv4 has the added benefit of saving (or reclaiming) global IPv4 addresses.
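A sketch of this addressing style (addresses invented; requires root):

```shell
# The service address lives on loopback; the node is reached via routing.
ip addr add 203.0.113.10/32 dev lo
# The physical port gets only an IPv4 link-local (169.254.0.0/16) address,
# which is not routable off-link. (IPv6 link-locals appear automatically.)
ip addr add 169.254.0.1/16 dev eth0
```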
By only using addresses where they are needed, the attack surface becomes smaller. As shown earlier, a node accepts packets for any IP address local to the node, regardless of which interface the packet arrived on. If you have 4, 8, 12, 24, etc. ports (physical or virtual) on your node, and each had a public IP address just so the port can talk to nodes adjacent to it, any of those addresses could be used to hack into the node. Using LL addresses you don’t need firewall rules for “transit” interface addresses used only between nodes. Nothing except the other end of the link can address your end of the link.
Which brings up another common security (and scaling) practice. When connecting nodes (which is what networking does), it is better to use a /31 prefix (IPv4) and only IPv6 link-local addresses. Do this even if you are using broadcast (i.e. Ethernet) media. (If LL is a problem for your site, use /127 prefixes.) Now each end of the prefix (i.e. link) is either the .0 or the .1 address. For example, every ToR node uses .0 and the nodes which connect to it (e.g. Unix servers) use .1. There is now no overhead for ARP or Neighbor Discovery.
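RFC 3021 makes the /31 layout explicit: both addresses in the prefix are host addresses, with no separate network or broadcast address. A sketch for one link (interface names assumed; requires root):

```shell
# ToR side of the point-to-point link:
ip addr add 10.0.0.0/31 dev eth0
# Server side of the same link:
ip addr add 10.0.0.1/31 dev eth0
```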
Absent manually using something like the example syntax above, you need a routing protocol, Ansible, or another orchestration mechanism to configure routes to make use of these interfaces. Modern data centers and enterprises have been doing this for decades. When using routing protocols, it turns out that BGP is better at this than OSPF/IS-IS.
RFC 1918 IPv4 addresses are indeed routable
A claim I often hear is that RFC 1918 addresses, aka “private IPv4 addresses”, are not routable. This is completely false. These are addresses from the 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16 ranges you have probably seen used by e.g. your home router(s). These addresses are not routable ON THE INTERNET. That is, no ISP will accept a route advertisement, from you or anyone else including another ISP, for one of these address prefixes. They are routable on your home network, or on your internal corporate network. I’m able to communicate with colleagues at work who are all over the planet, all using an address from the 10.0.0.0/8 prefix. I can guarantee you that we are not all on the same Layer 2 network. Packets can be, and are, routed from one floor to another, between buildings, between campuses and countries, provided the traffic is intended to stay within the organization.
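The three RFC 1918 ranges are easy to test for. A small POSIX shell sketch (the function name is mine):

```shell
# Return success (0) if the dotted-quad address is in an RFC 1918 range:
# 10.0.0.0/8, 172.16.0.0/12 or 192.168.0.0/16.
is_rfc1918() {
    case "$1" in
        10.*)                                   return 0 ;;
        172.1[6-9].*|172.2[0-9].*|172.3[0-1].*) return 0 ;;
        192.168.*)                              return 0 ;;
        *)                                      return 1 ;;
    esac
}

is_rfc1918 10.1.2.3 && echo "10.1.2.3 is private"
is_rfc1918 8.8.8.8  || echo "8.8.8.8 is public"
```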
IPv6 uses ULAs (Unique Local Addresses) for a similar purpose. These addresses are routable and exist for any communication known to be local to the “domain” (e.g. your home network, your company, etc.). For example, a DNS query for a corporate server would return a ULA to reach that server from inside the company.
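ULAs are drawn from fc00::/7 (RFC 4193), so textually they start with fc or fd (locally generated ones use fd). A small shell sketch (the function name is mine; it checks only the leading hex digits, not full address validity):

```shell
# Return success (0) if the IPv6 address string falls within fc00::/7.
is_ula() {
    case "$1" in
        [fF][cdCD]*) return 0 ;;
        *)           return 1 ;;
    esac
}

is_ula fd12:3456:789a::1 && echo "ULA"
is_ula 2001:db8::1       || echo "not a ULA"
```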
If you need to leave your “domain”, your application picks a globally routable address to use when possible. Failing that, NAT can be used, but a/ only at the border to the public network (e.g. your home router’s connection to your ISP, corporate egress to the Internet, etc.) and b/ never for IPv6.