K8S Networking
K8S vs Docker networking model
Docker networking model
Docker relies on a virtual bridge network called docker0. It is a per-host private network where containers get attached (and thus can reach each other) and are allocated a private IP address. This means containers running on different machines are not able to communicate with each other (as they are attached to different hosts' networks). In order to communicate across nodes with Docker, we have to map host ports to container ports and proxy the traffic.
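To make the host-port mapping concrete, here is a minimal Python sketch of a userland TCP port proxy in the spirit of what Docker does when a host port is published. The host port 8080 and the container address 172.17.0.2:80 are illustrative assumptions, not values taken from any real setup.

```python
# Minimal TCP port-forward sketch: traffic hitting a published host port is
# relayed to a container's private bridge IP. Addresses/ports are made up.
import socket
import threading

HOST_PORT = 8080                      # port published on the host (assumed)
CONTAINER_ADDR = ("172.17.0.2", 80)   # container's docker0-assigned IP:port (assumed)

def pipe(src: socket.socket, dst: socket.socket) -> None:
    """Copy bytes from src to dst until the connection closes."""
    try:
        while data := src.recv(4096):
            dst.sendall(data)
    except OSError:
        pass
    finally:
        src.close()
        dst.close()

def serve() -> None:
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("0.0.0.0", HOST_PORT))
    listener.listen()
    while True:
        client, _ = listener.accept()
        upstream = socket.create_connection(CONTAINER_ADDR)
        # Relay both directions concurrently.
        threading.Thread(target=pipe, args=(client, upstream), daemon=True).start()
        threading.Thread(target=pipe, args=(upstream, client), daemon=True).start()

if __name__ == "__main__":
    serve()
```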
K8S networking model
Kubernetes supports multi-host networking, in which pods are able to communicate with each other by default, regardless of which host they live on. Kubernetes does not provide an implementation of this model itself; rather, it relies on third-party tools that comply with the following requirements:
All pods are able to communicate with each other without NAT;
All nodes are able to communicate with pods without NAT;
A pod's IP address is the same from inside and outside the pod.
K8S networking implementation
Flannel - a very simple overlay network that satisfies the Kubernetes requirements. Flannel runs an agent on each host and allocates a subnet lease to each of them out of a larger, preconfigured address space. Flannel creates a flat network, called an overlay network, which runs above the host network.
Project Calico - an open source container networking provider and network policy engine. Calico provides a highly scalable networking and network policy solution for connecting Kubernetes pods based on the same IP networking principles as the internet. Calico can be deployed without encapsulation or overlays to provide high-performance, high-scale data center networking.
Weave Net - a cloud native networking toolkit which provides a resilient and simple to use (does not require any configuration) network for Kubernetes and its hosted applications. It provides various functionalities like scaling, service discovery, performance without complexity, and secure networking.
Container to container networking
Containers within the same Pod talk to each other via localhost
Containers within a Pod all share the IP address and port space of the network namespace assigned to the Pod, and they can find each other via localhost since they reside in the same namespace.
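As an illustration, the sketch below runs a tiny server and client against 127.0.0.1. The two sides stand in for two containers that, in a real Pod, would be separate containers sharing the Pod's network namespace; here they are simulated with threads in one process, and port 9000 is an arbitrary choice.

```python
# Two containers in the same Pod share one network namespace, so a server in
# one container is reachable from the other at 127.0.0.1. Both sides are
# simulated with threads purely for illustration.
import socket
import threading
import time

def sidecar_server() -> None:
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", 9000))   # bound inside the shared namespace
    srv.listen(1)
    conn, _ = srv.accept()
    conn.sendall(b"hello from the sidecar container\n")
    conn.close()

threading.Thread(target=sidecar_server, daemon=True).start()
time.sleep(0.2)                      # crude wait for the listener to start

# The "main" container dials localhost and reaches the sidecar directly.
with socket.create_connection(("127.0.0.1", 9000)) as client:
    print(client.recv(1024).decode())
```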
Pod to pod networking
Pod to pod on same node
(veth0-eth0) and (veth1-eth0): the virtual Ethernet device pairs that connect the network namespaces. To connect Pod namespaces, we can assign one side of each veth pair to the root network namespace and the other side to the Pod's network namespace. Each veth pair works like a patch cable, connecting the two sides and allowing traffic to flow between them.
cbr0: A Linux Ethernet bridge is a virtual Layer 2 networking device used to unite two or more network segments, working transparently to connect two networks together. The bridge operates by maintaining a forwarding table between sources and destinations by examining the destination of the data packets that travel through it and deciding whether or not to pass the packets to other network segments connected to the bridge. The bridging code decides whether to bridge data or to drop it by looking at the MAC-address unique to each Ethernet device in the network. Bridges implement the ARP protocol to discover the link-layer MAC address associated with a given IP address. When a data frame is received at the bridge, the bridge broadcasts the frame out to all connected devices (except the original sender) and the device that responds to the frame is stored in a lookup table. Future traffic with the same IP address uses the lookup table to discover the correct MAC address to forward the packet to.
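A rough sketch of this node-side plumbing, driving the standard ip tooling from Python: create a Pod network namespace, a veth pair, attach one end to a cbr0 bridge in the root namespace and move the other end into the Pod namespace. The namespace name podns, the veth0/veth1 and cbr0 device names, and the 10.0.0.0/24 addresses are illustrative assumptions, the commands require root privileges, and a real CNI plugin performs the equivalent steps (plus IP allocation and routing) when a Pod is created.

```python
# Illustrative veth + bridge wiring for a Pod namespace. All names and
# addresses are made up; requires root to actually run.
import subprocess

def sh(cmd: str) -> None:
    print("+", cmd)
    subprocess.run(cmd.split(), check=True)

sh("ip netns add podns")                          # the Pod's network namespace
sh("ip link add veth0 type veth peer name veth1") # virtual patch cable
sh("ip link set veth1 netns podns")               # one end into the Pod namespace
sh("ip link add name cbr0 type bridge")           # the node-local bridge
sh("ip link set veth0 master cbr0")               # other end attached to the bridge
sh("ip addr add 10.0.0.1/24 dev cbr0")            # bridge holds the subnet gateway IP
sh("ip link set cbr0 up")
sh("ip link set veth0 up")
sh("ip netns exec podns ip addr add 10.0.0.2/24 dev veth1")  # the Pod's IP
sh("ip netns exec podns ip link set veth1 up")
sh("ip netns exec podns ip link set lo up")
sh("ip netns exec podns ip route add default via 10.0.0.1")  # default route via cbr0
```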
Pod to pod on different node
When traffic reaches cbr0 at VM1, ARP will fail at the bridge because there is no device connected to the bridge with the correct MAC address for the packet. On failure, the bridge sends the packet out the default route — the root namespace’s eth0 device.
Once the traffic leaves the node, we assume that the network can route the packet to the correct Node based on the CIDR block assigned to the node.
Generally speaking, each Node knows how to deliver packets to Pods that are running within it.
Once a packet reaches a destination Node, packets flow the same way they do for routing traffic between Pods on the same Node.
The network routes traffic for Pod IPs to the correct Node that is responsible for those IPs. This is network specific and is handled by the CNI plugin implementation.
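In essence, routing a Pod IP to the right Node is a containing-prefix lookup over the per-node Pod CIDR blocks. A minimal sketch of that lookup, with made-up node names and CIDRs:

```python
# Minimal sketch of mapping a Pod IP to the owning node based on each node's
# assigned Pod CIDR block. Node names and CIDRs are illustrative.
from ipaddress import ip_address, ip_network

NODE_POD_CIDRS = {
    "node-1": ip_network("10.244.1.0/24"),
    "node-2": ip_network("10.244.2.0/24"),
}

def node_for_pod_ip(pod_ip: str) -> str:
    addr = ip_address(pod_ip)
    for node, cidr in NODE_POD_CIDRS.items():
        if addr in cidr:          # containing-prefix match
            return node
    raise LookupError(f"no node owns {pod_ip}")

print(node_for_pod_ip("10.244.2.17"))   # -> node-2
```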
Pod to service networking
Pods are ephemeral, so Kubernetes uses Services to provide a stable entry point for a set of Pods. Each Service is assigned a single virtual IP (the cluster IP). Any traffic addressed to the virtual IP is routed to the set of Pods associated with that virtual IP.
iptables load balances traffic from a Service to Pods
Kubernetes automatically creates and maintains a distributed in-cluster load balancer that distributes traffic to a Service's associated healthy Pods. One way this is implemented is with iptables.
iptables is a user-space program providing a table-based system for defining rules for manipulating and transforming packets using the netfilter framework.
Netfilter is a framework provided by Linux that allows various networking-related operations to be implemented in the form of customized handlers. Netfilter offers various functions and operations for packet filtering, network address translation, and port translation, which provides the functionality required for directing packets through a network, as well as for providing the ability to prohibit packets from reaching sensitive locations within a computer network.
kube-proxy watches for changes to Services and Pods. If a change updates the cluster IP address or a Pod IP address, kube-proxy updates the iptables rules so that traffic directed at a Service is routed to a backing Pod.
The iptables rules match traffic destined for a Service's virtual IP; on a match, a random Pod IP address is selected from the set of available Pods, and the iptables rule rewrites the packet's destination IP address from the Service's virtual IP to the IP of the selected Pod. On the return path the source IP address is the destination Pod's, so iptables again rewrites the IP header, replacing the Pod IP with the Service's IP, and the client Pod believes it has been communicating solely with the Service's IP the entire time.
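To make the rewrite-and-remember behaviour concrete, here is a minimal Python sketch of the effect of those rules. It is not how kube-proxy actually implements them (kube-proxy programs iptables chains in the kernel); the Service VIP and Pod IPs are made up.

```python
# Sketch of what the kube-proxy-installed iptables rules accomplish for a
# ClusterIP Service: DNAT to a randomly chosen backend Pod, remember the
# choice (conntrack), and reverse-rewrite the reply. IPs are illustrative.
import random

SERVICE_VIP = "10.96.0.10"
BACKENDS = ["10.244.1.5", "10.244.2.7"]   # healthy Pods backing the Service
conntrack = {}                            # (src_ip, src_port, vip, vport) -> chosen Pod IP

def dnat_outbound(src, sport, dst, dport):
    """Packet leaving a client Pod, addressed to the Service VIP."""
    if dst != SERVICE_VIP:
        return src, sport, dst, dport               # not Service traffic; untouched
    pod = conntrack.setdefault((src, sport, dst, dport), random.choice(BACKENDS))
    return src, sport, pod, dport                   # destination rewritten to the Pod IP

def unsnat_return(src, sport, dst, dport):
    """Reply from the backend Pod on its way back to the client."""
    for (csrc, csport, vip, vport), pod in conntrack.items():
        if (src, sport, dst, dport) == (pod, vport, csrc, csport):
            return vip, sport, dst, dport           # source rewritten back to the VIP
    return src, sport, dst, dport

pkt = dnat_outbound("10.244.1.2", 40000, SERVICE_VIP, 80)
print("towards backend:", pkt)                      # dst is now a Pod IP
print("reply as seen by client:", unsnat_return(pkt[2], 80, "10.244.1.2", 40000))
```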
IPVS load balances traffic from a Service to Pods
IPVS is another in-cluster load balancing solution which is also built on top of Netfilter. IPVS is specifically designed for load balancing and uses more efficient data structures (hash tables), allowing for almost unlimited scale compared to iptables. When creating a Service load balanced with IPVS, three things happen:
A dummy IPVS interface is created on the Node
The Service’s IP address is bound to the dummy IPVS interface
IPVS servers are created for each Service IP address
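To see why the data structure matters, here is a toy Python comparison of a sequential rule scan (iptables-style) against a hash-table lookup (IPVS-style); the Service and Pod IPs are synthetic, and this only models the lookup cost, not the kernel implementations.

```python
# Toy comparison of the lookup models: iptables evaluates a rule list
# sequentially (cost grows with the number of Services), while IPVS keys its
# virtual servers in a hash table (roughly constant-time lookup).
import random

services = {f"10.96.0.{i}": [f"10.244.1.{i}"] for i in range(1, 200)}

# iptables-style: an ordered rule list, scanned until a match is found.
rule_list = list(services.items())

def iptables_lookup(dst_ip):
    for vip, pods in rule_list:          # sequential scan
        if dst_ip == vip:
            return random.choice(pods)
    return None

# IPVS-style: the destination IP is a hash-table key.
def ipvs_lookup(dst_ip):
    pods = services.get(dst_ip)          # average constant-time lookup
    return random.choice(pods) if pods else None

print(iptables_lookup("10.96.0.150"), ipvs_lookup("10.96.0.150"))
```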
Pod to Service traffic workflow
Traffic addressed to a Service leaves Pod1 through its eth0 device (src: pod1, dst: svc1)
It then reaches the bridge cbr0; the ARP protocol running on the bridge does not know about the Service IP
The traffic is therefore sent out the default route toward eth0 in the root network namespace
Packets are filtered through iptables before reaching eth0. iptables uses the rules installed on the Node by kube-proxy in response to Service or Pod events to rewrite the destination of the packet from the Service IP to a specific Pod IP (src: pod1, dst: pod4)
Traffic then flows to the Pod using the Pod-to-Pod routing we’ve already examined
The Linux kernel's conntrack utility is leveraged by iptables to remember the Pod choice that was made so future traffic is routed to the same Pod (barring any scaling events).
Service to Pod workflow
The Pod that receives this packet will respond, identifying the source IP as its own and the destination IP as the Pod that originally sent the packet (src: pod4, dst: pod1)
Upon entry into the Node, the packet flows through iptables, which uses conntrack to remember the choice it previously made and rewrites the source of the packet to be the Service's IP instead of the Pod's IP (src: svc1, dst: pod1)
From here, the packet flows through the bridge to the virtual Ethernet device paired with the Pod's namespace
And to the Pod’s Ethernet device as we’ve seen before
Internet to Service networking
Egress
Internet gateway: the Internet gateway serves two purposes: providing a target in your VPC route tables for traffic that can be routed to the Internet, and performing network address translation (NAT) for any instances that have been assigned public IP addresses. The NAT translation is responsible for changing the Node's internal IP address, which is private to the cluster, to an external IP address that is reachable from the public Internet.
Pod to Internet traffic workflow
Traffic travels through the eth0-veth0 pair (src: pod1, dst: 8.8.8.8)
The bridge cbr0 finds no attached device matching the destination, so it moves the traffic out the default route toward the Node's eth0
iptables rewrites the packet's source from pod1 to vm-internal-ip (if the source were left unchanged, the Internet gateway would reject the packet, since its NAT only recognizes the IP addresses of the VMs attached to it) (src: vm-internal-ip, dst: 8.8.8.8)
Traffic leaves the VM and reaches the Internet gateway
The Internet gateway performs another NAT, rewriting the source IP from vm-internal-ip to vm-external-ip (src: vm-external-ip, dst: 8.8.8.8)
Traffic finally reaches the public Internet.
On the way back, the packet follows the same path and any source IP mangling is undone so that each layer of the system receives the IP address that it understands: VM-internal at the Node or VM level, and a Pod IP within a Pod’s namespace.
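The two source rewrites on the way out, and their reversal on the way back, can be summarised with a small sketch; all addresses below are made up, and in a real cluster the node relies on conntrack to undo its rewrite.

```python
# Sketch of the two source-address rewrites on the egress path and how they
# are undone for the reply. All addresses are illustrative.
POD_IP, NODE_INTERNAL_IP, NODE_EXTERNAL_IP = "10.244.1.2", "10.128.0.4", "34.1.2.3"

def egress(src, dst):
    src = NODE_INTERNAL_IP if src == POD_IP else src             # iptables SNAT on the node
    src = NODE_EXTERNAL_IP if src == NODE_INTERNAL_IP else src   # NAT at the internet gateway
    return src, dst

def ingress_reply(src, dst):
    dst = NODE_INTERNAL_IP if dst == NODE_EXTERNAL_IP else dst   # gateway un-NATs to the VM
    dst = POD_IP if dst == NODE_INTERNAL_IP else dst             # node un-NATs to the Pod
    return src, dst

out = egress(POD_IP, "8.8.8.8")
print("on the internet:", out)                                # ('34.1.2.3', '8.8.8.8')
print("back in the pod:", ingress_reply("8.8.8.8", out[0]))   # ('8.8.8.8', '10.244.1.2')
```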
Ingress
L4 Ingress: LoadBalancer to Service workflow
Users specify LoadBalancer as the Service type when they create the Service.
Traffic reaches the load balancer
The load balancer redirects the traffic to one of the VMs backing the Service
iptables on that VM directs the incoming traffic to the correct Pod using the internal load-balancing rules described earlier
For the return traffic, iptables and conntrack are used to rewrite the IPs correctly on the return path, as we saw earlier.
L7 Ingress: Ingress to Service workflow
The life of a packet flowing through an Ingress is very similar to that of a LoadBalancer. The key differences are:
that an Ingress is aware of URL paths and can route traffic to Services based on their path (see the path-matching sketch after this list)
that the initial connection between the Ingress and the Node is through the port exposed on the Node for each service
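As an illustration of that path awareness, the sketch below mimics the longest-prefix path matching an Ingress controller performs when picking a backing Service. The paths and Service names are invented, and a real Ingress expresses these rules declaratively in its manifest rather than in code.

```python
# Sketch of the L7 routing decision: choose the backing Service whose path
# prefix is the longest match for the request path. Paths and Service names
# are made up for illustration.
INGRESS_RULES = {
    "/api":    "api-service",
    "/api/v2": "api-v2-service",
    "/static": "static-service",
    "/":       "frontend-service",
}

def route(request_path: str) -> str:
    matches = [p for p in INGRESS_RULES if request_path.startswith(p)]
    return INGRESS_RULES[max(matches, key=len)]   # longest-prefix wins

print(route("/api/v2/users"))   # -> api-v2-service
print(route("/index.html"))     # -> frontend-service
```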
Differences between NodePort, LoadBalancer, and Ingress
Terminology
Layer 2 Networking
Layer 2 is the data link layer providing Node-to-Node data transfer. It defines the protocol to establish and terminate a connection between two physically connected devices. It also defines the protocol for flow control between them.
Layer 4 Networking
The transport layer controls the reliability of a given link through flow control. In TCP/IP, this layer refers to the TCP protocol for exchanging data over an unreliable network.
Layer 7 Networking
The application layer is the layer closest to the end user, which means both the application layer and the user interact directly with the software application. This layer interacts with software applications that implement a communicating component. Typically, Layer 7 Networking refers to HTTP.
NAT — Network Address Translation
NAT or network address translation is an IP-level remapping of one address space into another. The mapping happens by modifying network address information in the IP header of packets while they are in transit across a traffic routing device.
A basic NAT is a simple mapping from one IP address to another. More commonly, NAT is used to map multiple private IP addresses onto one publicly exposed IP address. Typically, a local network uses a private IP address space and a router on that network is given a private address in that space. The router is then connected to the Internet with a public IP address. As traffic is passed from the local network to the Internet, the source address for each packet is translated from the private address to the public address, making it seem as though the request is coming directly from the router. The router maintains connection tracking to forward replies to the correct private IP on the local network.
NAT provides an additional benefit of allowing large private networks to connect to the Internet using a single public IP address, thereby conserving the number of publicly used IP addresses.
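A small sketch of that many-to-one mapping with connection tracking; the public IP, private addresses, and port numbers are illustrative, and real routers track flows in the kernel rather than in a Python dict.

```python
# Many-to-one NAT sketch: several private hosts share one public IP, and the
# router remembers each flow by the public source port it assigned so replies
# can be forwarded back. Addresses and ports are made up.
import itertools

PUBLIC_IP = "203.0.113.5"
_next_port = itertools.count(40000)      # public source ports handed out by the router
nat_table = {}                           # public_port -> (private_ip, private_port)

def outbound(private_ip, private_port, dst_ip, dst_port):
    public_port = next(_next_port)
    nat_table[public_port] = (private_ip, private_port)
    return (PUBLIC_IP, public_port, dst_ip, dst_port)    # rewritten source

def inbound(src_ip, src_port, public_port):
    private_ip, private_port = nat_table[public_port]    # connection tracking lookup
    return (src_ip, src_port, private_ip, private_port)  # rewritten destination

pkt = outbound("192.168.1.10", 51515, "8.8.8.8", 53)
print("on the internet:", pkt)
print("reply delivered to:", inbound("8.8.8.8", 53, pkt[1]))
```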
SNAT — Source Network Address Translation
SNAT simply refers to a NAT procedure that modifies the source address of an IP packet. This is the typical behaviour for the NAT described above.
DNAT — Destination Network Address Translation
DNAT refers to a NAT procedure that modifies the destination address of an IP packet. DNAT is used to publish a service resting in a private network to a publicly addressable IP address.
Network Namespace
In networking, each machine (real or virtual) has an Ethernet device (that we will refer to as eth0). All traffic flowing in and out of the machine is associated with that device. In truth, Linux associates each Ethernet device with a network namespace — a logical copy of the entire network stack, with its own routes, firewall rules, and network devices. Initially, all the processes share the same default network namespace from the init process, called the root namespace. By default, a process inherits its network namespace from its parent and so, if you don't make any changes, all network traffic flows through the Ethernet device specified for the root network namespace.
veth — Virtual Ethernet Device Pairs
Computer systems typically consist of one or more networking devices — eth0, eth1, etc — that are associated with a physical network adapter which is responsible for placing packets onto the physical wire. Veth devices are virtual network devices that are always created in interconnected pairs. They can act as tunnels between network namespaces to create a bridge to a physical network device in another namespace, but can also be used as standalone network devices. You can think of a veth device as a virtual patch cable between devices — what goes in one end will come out the other.
bridge — Network Bridge
A network bridge is a device that creates a single aggregate network from multiple communication networks or network segments. Bridging connects two separate networks as if they were a single network. Bridging uses an internal data structure to record the location that each packet is sent to as a performance optimization.
CIDR — Classless Inter-Domain Routing
CIDR is a method for allocating IP addresses and performing IP routing. With CIDR, IP addresses consist of two groups: the network prefix (which identifies the whole network or subnet), and the host identifier (which specifies a particular interface of a host on that network or subnet). CIDR represents IP addresses using CIDR notation, in which an address or routing prefix is written with a suffix indicating the number of bits of the prefix, such as 192.0.2.0/24 for IPv4. An IP address is part of a CIDR block, and is said to belong to the CIDR block if the initial n bits of the address and the CIDR prefix are the same.
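For example, the "initial n bits" membership test can be checked either explicitly with bit masks or with Python's standard ipaddress module, shown here for the 192.0.2.0/24 block from the text (the host 192.0.2.57 is an arbitrary example address).

```python
# Checking CIDR membership two ways: explicit prefix-bit comparison and the
# ipaddress standard-library module.
from ipaddress import ip_address, ip_network

addr, net = ip_address("192.0.2.57"), ip_network("192.0.2.0/24")

# Explicit check: the first 24 bits of the address equal the network prefix.
mask = ((1 << 32) - 1) ^ ((1 << (32 - net.prefixlen)) - 1)   # i.e. 255.255.255.0
print((int(addr) & mask) == int(net.network_address))        # True

# Same check via the standard library.
print(addr in net)                                           # True
```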
CNI — Container Network Interface
CNI (Container Network Interface) is a Cloud Native Computing Foundation project consisting of a specification and libraries for writing plugins to configure network interfaces in Linux containers. CNI concerns itself only with network connectivity of containers and removing allocated resources when the container is deleted.
VIP — Virtual IP Address
A virtual IP address, or VIP, is a software-defined IP address that doesn’t correspond to an actual physical network interface.
netfilter — The Packet Filtering Framework for Linux
netfilter is the packet filtering framework in Linux. The software implementing this framework is responsible for packet filtering, network address translation (NAT), and other packet mangling.
netfilter, ip_tables, connection tracking (ip_conntrack, nf_conntrack) and the NAT subsystem together build the major parts of the framework.
iptables — Packet Mangling Tool
iptables is a program that allows a Linux system administrator to configure the netfilter and the chains and rules it stores. Each rule within an IP table consists of a number of classifiers (iptables matches) and one connected action (iptables target).
conntrack — Connection Tracking
conntrack is a tool built on top of the Netfilter framework to handle connection tracking. Connection tracking allows the kernel to keep track of all logical network connections or sessions, and direct packets for each connection or session to the correct sender or receiver. NAT relies on this information to translate all related packets in the same way, and iptables can use this information to act as a stateful firewall.
IPVS — IP Virtual Server
IPVS implements transport-layer load balancing as part of the Linux kernel.
IPVS is a tool similar to iptables. It is based on the Linux kernel’s netfilter hook function, but uses a hash table as the underlying data structure. That means, when compared to iptables, IPVS redirects traffic much faster, has much better performance when syncing proxy rules, and provides more load balancing algorithms.
DNS — The Domain Name System
The Domain Name System (DNS) is a decentralized naming system for associating system names with IP addresses. It translates domain names to numerical IP addresses for locating computer services.