KubeStory #001: Kubernetes Network Policies
This story is about Kubernetes security and Network Policy implementation.
Kubernetes Network Policies
The default behavior of Kubernetes is to allow all traffic between any two pods in the cluster.
This behavior makes getting started with Kubernetes networking easy and flexible: deploy pods and let the kubelets, the CNI plugin, and the DNS provider do the job. It just works!
However, what happens when an attacker gains access to a backend service and starts probing and scanning the network, looking for vulnerabilities in other services? With open egress, it is then relatively easy to exfiltrate data from the network to an external database.
There are multiple ways to achieve pod isolation, and it can be applied at various layers, from L3 to L7, up to complex routing strategies. Today I want to focus on the NetworkPolicy resource, offered natively by Kubernetes.
Enjoy the story!
Use cases for Network Policies go beyond external exposure
The first time I encountered Kubernetes Network Policies was in 2017, after they became stable with the Kubernetes 1.7 release, even though they had been available to users since 1.3. I was in a meeting with a customer from the gaming industry who wanted to achieve network isolation between namespaces. The company had adopted a tribe structure where every tribe was in charge of its game backend on the cluster. One centralized platform team was responsible for a Kubernetes cluster with all the company requirements and tools to provide a good developer experience.
Nevertheless, after a few months, they realized that some developers were using backend services from other teams. It was not an issue at first, but the responsibility map became harder and harder to navigate. This complexity is why they chose to implement namespace isolation as a baseline, where pods could only communicate within their namespace by default, with access to other teams' services granted explicitly when needed.
It is not always about external security or reducing the blast radius of an attack. Sometimes the use case for Network Policies is to control which internal consumers are authorized to use a service.
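As an illustration, here is a minimal sketch of such a baseline policy (the policy and namespace names are illustrative, not taken from the customer's setup). Applied to a namespace, it only accepts ingress traffic coming from pods in that same namespace:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace-only   # illustrative name
  namespace: team-a                 # illustrative namespace
spec:
  podSelector: {}                   # selects every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}           # an empty podSelector matches all pods in the policy's namespace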
Applying the least privilege principle gives more control over workload exposure. It is also an excellent addition to a defense-in-depth strategy: one more layer in case of intrusion. Even if this mechanism is not perfect, it is still an added protection that makes the system harder to manipulate. With this in mind, it makes sense to use Network Policies in addition to mTLS, auth tokens, or other security tools.
It is now pretty much a given that any production Kubernetes cluster should use Network Policy.
There is already an excellent repository full of recipes and examples that can be copied and pasted. Because so many illustrations exist online, this article does not repeat them.
Working with Network Policies requires automation
Network Policies are Kubernetes resources configured via declarative manifests, usually written as YAML files. A repository close to the application, or a dedicated platform repository, is the best way to manage these files.
With GitOps becoming the de facto way of managing Kubernetes resources, it is now common to deploy policies on clusters automatically. This automation helps organizations with a growing number of rules: clusters with thousands of Network Policies are not rare, and managing that many policies by hand is nearly impossible. Therefore, whatever the tool used in the cluster, Argo CD, GitHub Actions, or Anthos Config Management, the Network Policies should be part of the cluster lifecycle. Automation also gives rollback capability and protection from configuration drift.
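As a sketch of what this can look like with Argo CD (the repository URL and paths below are hypothetical, not from the original setup), a dedicated Application can keep a policies folder in sync and revert any drift:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: network-policies
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/platform-config   # hypothetical platform repository
    targetRevision: main
    path: network-policies                                 # folder holding the policy manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: default       # default namespace for manifests that do not set one
  syncPolicy:
    automated:
      prune: true             # removes policies deleted from Git
      selfHeal: true          # reverts manual changes made on the cluster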
Be careful when deploying these resources: the impact is immediate, and existing open connections may be affected.
Because they are Kubernetes resources, tools like Kyverno, OPA, or Anthos Policy Controller can enforce their usage on clusters. There are interesting interactions possible through admission webhooks, among which:
Enforcing that every pod is targeted by at least one Network Policy
Validating and enforcing Network Policy rules
Creating a default Network Policy for every new namespace (see the sketch after this list)
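For that last point, here is a minimal sketch based on Kyverno generate rules; the policy and rule names are illustrative, and the exact syntax should be checked against the Kyverno version in use. It creates a default deny-all NetworkPolicy in every new namespace:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-default-networkpolicy   # illustrative name
spec:
  rules:
    - name: create-default-deny
      match:
        any:
          - resources:
              kinds:
                - Namespace
      generate:
        apiVersion: networking.k8s.io/v1
        kind: NetworkPolicy
        name: default-deny
        namespace: "{{request.object.metadata.name}}"   # the newly created namespace
        synchronize: true
        data:
          spec:
            podSelector: {}          # targets every pod in the namespace
            policyTypes:
              - Ingress
              - Egress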
Network Policy Implementation
Network Policies are not enforced by Kubernetes itself; they are not a core functionality. Instead, Network Policies act as configuration for the cluster's CNI plugin, which is responsible for enforcing them.
The implementation may vary from one CNI plugin to another, so check what the cluster's CNI plugin supports. For instance, some plugins only handle L3/L4 rules while others also support L7 rules.
Note: L7 rules are not part of the NetworkPolicy object, but some implementations, like CiliumNetworkPolicy, make them possible.
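As an illustration of what such an L7 rule can look like with Cilium (the labels, names, and path below are illustrative; check the Cilium documentation for the exact schema of your version):

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: backend-l7-policy      # illustrative name
spec:
  endpointSelector:
    matchLabels:
      app: backend
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          rules:
            http:
              - method: GET
                path: "/api/.*"   # only GET requests on /api/* are allowed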
Interestingly enough, the initial design proposal only included ingress policies! It is worth reading both the design proposal and the API reference: they contain a few essential implementation outcomes, summarized in an excellent article from Ahmet (in the section "How are Network Policies evaluated").
To stay aligned with the native Kubernetes API, the recommendation is to use L3/L4 rules and, if necessary, add a service mesh to the cluster to handle L7 rules. This recommendation matters even more given the multi-cloud strategy that many companies are implementing. More on this subject later!
Selectors and Rules
Writing Network Policies does not require much information. The resource is composed of:
Ingress rules
Egress rules
Targeted pods
Ingress and Egress rules have the same structure. They contain a list of ports and references to entities via ipBlock, namespaceSelector, and podSelector. Combining entity references within a single rule element is possible, but traffic then has to match all of the combined references at once.
When a Network Policy targets a pod, that pod becomes isolated by default for the listed policy types, and the Ingress and Egress rules act as exceptions. In other words, rules add access; they do not block access.
Try to keep Network Policies as precise as possible. Multiple policies can target the same pod, and their effects are cumulative, so use that to apply a separation of concerns. The sketch below puts these pieces together.
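Here is a minimal sketch combining the three parts; every name, label, and CIDR is illustrative:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-policy        # illustrative name
  namespace: game-backend     # illustrative namespace
spec:
  podSelector:
    matchLabels:
      app: backend            # the targeted pods
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend   # pods in the same namespace
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              team: data      # any pod in namespaces labeled team=data
    - to:
        - ipBlock:
            cidr: 10.0.0.0/24 # an external range, for example a managed database
      ports:
        - protocol: TCP
          port: 5432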
Back to the story
The gaming customer mentioned earlier recently contacted us about two subjects, which inspired this KubeStory.
The first issue was an email sent by Google regarding their usage of CiliumNetworkPolicy. The email indicated that Google does not support the Cilium CRDs and that their use may lead to undesirable behavior. Moreover, upgrading the control plane to a newer version that does not support these CRDs will result in the deletion of these policies.
Indeed, CiliumNetworkPolicy is not part of the native Kubernetes API and requires additional CRDs. So the question was unavoidable: why was it possible in earlier versions of GKE? Furthermore, why does Google provide something that it does not support?
A narrative of unneeded details
The origin of this problem is a desire for openness and transparency. On the 19th of August 2020, Google published an article introducing a new feature called Dataplane V2. That desire for openness (I guess) led the article to mention two technologies: eBPF and Cilium. The mention of Cilium brought delight and excitement to the community, because this remarkable technology is much more than a simple CNI plugin. One example was the added CiliumNetworkPolicy capability.
This detail was unneeded and triggered unwanted usage. The customer seized what looked like an excellent opportunity to go from NetworkPolicy to CiliumNetworkPolicy and add new capabilities to their cluster, like L7 policies (path and FQDN). Moreover, this was fantastic news because multiple Kubernetes providers support Cilium, so it became the default network strategy for their multi-cloud platforms.
However, these resources are not supported, and Google disabled the CRDs to avoid uncontrolled usage. It is essential to understand that Dataplane V2 is not a managed Cilium installation.
Anetd, the on-node agent for Dataplane V2, is not the Cilium agent. It includes many changes required for some of the GKE functionality.
Anetd is also not the same as netd: anetd replaces kube-proxy and Calico, while netd implements GKE-specific functionality like intranode visibility, Workload Identity, and others.
A simple change was made in the configuration of Anetd to disable the installation of the targeted CRDs.
To check an installation:
kubectl -n kube-system get ds anetd -o yaml
And look for:
--disable-network-policy-crd=true
The moral of this story? Even if we love to deep dive and give as much information as possible, sometimes implementation details should stay what they are: implementation details.
The second reported issue was about the management of policies at scale.
How to manage hundreds of policies
The problem with YAML files is that they multiply quickly, and it becomes hard to keep an understandable hierarchy of files.
This customer had already implemented GitOps automation, so it was not a question about automation but methodology.
To find a solution, we wrote a list of questions to help our decision making:
Who will write the YAML (developers, platform team, SREs)? We will also use this data to produce RBAC rules.
What is the lifecycle of this object? Is it linked to the application or to the cluster? Some policies will live in the service folder, and others in the cluster configuration.
Does it contain confidential information? The use of gitignore can affect where the file lives.
What is the target of the policy? We decided to separate policies targeting services from policies used for infrastructure (monitoring, CI/CD, and others).
The answers to these questions will be different for every company.
We decided to build a list of questions, not an out-of-the-box folder organization, because we knew these answers would not be universal.
This is how the story ends.
This story was about NetworkPolicy and its usage. Hopefully, it is now apparent that every production Kubernetes cluster should implement it.
If policy examples can help, look at the excellent existing repository of recipes.
Use the comment section to tell your NetworkPolicy story!
Share your feedback or your own stories here!