It is time to let go of your service mesh dream
But you can still find a solution to your problem
This week's been a bit of a rollercoaster on the tech side! I tangled with setting up Cilium, had fun figuring out some marketplace integration, and jumped into a debugging adventure. But there's something else that’s been buzzing in my head.
It all started with what I thought would be a chill meeting with our bigger team and one of our product managers. We were chatting about the cool new stuff in Cilium 1.14 when our PM tossed out a question that’s been sticking with me:
"Do they really need a service mesh?"
This zinger landed in the middle of our talk about moving from other service meshes like Istio and Linkerd over to Cilium, and boy, did it come out of left field.
I want to walk you through how we arrived at this point!
Service meshes are everywhere except in clusters.
Navigating IT trends can be quite an eye-opener. As a dedicated Kubernetes enthusiast, I initially believed Kubernetes was universally adopted. That belief persisted until I encountered the actual adoption data. It's a fascinating realization: how deep we are in the tech world shapes the content we consume, which in turn feeds the search and recommendation algorithms that serve us more of the same. This digital echo chamber influences not only our online experience but also our real-world interactions and discussions.
The more we engage with specific technology content, the more we're fed similar information, subtly shaping our perceptions. This phenomenon creates a feedback loop, leading to the belief that certain technologies, like Kubernetes in my case, are more widespread than they are. It's a reminder that our tech 'bubble' can sometimes distort our view of the broader IT landscape.
This realization underscores the importance of stepping back to gain a broader perspective. It's crucial to be mindful of the content we consume and actively seek diverse viewpoints to avoid echo chambers. This approach helps us understand the diversity and variance in technology adoption across different sectors and regions.
Perceived Ubiquity vs. Actual Adoption
A 2020 CNCF survey indicated that only about 27% of respondents used a service mesh in production. By 2022, adoption was growing but still low relative to the overall landscape.
This suggests that service mesh technologies like Istio, Linkerd, and Consul are widely discussed in the industry and have a significant presence in cloud-native discussions. However, their practical, real-world usage might be less extensive than the conversation around them implies.
Several factors could be at play here. Perhaps the additional complexity and overhead of implementing a service mesh might deter some users, especially when their existing infrastructure is already meeting their needs without it. Moreover, the Kubernetes landscape continuously evolves, with different organizations having varied requirements and maturity levels. These dynamics might influence the slower-than-expected adoption of service meshes as a standard practice.
Complexity as a Barrier
I don't have concrete data to back this up. So, as always, approach online information with a healthy dose of skepticism! From my experience and numerous discussions with customers about their Kubernetes usage, the primary challenge with service meshes seems to revolve around their complexity.
It's not uncommon to encounter situations where, for instance, someone is managing a service mesh in production with only path forwarding enabled or perhaps just for mutual TLS (mTLS). In these cases, they often highlight adoption issues – developers either aren't keen on using it or simply don't feel the need.
Here's my top list of insights gleaned from these conversations:
Configuration Overhead:
Technical Aspect: Service meshes often require intricate configuration to tailor their behavior to specific needs. This includes setting up routing rules, policies for retries and timeouts, and security parameters.
Impact: The initial setup can be daunting for teams not well-versed in these configurations. Small misconfigurations can lead to significant issues in production, making the adoption risky for teams without sufficient expertise.
Integration with Existing Systems:
Technical Aspect: Integrating a service mesh into an existing infrastructure can be complex, especially one not initially designed with a service mesh in mind. It involves understanding existing network flows and ensuring they are compatible with the mesh's operational model.
Impact: Organizations may find the process disruptive and resource-intensive, particularly if it requires refactoring applications or modifying network policies.
Performance Overhead:
Technical Aspect: Service meshes introduce an additional layer in the network stack, which can lead to increased latency and resource consumption. The impact varies based on the architecture and the specific service mesh used.
Impact: Even a tiny latency or resource usage increase can be significant in performance-sensitive environments. This necessitates careful planning and resource allocation, adding to the complexity of deployment.
Learning Curve and Expertise:
Technical Aspect: Understanding the full capabilities of a service mesh, such as Istio or Linkerd, involves a steep learning curve. This includes grasping concepts like service discovery, load balancing, circuit breaking, observability, and security.
Impact: Teams must invest time and resources in training to leverage a service mesh effectively. In organizations where resources are limited, this can be a significant barrier.
Maintenance and Upgrades:
Technical Aspect: Regular maintenance, updates, and troubleshooting of a service mesh demand high operational expertise.
Impact: Keeping the service mesh up-to-date and running smoothly can be a continuous challenge, requiring dedicated personnel and potentially leading to downtime during upgrades.
Vendor Ecosystem and Lock-In:
Technical Aspect: There's a growing ecosystem of service mesh providers, each with its own features and configurations.
Impact: Choosing a specific service mesh can lead to vendor lock-in, complicating future infrastructure changes and requiring a thorough understanding of the chosen technology's roadmap and support structure.
How do you solve your requirements?
I’m not implying that your platform has no unmet needs or that your application teams' requirements don't exist. Rather, we may have overlooked capabilities that are already present in the tools we run today.
Let’s examine the top three use cases commonly highlighted in service mesh presentations and see how your current CNI, specifically Cilium, addresses these needs.
You want mTLS.
Take mutual TLS (mTLS), for example. I’ve yet to encounter someone who can look me in the eye and assert they specifically need mTLS. More often, what they actually require are the underlying solutions that mTLS provides.
Digging deeper, we usually find that the core need is an identity system: one that reliably establishes and verifies the identities of both sender and receiver, combined with encryption in transit. A service mesh might not be necessary if these are your primary concerns. Your existing setup with Cilium could well be equipped to handle these requirements effectively.
Transparent Encryption:
Cilium can leverage IPsec to provide transparent encryption for pod-to-pod traffic. IPsec is a suite of protocols that supports the encryption of IP packets at the network layer. It can ensure that the data in transit is not visible to unauthorized entities, similar to how mTLS encrypts application-level traffic.
The advantage of using IPsec is that it operates transparently to the application, meaning that developers don't need to change their code to enable encrypted communication. (That transparency is also one reason people love sidecars.)
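To make the IPsec setup a little more concrete, here is a minimal sketch in Python that builds the two pieces I would expect to need: the Helm values that switch encryption on, and the Secret holding the key material. The value names (`encryption.enabled`, `encryption.type`), the Secret name `cilium-ipsec-keys`, and the key string format are assumptions based on the Cilium documentation as I remember it for recent releases, so check them against the docs for your version. The helper name `ipsec_key_material` is made up for this example.

```python
import json
import secrets


def ipsec_key_material(key_id: int = 3, key_bytes: int = 20) -> str:
    """Build the value for the `keys` entry of the IPsec Secret.

    Assumed format: "<key-id> <algorithm> <hex-key> <icv-bits>".
    20 random bytes = 16-byte AES-128 key + 4-byte salt for GCM.
    """
    hex_key = secrets.token_hex(key_bytes)
    return f"{key_id} rfc4106(gcm(aes)) {hex_key} 128"


# Helm values that enable transparent encryption (assumed value names).
helm_values = {"encryption": {"enabled": True, "type": "ipsec"}}

# Secret the Cilium agent reads its key from (assumed name and namespace).
ipsec_secret = {
    "apiVersion": "v1",
    "kind": "Secret",
    "metadata": {"name": "cilium-ipsec-keys", "namespace": "kube-system"},
    "stringData": {"keys": ipsec_key_material()},
}

if __name__ == "__main__":
    # JSON is valid YAML, so the output can be fed to helm/kubectl as-is.
    print(json.dumps(helm_values, indent=2))
    print(json.dumps(ipsec_secret, indent=2))
```

Nothing in this sketch touches application code, which is the whole point: the change lives entirely in the platform configuration.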
Network Policy for Secure Identity:
Cilium derives a security identity for each pod from its Kubernetes labels (pods with the same set of security-relevant labels share an identity), which is more robust and scalable than simple IP-based ACL systems. Network policies in Cilium can then be enforced based on these identities, which allows administrators to define who (which pods) can talk to whom (which other pods), even across different namespaces.
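As an illustration of identity-based policy, the sketch below builds a `CiliumNetworkPolicy` manifest in Python that lets pods with one set of labels receive traffic only from pods with another set, optionally from a different namespace. The function name `allow_from` and the example labels and namespaces are hypothetical. The namespace selector label `k8s:io.kubernetes.pod.namespace` is what I understand Cilium uses for cross-namespace matching, so verify it against your version.

```python
import json


def allow_from(name, namespace, target_labels, source_labels, source_namespace=None):
    """Return a CiliumNetworkPolicy allowing ingress by label identity."""
    source = dict(source_labels)
    if source_namespace is not None:
        # Assumed label for selecting peers in another namespace.
        source["k8s:io.kubernetes.pod.namespace"] = source_namespace
    return {
        "apiVersion": "cilium.io/v2",
        "kind": "CiliumNetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Pods this policy applies to.
            "endpointSelector": {"matchLabels": dict(target_labels)},
            # Peers allowed to connect, selected by labels, not IPs.
            "ingress": [{"fromEndpoints": [{"matchLabels": source}]}],
        },
    }


if __name__ == "__main__":
    policy = allow_from(
        name="backend-from-frontend",
        namespace="shop",
        target_labels={"app": "backend"},
        source_labels={"app": "frontend"},
        source_namespace="web",
    )
    print(json.dumps(policy, indent=2))
```

Notice that no IP address or CIDR appears anywhere: the policy keeps working as pods are rescheduled, because it follows the labels.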
eBPF-Powered Security:
eBPF allows Cilium to enforce network policies at the kernel level, providing high-performance and low-latency security mechanisms. This is crucial for maintaining network performance while still ensuring secure communication channels.
Coupling eBPF with Envoy also allows Cilium to perform Layer 7 policy enforcement, inspecting and filtering protocols such as HTTP, gRPC, and Kafka. This adds a layer of security that is typically provided by service meshes.
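Here is a rough sketch of what a Layer 7 rule can look like, again built as a plain Python dictionary. It extends an identity-based `CiliumNetworkPolicy` with an HTTP rule so that only `GET` requests on a given path are allowed through. The function name `allow_http_get` and the labels, port, and path are made up for illustration. The `toPorts` / `rules.http` structure follows the Cilium policy schema as I know it, so treat it as a starting point rather than a reference.

```python
import json


def allow_http_get(name, namespace, target_labels, source_labels, port, path_regex):
    """Return a CiliumNetworkPolicy that only admits HTTP GET on one path."""
    return {
        "apiVersion": "cilium.io/v2",
        "kind": "CiliumNetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "endpointSelector": {"matchLabels": dict(target_labels)},
            "ingress": [
                {
                    "fromEndpoints": [{"matchLabels": dict(source_labels)}],
                    "toPorts": [
                        {
                            # Ports are expressed as strings in the manifest.
                            "ports": [{"port": str(port), "protocol": "TCP"}],
                            # The L7 part: method plus a path regex.
                            "rules": {"http": [{"method": "GET", "path": path_regex}]},
                        }
                    ],
                }
            ],
        },
    }


if __name__ == "__main__":
    policy = allow_http_get(
        name="orders-read-only",
        namespace="shop",
        target_labels={"app": "orders"},
        source_labels={"app": "frontend"},
        port=8080,
        path_regex="/api/orders/.*",
    )
    print(json.dumps(policy, indent=2))
```

With a rule like this in place, a `POST` from the same frontend pod to the same port would be rejected at the proxy, with no sidecar deployed next to the application.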
Integration with External Certificate Management:
For teams that still require TLS certificates for service identity, Cilium can be integrated with external certificate management solutions like cert-manager.
You want observability.
As a networking and security solution for Kubernetes, Cilium offers a range of features that can address many observability needs in a Kubernetes environment. Traditionally, service meshes are used in Kubernetes to provide detailed observability, among other functionalities. However, Cilium's capabilities, particularly in leveraging eBPF, enable it to serve a similar role in many aspects.
I will not present everything here, and there are already lots of good articles about Hubble and Cilium, but briefly, without any application instrumentation, Cilium gives you the following:
Packet-Level Visibility
Application-Level Observability
Security Monitoring
Integration with Monitoring Tools
Distributed Tracing
Service Dependency Analysis
Minimal Overhead
You want routing superpowers.
Rather than requiring you to adopt and integrate new Custom Resource Definitions (CRDs), Cilium configures its Envoy proxy as the data plane, efficiently utilizing existing resources for the control plane. This approach lets you leverage your current ingress, Gateway API, or service annotations. With these tools, you can seamlessly develop header routing and path rewriting strategies, aligning with your existing infrastructure and workflows.
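To show what header routing and path rewriting look like with the Gateway API, here is a small Python sketch that builds an `HTTPRoute`. It matches requests on a path prefix plus an exact header value, strips the prefix, and forwards to a backend Service. The function name `header_route` and all names in the example (gateway, header, service) are hypothetical. I used the `v1beta1` API version here as an assumption; newer Gateway API releases serve `HTTPRoute` as `v1`, and whether your Cilium version implements the `URLRewrite` filter is worth verifying.

```python
import json


def header_route(name, namespace, gateway, header, value, prefix, backend, port):
    """Return an HTTPRoute doing header-based routing with a prefix rewrite."""
    return {
        "apiVersion": "gateway.networking.k8s.io/v1beta1",  # assumed version
        "kind": "HTTPRoute",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # The Gateway this route attaches to.
            "parentRefs": [{"name": gateway}],
            "rules": [
                {
                    # Both conditions must hold: path prefix AND header value.
                    "matches": [
                        {
                            "path": {"type": "PathPrefix", "value": prefix},
                            "headers": [
                                {"type": "Exact", "name": header, "value": value}
                            ],
                        }
                    ],
                    # Replace the matched prefix with "/" before forwarding.
                    "filters": [
                        {
                            "type": "URLRewrite",
                            "urlRewrite": {
                                "path": {
                                    "type": "ReplacePrefixMatch",
                                    "replacePrefixMatch": "/",
                                }
                            },
                        }
                    ],
                    "backendRefs": [{"name": backend, "port": port}],
                }
            ],
        },
    }


if __name__ == "__main__":
    route = header_route(
        name="orders-canary",
        namespace="shop",
        gateway="shop-gateway",
        header="x-canary",
        value="true",
        prefix="/orders",
        backend="orders-v2",
        port=8080,
    )
    print(json.dumps(route, indent=2))
```

The resource is a standard Gateway API object, so nothing here is specific to a mesh vendor: the same manifest should work with any conformant implementation.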
You need the right CNI.
I'm not suggesting that this approach will address every single requirement you have. My aim here is to share some insights based on recent discussions. In many of these conversations, we realized that deploying mutual authentication for Cilium wasn't necessary. We could smoothly transition away from our existing service mesh by thoroughly understanding Cilium's identity model, setting up encryption, and leveraging the Gateway API for control.
Adopting this method can significantly benefit your team by reducing the barriers to entry. Perhaps more crucially, it empowers your platform team with ownership and control. They maintain Cilium’s health, allowing developers to focus on what they do best: coding and deploying with a simple 'git push.'
Music of the week
This week, I was swept away by the enchanting melodies of Ólafur Arnalds as I navigated through work and meetings. I stumbled upon his live performance, which I had previously missed, and it was a delightful discovery. There's an indescribable quality to Icelandic classical music that resonates deeply with me. It evokes a spectrum of emotions, ranging from melancholy to sheer joy. And now, as winter gradually sets in, I can't think of a better musical companion for the season.