Desired State of Microservices Operations

In recent years, life in IT Operations has changed a great deal. It has come a long way from manually running shell scripts to fully automated DevOps processes, supported by containerized cloud environments. People began to realize that instead of drag-and-drop interfaces, everything could be defined as code, primarily through domain-specific languages (DSLs) such as CloudFormation, ARM, or Terraform. This shift gave rise to the term "DevOps", with the implication that, if everything is code, traditional Ops might no longer be needed.

However, developers, accustomed to their familiar programming languages and fast local environments, were often reluctant to adopt unfamiliar DSLs. As a result, Operations teams were pushed to adopt DevOps practices, but with limited success. Today we are seeing a new trend in which general-purpose programming languages are starting to replace DSLs (e.g. Pulumi, AWS CDK, Dagger), and practices like GitOps and Continuous Deployment are becoming mainstream. The automation of infrastructure has also driven the rise of Microservices, while Monoliths have fallen from grace.

Imagine being tasked with transforming an existing on-premises legacy high-volume e-commerce application into a cloud-based microservices architecture (with 50+ services), all while maintaining the same level of agility and enabling continuous delivery and deployment. If each service is placed in its own Git repository and developed by separate teams, the resulting complexity can quickly become overwhelming, potentially paralyzing the entire delivery process—often leading to project failure. So, how can we tackle this challenge? What strategies can we adopt to manage the complexity effectively? And ultimately, what is the Desired State of IT operations built on microservices?

The key idea to manage complexity can be traced back to 2004, when Mark Burgess, a theoretical physicist, introduced the concepts of Desired State and Convergent Operators within Promise Theory. Starting with CFEngine and later evolving through systems like Kubernetes (Borg by Google) and GitOps, this idea has shaped modern infrastructure management.

At its core, the approach emphasizes that a complex system can be constructed from simpler, autonomous agents. Each agent operates independently, making promises that may provide value to others. Together, these agents exhibit complex, emergent behaviors. Burgess was inspired by statistical mechanics, where individual atoms, governed by quantum mechanical principles, collectively give rise to complex Emergent Behaviors.

These autonomous agents can be envisioned as self-sufficient, resilient entities capable of adapting quickly to environmental changes or failures. These resilience patterns, or aspects as developers came to call them, led to the appearance of Service Meshes and Sidecar Proxies in the 2010s. Just as in web development, where popular features introduced by third-party frameworks eventually become part of browser standards, Kubernetes and Service Meshes are evolving in a similar way. This evolution is evident in the emergence of the Gateway API (with constructs like HttpRoute and GatewayClass) and the rise of Sidecarless Meshes. There is also a growing effort to make these agents as environment-agnostic as possible. While achieving this level of abstraction remains a challenge, technologies like WASI (WebAssembly System Interface) offer a promising direction forward.

GitOps principles

  1. Declarative: a system managed by GitOps must have its Desired State expressed declaratively.
  2. Versioned and Immutable: the Desired State is stored in a way that enforces immutability and versioning, and retains a complete version history.
  3. Pulled Automatically: agents automatically pull the Desired State declarations from the source.
  4. Continuously Reconciled: agents continuously observe the actual system state and attempt to apply the Desired State.
Figure: GitOps flow

It's no coincidence that Git and Kubernetes are a perfect fit for GitOps principles. Kubernetes Custom Resource Definitions (CRDs) serve as ideal representations of the Desired State. Git, with its versioned and immutable commits, provides a reliable source of truth for these CRDs. By sourcing CRDs directly from a Git repository, custom Kubernetes controllers can continuously monitor changes in Git, creating or updating CRDs as needed. These controllers then reconcile the cluster state with the Desired State defined in the CRDs, ensuring the system remains consistent. All of this happens automatically, making GitOps a powerful approach for managing infrastructure and applications declaratively.
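As a rough illustration of this control loop, the sketch below shows the shape of such a reconciler in TypeScript. The GitSource and Cluster interfaces are hypothetical stand-ins rather than a real controller framework; production agents like Flux or ArgoCD are far more sophisticated.

```typescript
// Conceptual sketch of a GitOps reconciler; the interfaces below are
// hypothetical placeholders, not a real client or controller API.

interface DesiredState { manifests: Record<string, unknown>[] }

interface GitSource {
  // Returns the desired state at the latest commit of the tracked branch.
  fetchDesiredState(): Promise<{ revision: string; state: DesiredState }>;
}

interface Cluster {
  // Applies a manifest; a no-op when nothing has changed.
  apply(manifest: Record<string, unknown>): Promise<void>;
  // Removes objects that were applied earlier but are no longer desired.
  prune(keep: Record<string, unknown>[]): Promise<void>;
}

async function reconcileForever(git: GitSource, cluster: Cluster, intervalMs = 60_000) {
  for (;;) {
    try {
      // 1. Pull the desired state from Git (principle 3).
      const { revision, state } = await git.fetchDesiredState();
      // 2. Drive the actual cluster state towards it (principle 4).
      for (const manifest of state.manifests) {
        await cluster.apply(manifest);
      }
      await cluster.prune(state.manifests);
      console.log(`reconciled cluster to revision ${revision}`);
    } catch (err) {
      // Failures are retried on the next tick; the loop itself never gives up.
      console.error('reconciliation failed, will retry', err);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```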

In an ideal future, the only responsibility of developers and operators is to update Git repositories. By simply creating a Pull Request, the process of review, quality control, and deployment to production becomes fully automated. This approach promises speed, security, and cost efficiency, allowing bug fixes to reach production in mere minutes. Modern DevOps tools like GitLab, ArgoCD, and Flux are paving the way for this vision.
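For a concrete taste of what ends up in such a repository, here is a minimal cdk8s sketch, assuming Flux is the GitOps agent: it declares a GitRepository source and a Kustomization that keeps a cluster path in sync with it. The repository URL, names, and paths are placeholders, and the snippet assumes the cdk8s packages are installed.

```typescript
import { App, Chart, ApiObject } from 'cdk8s';

const app = new App();
const chart = new Chart(app, 'gitops-bootstrap');

// Where the desired state lives: a versioned, immutable Git branch.
new ApiObject(chart, 'source', {
  apiVersion: 'source.toolkit.fluxcd.io/v1',
  kind: 'GitRepository',
  metadata: { name: 'app-config', namespace: 'flux-system' },
  spec: {
    interval: '1m',
    url: 'https://github.com/example-org/app-config', // placeholder repository
    ref: { branch: 'main' },
  },
});

// What to sync: Flux pulls this path and reconciles it continuously.
new ApiObject(chart, 'sync', {
  apiVersion: 'kustomize.toolkit.fluxcd.io/v1',
  kind: 'Kustomization',
  metadata: { name: 'app', namespace: 'flux-system' },
  spec: {
    interval: '1m',
    path: './deploy/production', // placeholder path inside the repository
    prune: true,
    sourceRef: { kind: 'GitRepository', name: 'app-config' },
  },
});

app.synth(); // writes the manifests to dist/ for committing to Git
```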

At Dgital we have built a custom solution leveraging these concepts and tools, enabling canary deployments across more than 50 services. In practice, these were the challenges we had to solve:

1. Concurrent modifications

Git excels at many tasks, including versioning, merging, and auditing. However, frequent automatic updates can often result in conflicts that require manual resolution. As part of implementing GitOps, we had to store certain states directly in CRDs, utilizing their optimistic locking mechanism to manage state consistency.
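A minimal sketch of that pattern, assuming a thin, hypothetical ResourceClient wrapper around the Kubernetes API rather than any specific client library:

```typescript
// Optimistic-locking retry loop: re-read the object, mutate it, and let the
// API server reject stale resourceVersions with a 409 Conflict.

interface CanaryStatus {
  metadata: { name: string; resourceVersion: string };
  spec: { canaryWeight: number };
}

interface ResourceClient {
  get(name: string): Promise<CanaryStatus>;
  // Assumed to reject with { code: 409 } when resourceVersion is stale.
  replace(obj: CanaryStatus): Promise<CanaryStatus>;
}

async function setCanaryWeight(
  client: ResourceClient,
  name: string,
  weight: number,
  maxRetries = 5,
) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const current = await client.get(name); // read the latest resourceVersion
    current.spec.canaryWeight = weight;     // mutate only the desired field
    try {
      return await client.replace(current); // server rejects stale versions
    } catch (err: any) {
      if (err?.code !== 409) throw err;     // only retry on conflicts
      // Someone else updated the object first; loop, re-read, and retry.
    }
  }
  throw new Error(`could not update ${name} after ${maxRetries} attempts`);
}
```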

2. Daily operations staff often have limited development expertise

Editing Kubernetes YAML files directly is not ideal. We found that using frameworks like CDK8s produces better results, but adopting these tools can be challenging for those in daily operations. To address this, we developed an API to facilitate basic editing functions for these files, such as deploying a version, setting Canary to 40%, rolling back, or finalizing deployments. Additionally, we created a custom, user-friendly dashboard to make these API functions even easier to use.
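To illustrate, a stripped-down cdk8s chart of the kind such an API might edit behind the scenes could look like the sketch below. The CanaryDeployment kind, its API group, and the field names are hypothetical placeholders rather than a published CRD.

```typescript
import { App, Chart, ApiObject } from 'cdk8s';
import { Construct } from 'constructs';

interface ServiceReleaseProps {
  image: string;        // container image for the new version
  canaryWeight: number; // percentage of traffic sent to the new version
}

// One chart per service; an operations API can rewrite props such as
// canaryWeight and commit the regenerated manifest back to Git.
class ServiceRelease extends Chart {
  constructor(scope: Construct, id: string, props: ServiceReleaseProps) {
    super(scope, id);
    new ApiObject(this, 'release', {
      apiVersion: 'deploy.example.dev/v1alpha1', // hypothetical group/version
      kind: 'CanaryDeployment',                  // hypothetical custom resource
      metadata: { name: id },
      spec: {
        image: props.image,
        canary: { weight: props.canaryWeight },
      },
    });
  }
}

const app = new App();
// "Set Canary to 40%" becomes a one-line change made by a PR or by the API.
new ServiceRelease(app, 'checkout-service', {
  image: 'registry.example.dev/checkout:1.42.0', // placeholder image
  canaryWeight: 40,
});
app.synth(); // writes dist/checkout-service.k8s.yaml for the GitOps agent
```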

3. Partial deployments, backward compatibility

Microservices projects often face the challenge of deploying everything at once or requiring developers to manage backward compatibility. These projects are typically stored in one or more monorepos, where handling and testing backward compatibility across multiple services can be extremely difficult, if not impossible.

Monorepo tools like NX, however, can calculate exactly which projects need to be deployed to a given environment. If a tool existed that could "jump" between commits, developers wouldn't need to worry about backward compatibility. Instead, their focus would be solely on maintaining consistency within the repository, which can be ensured through fast, automated tests.
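A hedged sketch of that calculation, shelling out to Nx from TypeScript; the exact CLI command and flags differ between Nx versions, so treat the invocation below as an assumption.

```typescript
// Compute which services changed between the commit currently deployed to an
// environment and the release candidate, using Nx's affected-project graph.
import { execFileSync } from 'node:child_process';

function affectedServices(deployedSha: string, candidateSha: string): string[] {
  const output = execFileSync(
    'npx',
    ['nx', 'show', 'projects', '--affected', `--base=${deployedSha}`, `--head=${candidateSha}`],
    { encoding: 'utf8' },
  );
  // One project name per line; only these need new images and manifests,
  // everything else keeps the version already recorded in the environment's Git state.
  return output.split('\n').map((line) => line.trim()).filter(Boolean);
}

console.log(affectedServices('a1b2c3d', 'HEAD')); // 'a1b2c3d' is a placeholder SHA
```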

The real challenge arises when implementing Canary or Blue-Green deployments. During a release, traffic is split between the old and new versions, and when only a few services are changed, shared services must be able to differentiate between routing to the new and old versions. We were eventually able to solve this issue by using bucketing through Envoy Lua extensions.
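The sketch below is a conceptual TypeScript port of that bucketing logic; in our setup the real implementation lives in an Envoy Lua filter, and the header name and identifiers here are illustrative assumptions.

```typescript
// Deterministic bucketing: hash a stable identifier into [0, 100) so a given
// user always lands on the same side of the canary split, and propagate the
// decision as a header so shared services agree with the edge.
import { createHash } from 'node:crypto';

const CANARY_HEADER = 'x-canary'; // assumed header name, forwarded downstream

// Map a stable identifier (user ID, session ID) to a bucket in [0, 100).
function bucketOf(stableId: string): number {
  const digest = createHash('sha256').update(stableId).digest();
  return digest.readUInt32BE(0) % 100;
}

// Decide once at the edge; honour an upstream decision if one was forwarded.
function routeDecision(
  stableId: string,
  canaryPercent: number,
  incoming?: string,
): 'canary' | 'stable' {
  if (incoming === 'canary' || incoming === 'stable') return incoming;
  return bucketOf(stableId) < canaryPercent ? 'canary' : 'stable';
}

// Example: with a 40% canary, this user consistently hits the same version.
const decision = routeDecision('user-1234', 40);
console.log(decision, { [CANARY_HEADER]: decision });
```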

4. Fault tolerance, unaligned services

Multi-service deployments are often not atomic and can result in user-facing failures that are difficult to diagnose, especially when caused by temporarily unaligned or unavailable services. To manage this complexity, teams typically rely on service meshes like Istio, Linkerd, or Cilium. However, Docker instances can consume significant resources, and service mesh sidecars add even more overhead. While Linkerd offers small and fast sidecars, they are not customizable, whereas Istio provides customization but with sidecars comparable in size to service containers. Fortunately, Linux eBPF capabilities have enabled sidecarless service meshes, with Istio's Ambient mode serving as a prime example of this approach.

Summary

It took several months to set up a working solution that met our original goals. Along the way, we experimented with and discarded many concepts. While this architecture may seem intimidating at first, and developers may initially hesitate to adopt it, the transparency and flexibility it provides in managing infrastructure make it extremely powerful: it can lead to real DevOps, and Ops can transform into Platform Engineers.

As a general guideline, avoid using microservices unless they're absolutely necessary. However, if you can't avoid them, investing in DevOps is essential.

Links:

GitOps: https://medium.com/weaveworks/gitops-operations-by-pull-request-14e8b659b058
GitOps principles: https://opengitops.dev/
NX: https://nx.dev/
Kubernetes Custom Resources: https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/
AWS CDK: https://aws.amazon.com/cdk/
Cdk8s: https://cdk8s.io/
Flux: https://fluxcd.io/
Istio Ambient mode: https://istio.io/latest/docs/ambient/overview/