To understand this article and the underlying concepts, let me briefly explain the most important cornerstones of our development environment and infrastructure.
Environment
We are currently using a single-stage Kubernetes environment – called stable – to deploy all of our applications. We follow a microservice approach, so every application might depend on other services to run properly. So far, this is nothing special.
Runtime
We use Google Cloud to provide Kubernetes (GKE). As you can imagine, we also want to use other managed services apart from GKE to make full use of what the cloud offers. Otherwise, we would have to provide services like databases, Redis, or even AI services ourselves within Kubernetes. This automatically raises the question of how to provide, say, a database alongside a normal Kubernetes deployment.
Config Connector and Config Sync
Our solution for this problem is Google Config Connector. In a nutshell, it allows us to create not only Kubernetes-native resources like deployments, ingresses, or services, but also GCP-native resources like databases, Redis instances, or even DNS names. It’s important to note that the resource description is still provided as a Kubernetes resource in the form of a *.yaml file. The final part of our setup is Config Sync, another operator, which is capable of checking out a git repository (our deployment repo) and applying all contained resources to the cluster.
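To give a feeling for what this looks like, here is a minimal sketch of a Config Connector resource describing a Cloud SQL instance. All names and values are made up for illustration:

apiVersion: sql.cnrm.cloud.google.com/v1beta1
kind: SQLInstance
metadata:
  name: example-db          # hypothetical instance name
  namespace: example-app    # hypothetical namespace
spec:
  databaseVersion: POSTGRES_14
  region: europe-west1
  settings:
    tier: db-custom-1-3840

Config Connector picks up this object and provisions the corresponding Cloud SQL instance in GCP, just like the cluster turns a Deployment descriptor into running pods.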
Dynamic environments
Finally, a single-stage approach comes with some obvious drawbacks. For instance, problems might arise when:
- An application needs a different version of a dependent service than the deployed one.
- An application is running tests and impacts other services.
- A developer wants to test some behavior without modifying the data structure.
As a solution, we have come up with a way to deploy any application in an ephemeral sub-environment where other required services are available as well. This is done by essentially copying all Kubernetes descriptors of the required services, along with the main application, to a new namespace. After the tests have concluded, the whole namespace is deleted and cleaned up again. This approach works so well that we started to use it for essentially every new branch. Here is a rough outline of the steps executed from creating a branch of an application until it is merged to main:
- A developer creates a branch of application A and pushes a commit. A build is triggered for the new version.
- Once the build succeeds, a deployment of the new version to a dynamic environment is triggered, along with services B, C, and D. The generated *.yaml files are pushed to the deployment repository.
- Config Sync picks up the new *.yaml files and applies them to the cluster. This is also called reconciliation, because the desired state of the cluster is reconciled with the actual state.
- Once the dynamic environment is running properly, the integration tests are started.
- After the tests are finished, the developer merges their branch to main. This triggers a teardown of the dynamic environment by essentially deleting the previously generated files.
- Again, the reconciler picks up the changes in the deployment repo and deletes the corresponding Kubernetes resources.
Problem description
With this out of the way, it is finally time to describe the problems that come with this concept.
Resource maximum
Since every resource applied to the cluster lives in the deployment repository, we tested whether there is an upper limit on the number of resources that can be applied without problems, and indeed there is one. This came as a surprise, as we had assumed that Google’s Config Connector had been proven in heavyweight applications.
The limit is around 5,000 resources in total and stems from the way the reconciler stores information about the state of these resources. Per root-sync object (which is responsible for handling one git repository/branch), a Kubernetes object of type resourcegroup is created:
root@kube:~$ kubectl get resourcegroups.kpt.dev -A
NAMESPACE                   NAME        RECONCILING   STALLED   AGE
config-management-system    root-sync   False         False     21d
root@kube:~$
This resourcegroup contains relatively mundane state information about the applied resources:
root@kube:~$ kubectl get resourcegroups.kpt.dev -A -o yaml
...
    - actuation: Succeeded
      group: ""
      kind: Service
      name: artemis
      namespace: artemis
      reconcile: Succeeded
      sourceHash: 545082d
      status: Current
      strategy: Apply
...
Because this information is stored in one single object, it is subject to etcd’s object size limit, which translates to the upper limit of roughly 5,000 resources mentioned above. This limit cannot be raised by GKE users because etcd is part of the managed control plane, to which customers have no access.
Congestion
Another limitation of a single reconciler is congestion. During operation, the reconciler checks the deployment repository at regular intervals of 15 seconds. If a new commit is registered, the changes are applied to the cluster, often including new databases, multiple deployments, and CPU-intensive startup tasks. If another commit is pushed immediately after the first one, its resources are delayed until the first run has either finished or timed out; only then are the new changes taken into account. In actual numbers, one reconciler run can take up to five minutes, for instance if a new node has to be spawned by the autoscaler. This time is then added to the deployment time of the current application.
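For reference, the 15-second interval is, to our knowledge, simply the default polling period on the root-sync’s git spec and could be tuned there (shown here as an isolated fragment):

spec:
  git:
    period: 15s   # how often the reconciler polls the deployment repository

A shorter period does not remove the congestion itself, though: the runs still execute strictly one after another.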
Timeouts
The procedure described in the previous section happens even on the happy path. Sometimes, however, a new commit/deployment may contain an error. It could be an erroneous image name that can’t be pulled, a missing permission, or any other problem preventing the deployment from reaching a ready state. In this case, the reconciler has to wait up to the reconcile timeout (5 minutes by default) before it gives up and continues with applying the remaining resources. To make matters worse, this timeout can occur in every wait group.
A wait group reflects dependencies declared in an application. An example is a deployment that only starts when the database and the volume claim are ready. The volume claim waits for the persistent volume, and the persistent volume, in turn, waits for the GCP disk to be ready. This dependency graph is automatically created by the reconciler and sorted into different wait groups, which are applied one after another. A timeout can happen in each one of them, potentially delaying the reconciler further. As a reference, our typical applications result in eight to ten wait groups.
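These dependencies are declared on the resources themselves via the depends-on annotation, which is, to our knowledge, what the reconciler uses to build the wait groups. A minimal sketch with made-up names, where a deployment waits for a Config Connector database instance:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app        # hypothetical application
  namespace: example-app
  annotations:
    # apply this Deployment only once the referenced resource is ready;
    # format: <group>/namespaces/<namespace>/<kind>/<name>
    config.kubernetes.io/depends-on: sql.cnrm.cloud.google.com/namespaces/example-app/SQLInstance/example-db
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: europe-docker.pkg.dev/example-project/images/app:latest   # placeholder image

Each level of such dependencies adds another wait group that has to become ready before the next one is applied.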
Errors
The final problem is also the worst. Sometimes a genuinely bogus resource is committed, for example one that changes a field that is immutable. Such a resource has the potential to completely stop this and every subsequent reconciler run, effectively leaving the repository broken. Until the resource is fixed, no changes to the cluster via the reconciler are possible anymore. For these cases we have special alerting in place, and we try to avoid them as best we can because they have a high impact on the development process, where every test depends on these dynamic environments.
The Solution: Independent reconcilers
The solution to all of the above problems is simple in theory, but the devil is in the details. The basic idea is to create a dedicated reconciler, at the very least one for every dynamic environment. Here are the problems we faced and how we overcame them.
Chicken-and-egg problem
Obviously, everything needs to be automated and must function without human interaction. Since the root-sync objects, which in turn create the reconcilers, are usually applied manually, we had to devise a way to avoid a state in which a reconciler would have to deploy its own root-sync object.
We decided to have the original root-sync object only consider one particular directory of our deployment repo. Everything else can be extended dynamically, but this root-sync is still applied manually. We use the directory /infrastructure/ for this purpose. In it, we describe the other static root-sync objects and their permissions – in our case, one for all ordinary applications and one for the cascaded root-sync objects of the dynamic deployments.
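As an illustration, the manually applied bootstrap root-sync could look roughly like the following sketch; the repository URL and the auth method are placeholders:

apiVersion: configsync.gke.io/v1beta1
kind: RootSync
metadata:
  name: root-sync
  namespace: config-management-system
spec:
  sourceFormat: unstructured
  git:
    repo: https://example.com/deployment-repo.git   # placeholder URL
    branch: main
    dir: /infrastructure/    # only this directory is considered by this reconciler
    auth: none               # placeholder

Everything underneath /infrastructure/ – including the other root-sync objects – is then managed like any other resource in the repository.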
Workload identity
To apply “external” cloud resources as well as Kubernetes ones, we also have to pair each dynamic root-sync with its own workload identity binding. It basically tells the GCP permission system that the newly created Kubernetes service account is allowed to do the same things the main reconciler is allowed to do. This is no real problem, but it can lead to a few seconds of delay until the permissions take effect.
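In practice this is just another Config Connector resource that we generate alongside each dynamic root-sync. A rough sketch, where the project, the GCP service account, and the Kubernetes service account name are all placeholders:

apiVersion: iam.cnrm.cloud.google.com/v1beta1
kind: IAMPolicyMember
metadata:
  name: a-dynamic-app-workload-identity   # hypothetical name
  namespace: example-project              # hypothetical namespace
spec:
  role: roles/iam.workloadIdentityUser
  resourceRef:
    apiVersion: iam.cnrm.cloud.google.com/v1beta1
    kind: IAMServiceAccount
    external: projects/example-project/serviceAccounts/config-sync@example-project.iam.gserviceaccount.com
  # the Kubernetes service account of the dynamic reconciler (name is illustrative)
  member: serviceAccount:example-project.svc.id.goog[config-management-system/root-reconciler-a-dynamic-app]

Once this binding is in place, the dynamic reconciler can act on GCP resources with the same permissions as the main one.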
Dependencies
One major drawback of split reconcilers is that they do not know about each other. Therefore, it is impossible to have dependencies on resources applied by other reconcilers. This restriction ultimately led us to the structure we currently have. The following listing shows the directory structure of our deployment repo with regard to all root-sync objects, with one dynamic environment deployed:
├── dynamic-environments
│   ├── a_dynamic_app
│   │   └── applications
│   │       └── a_dynamic_app
│   │           ├── cloudstorage
│   │           ├── configs
│   │           │   ├── common
│   │           │   └── stable
│   │           ├── db-plugin
│   │           └── ingress
│   └── reconcilers
│       └── a_dynamic_application
│           └── root-sync.yaml   # points to dynamic-environments/a_dynamic_app
└── infrastructure
    └── reconcilers
        ├── applications
        │   └── root-sync.yaml   # points to /applications/
        └── dynamic-environments
            └── root-sync.yaml   # points to /dynamic-environments/reconcilers/
Some dependencies we simply had to remove. One example was the dependency of application-generated database users (in the applications reconciler) on the database instance itself (in the infrastructure reconciler).
Cleanup/deletion defender
By design, root-syncs are not meant to be dynamic; they are static and mostly never change. As a result, deleting a root-sync usually does two things:
- It leaves the deployed resources untouched on the cluster.
- It also leaves additional resources behind, one example being the resourcegroup mentioned earlier.
For our use case, this is quite bad as it clutters the cluster after enough churn. Fortunately, the root-sync object comes with an annotation to solve this issue:
apiVersion: configsync.gke.io/v1beta1
kind: RootSync
metadata:
  name: root-sync-test
  namespace: config-management-system
  annotations:
    configsync.gke.io/deletion-propagation-policy: Foreground
This annotation tells Config Sync to clean up all managed resources before the root-sync itself is removed. We only need to make sure that the workload identity which allows this deletion is properly declared as a dependency, so it is only deleted once all resources have been cleaned up.
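One way to express this ordering is, again, the depends-on annotation: the root-sync declares a dependency on its workload-identity binding, so – as far as we understand the applier – it is applied after the binding and torn down before it. A sketch with the same placeholder names as above:

apiVersion: configsync.gke.io/v1beta1
kind: RootSync
metadata:
  name: a-dynamic-app                  # hypothetical dynamic reconciler
  namespace: config-management-system
  annotations:
    configsync.gke.io/deletion-propagation-policy: Foreground
    # make sure the IAM binding outlives this root-sync during teardown
    config.kubernetes.io/depends-on: iam.cnrm.cloud.google.com/namespaces/example-project/IAMPolicyMember/a-dynamic-app-workload-identity
spec:
  git:
    repo: https://example.com/deployment-repo.git   # placeholder URL
    branch: main
    dir: /dynamic-environments/a_dynamic_app/
    auth: none                                       # placeholder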
Conclusion
With these changes, we were able to tremendously improve deployment times for our dynamic environments, which we use heavily in our development process.
Furthermore, we improved the resilience of our development clusters and lifted the pressure on our developers to always commit 100% working infrastructure. This in turn allows for a more agile development process.
Overall, we observe a speedup of 2-10 minutes, or around half the deployment time, compared to our previous system.
