
Cloud infrastructures come with a multitude of advantages compared to on-premise hardware. Apart from added flexibility, resilience and agility when designed correctly, cloud infrastructure also offers the potential to optimize costs based on actual usage, shared resources and free-tier options. In this article we will show you how we achieved a 66% cost reduction compared to a previously implemented cloud solution.
Background
Before we go into detail about cost-saving measures, we need to discuss the core technologies that facilitate our approach. Undoubtedly, the most important one is GitOps. It essentially allows us to describe cloud infrastructure in a Git repository. Changes are automatically applied by a so-called reconciler, keeping the provisioned cloud resources in sync with the Git-based description. More specifically, we use GitLab for our repositories and Google Config Connector for applying changes; we describe both from a developer’s point of view in [1]. Several other GitOps implementations exist for other cloud providers. Regardless, they all aim to introduce a high level of automation compared to manually created infrastructures.
Secondly, it is important to note that we migrated our previous Amazon Web Services (AWS) infrastructure almost exclusively to the Google Cloud Platform (GCP). One reason was that the above-mentioned Config Connector only works with Google infrastructure. Furthermore, hybrid cloud solutions are not optimal from a cost perspective for several reasons, for instance cross-cloud traffic, fewer opportunities to share resources, and duplicate base infrastructure such as networks.
Kubernetes optimizations
All services implemented by viesure use a containerized approach and are now running in a Kubernetes environment. We use one Kubernetes cluster per stage (currently stable and unstable). Each stage, as well as other clusters (infrastructure, testing, sandboxes, etc.), is encapsulated in its own Google project. This is especially important because Google provides free-tier limits for a number of services, and these limits apply per project. We try to stay within them wherever possible. With that in mind, here are some pointers on how to decrease Kubernetes-related costs.
Autoscaling
An obvious method for controlling Kubernetes compute engine costs is to make heavy use of (node) autoscaling. This only makes sense in a cloud environment, where nodes can be spawned and killed on short notice and are billed only when used. On AWS-based Kubernetes clusters (EKS, Elastic Kubernetes Service), the autoscaler can be tweaked and managed by the user, while GCP-based clusters (GKE, Google Kubernetes Engine) provide it as a transparent service that merely allows choosing between a Balanced and an Optimize Utilization profile. The general idea is simple: the autoscaler compares the allocated resources within a cluster to the available resources, which is the sum over all currently available nodes. If more than a node’s worth of resources (e.g. memory and CPU) is dormant, a node can be shut down; if not enough resources are available, a new node must be spawned. However, there are certain factors to consider regarding the effectiveness of the autoscaler.
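As a rough sketch, this is how the autoscaling bounds of a single node pool can be declared with Config Connector; the names, region and limits below are placeholders rather than our actual configuration:

```yaml
# Hypothetical GKE node pool with per-pool autoscaling enabled.
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  name: general-purpose          # placeholder name
spec:
  location: europe-west3         # placeholder region
  clusterRef:
    name: dev-cluster            # placeholder cluster reference
  autoscaling:
    minNodeCount: 0              # the pool may scale down to zero nodes
    maxNodeCount: 6              # upper bound keeps worst-case costs predictable
  nodeConfig:
    machineType: e2-standard-2
```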
Pool number and pool size
A cluster can have multiple pools of nodes with different characteristics (e.g. high memory, high CPU, larger instances, etc.). We found that autoscaling effectiveness suffers immensely when multiple classes of nodes are involved. Our infrastructure cluster, for instance, consists of four different pools:
| Instance Type | CPU/Instance | Memory/Instance | Purpose |
|---------------|--------------|-----------------|---------|
| e2-highmem-2 | 2 Cores | 16 GB | Artifacts, DB |
| e2-medium | 1 Core | 4 GB | Essential infra |
| e2-standard-2 | 2 Cores | 8 GB | General Purpose |
| e2-standard-8 | 8 Cores | 32 GB | Gitlab runners |
If left unchecked, workloads get distributed evenly and often in an undesired way. An example would be a high-memory machine getting populated by ordinary workloads; when a high-memory workload then needs to be scheduled, a new node has to be spawned, resulting in an overprovisioned and thus more expensive cluster. This problem can easily be solved by introducing node taints and labels, and assigning taint tolerations and node affinity to the workloads that should run on non-general-purpose pools.
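The sketch below illustrates the idea for a hypothetical high-memory pool: the pool is tainted and labelled via Config Connector, and the workload declares a matching toleration and node affinity (all names and images are placeholders):

```yaml
# Hypothetical high-memory pool: tainted and labelled so only workloads
# that explicitly opt in are scheduled here.
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  name: highmem                  # placeholder name
spec:
  location: europe-west3         # placeholder region
  clusterRef:
    name: infra-cluster          # placeholder cluster reference
  nodeConfig:
    machineType: e2-highmem-2
    labels:
      pool: highmem              # becomes a node label inside the cluster
    taint:
      - key: pool
        value: highmem
        effect: NO_SCHEDULE
---
# Workload that opts in to the high-memory pool via toleration and node affinity.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: artifact-store           # placeholder workload
spec:
  selector:
    matchLabels:
      app: artifact-store
  template:
    metadata:
      labels:
        app: artifact-store
    spec:
      tolerations:
        - key: pool
          operator: Equal
          value: highmem
          effect: NoSchedule
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: pool
                    operator: In
                    values: ["highmem"]
      containers:
        - name: artifact-store
          image: registry.example.com/artifact-store:latest  # placeholder image
```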
Allowing pod eviction
One thing to keep in mind when utilizing the autoscaler is that it can only move workloads that are either running redundantly (2+ replicas) or marked with the safe-to-evict annotation:
annotations:
  cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
Node size
Finally, the size of a node also plays a role. Smaller nodes can be filled better because they host fewer workloads; however, they also produce the most overhead. As an example, the e2-medium instances from the table above expose 2.94 GB of their 4.11 GB of memory for workloads, that is 71.5%. The rest is used by the node’s operating system and the Kubernetes system components. The same is true for CPU: out of the 0.940 cores available on the machine, the daemon sets (workloads spawned on each node) already occupy 0.335 cores. Larger machines have less of this overhead, but if the cluster is just 0.1 cores short of resources, a new node must still be spawned, potentially resulting in sparsely populated large nodes. For us, the machine types shown in the table above represent the best trade-off.
Preemptible nodes / Spot instances
An extremely easy way to save money on compute resources is to create the instances within a pool as Spot instances (formerly called preemptible in GCP). These are machines that can be terminated automatically after some time, but save up to 60% of the compute engine costs. Especially in Kubernetes environments with a functioning autoscaler, this is a very attractive option to reduce a company’s spending. Plus, it comes with the added benefit of implicitly testing applications for restart resilience.
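In Config Connector this boils down to a single flag on the node pool’s nodeConfig; the fragment below extends the pool sketch from above (depending on the Config Connector version, a dedicated spot field may be available in addition to the older preemptible flag):

```yaml
# Fragment of a ContainerNodePool spec: run the pool on preemptible/spot capacity.
spec:
  nodeConfig:
    machineType: e2-standard-2
    preemptible: true   # billed at the reduced preemptible/spot rate
```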
We almost exclusively use preemptible machines in all our clusters. The only exception is our GitLab runners: when they restart unexpectedly, they lose their connection to currently executing builds. Therefore, the GitLab runners are scheduled on very small regular nodes (the e2-medium machines, to be specific).
Cluster shutdown
Another huge part of our savings approach is a scheduled cluster shutdown during the night and on weekends. As mentioned before, our infrastructure is 100% GitOps-driven. As a result, deleting all nodes of our development clusters is just a matter of running a GitLab pipeline. For our late workers, we additionally provide the possibility to scale the cluster back up during the night if desired.
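A heavily simplified sketch of what such a pipeline could look like is shown below; the schedule variable, scripts and helper are hypothetical, and in reality the change is committed to the GitOps repository so the reconciler applies it:

```yaml
# Illustrative GitLab CI jobs for scheduled scale-down and scale-up.
stages:
  - scale

scale-down:
  stage: scale
  rules:
    - if: '$SCALE_DIRECTION == "down"'   # set by a pipeline schedule at 22:00
  script:
    - ./scripts/set-pool-size.sh 0       # hypothetical helper that patches the pool manifests
    - git commit -am "Scale down development clusters" && git push

scale-up:
  stage: scale
  rules:
    - if: '$SCALE_DIRECTION == "up"'     # set by a pipeline schedule at 05:00, or triggered manually
  script:
    - ./scripts/set-pool-size.sh 3       # hypothetical helper
    - git commit -am "Scale up development clusters" && git push
```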
An interesting difference between EKS and GKE is how they handle the control plane. An example is the autoscaler pod, which is responsible for scaling nodes. On GKE, it is an intrinsic part of the cluster and contained within its transparent control plane. On EKS, it is an ordinary workload and therefore subject to eviction and termination. As a result, we sometimes ended up with a stuck cluster because the autoscaler was not running or was itself waiting for resources.
Previously, GKE even had free-tier billing for the control plane itself, resulting in zero cluster costs when no nodes were scheduled. Unfortunately, this was exploited by using clusters as a free datastore (by encoding images in Kubernetes secrets, for instance). Consequently, Google had to introduce a baseline cost for the cluster itself.
To give you an idea of the proportions of our cluster shutdown, here are the cluster costs from one of our test stages on a Friday, compared to a Saturday:
The blue section shows the cluster costs, while red and yellow are the compute engine costs during the day. On workdays, our clusters start up at 05:00 in the morning and shut down at 22:00. The only additional costs are caused by the load balancer in front of the cluster, marked in green.
Requested service resources
The last thing to discuss with respect to Kubernetes is the allocated service resources. They ultimately decide how many nodes are necessary to run the cluster’s workloads. What sounds intuitive at first glance often turns out to be a very controversial topic. First, let us define how workload resources are assigned.
CPU
Workload CPU can be allocated in two ways: requested CPU and CPU limits, usually given in cores or millicores. The amount of requested CPU decides whether a workload can be scheduled on a given node, while the limit restricts a workload from using more than the specified amount. Since we are dealing mostly with development environments, we tend to keep the requested CPU very low (around 10-100 millicores) and set a limit of around 1-2 cores. This has the positive effect that nodes (which have around 2-4 cores in total) can fit multiple workloads, which are normally idle, while still leaving enough leeway for short bursts, for instance during service startup. However, it also comes with a disadvantage. In cases where lots of new workloads are spawned (e.g. test runs with multiple services), the cluster starts out in a state where it is already quite full, assuming the autoscaler did its work. To fit the new workloads, a new node is spawned, which then contains all of them. Now all services start up at the same time on one single new node, quickly saturating it with CPU load, while all other nodes in the cluster are running idle. We cope with this problem quite satisfactorily by providing large enough nodes, but there will always be a trade-off between performance and utilization.
Memory
The situation is even more complicated for memory. For requested memory and limits, the same rules as for CPU apply. However, setting a higher limit than what is requested quickly leads to terminated workloads, because unlike CPU, memory cannot be overcommitted. When more CPU is used than is actually available, processes simply get throttled. In contrast, the maximum truly available memory is bounded by the node’s memory: if it is exceeded, the node’s operating system must terminate a process, resulting in seemingly random workload terminations that are hard to debug. Therefore, we set the same memory request and limit for our services to avoid this issue as much as possible. One can imagine, however, that, especially for Java services, the amount of memory a service actually needs is not obvious and can differ hugely between development and production. It is ultimately the programmer’s responsibility to decide this.
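Putting both together, the resource block of a typical development container could look like the following; the concrete numbers are illustrative rather than our production values:

```yaml
# Illustrative container resources: low CPU request, generous CPU limit for
# startup bursts, and identical memory request and limit to avoid overcommit.
resources:
  requests:
    cpu: 50m          # enough for a mostly idle service
    memory: 768Mi
  limits:
    cpu: "2"          # headroom for startup and short bursts
    memory: 768Mi     # equal to the request, so the node never overcommits memory
```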
In general, resource allocation in autoscaled Kubernetes clusters is a tough problem and justifies a whole article dedicated to it.
Other optimizations
Free tiers
As already mentioned in the introduction, Google provides a lot of services with a free tier. The current prices can be found in [2]. These limits are mostly project-based, which means they are available in every project and contribute to lower costs when multiple projects are used. We actively use and try to stay within the free tiers of the following services:
- Cloud Logging: we barely exceed the free-tier limit, and only in one project, making this service virtually free
- Pub/Sub (publish-subscribe messaging)
- Cloud Monitoring
- Cloud Run
Networking
When dealing with networks, there is one important credo: “Stay local!”. The reasoning is very simple: as long as traffic stays within local networks, it can be handled internally by the cloud provider. As soon as there is egress traffic, the traffic leaves the provider’s network and is billed.
We use a single, centrally managed network (VPC) that is shared between all projects. This way, even cross-cluster traffic runs over internal (LAN-like) connections, which are virtually free.
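A minimal sketch of such a centrally managed network, expressed as Config Connector resources with placeholder names, region and ranges:

```yaml
# Hypothetical shared VPC in the central network project.
apiVersion: compute.cnrm.cloud.google.com/v1beta1
kind: ComputeNetwork
metadata:
  name: central-vpc               # placeholder name
spec:
  autoCreateSubnetworks: false    # subnets are managed explicitly
  routingMode: REGIONAL
---
apiVersion: compute.cnrm.cloud.google.com/v1beta1
kind: ComputeSubnetwork
metadata:
  name: dev-subnet                # placeholder name
spec:
  region: europe-west3            # placeholder region
  ipCidrRange: 10.10.0.0/20       # placeholder range
  networkRef:
    name: central-vpc
```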
On a side note, egress traffic is also the reason why multi-cloud setups do not pay off for us: the cost of a site-to-site tunnel plus the incurred traffic cost is simply too high.
Storage
When it comes to persistent storage, the price progression is quite linear: faster storage is proportionally more expensive than slower storage. It boils down to correctly estimating the required storage speed and provisioning the resources accordingly. One tiny bit of optimization is more of a bug than anything else: while the minimum size of a disk created in the web frontend is 10 GB, Config Connector allows the creation of smaller disks, down to 1 GB.
An advantage of GCP-based storage disks is the backup strategy. GCP uses incremental snapshots, and since most services do not rewrite the whole disk regularly, snapshots of large volumes are often small compared to the disk size. As a result, it is affordable to create hourly snapshots, because they do not add up to a much larger total than daily ones. Likewise, the retention time can be extended without a proportional impact on the price, since Google only bills the storage actually used. This is a huge advantage compared to AWS, where disk snapshots are full copies of the whole disk.
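As a sketch, a 1 GB disk and an hourly snapshot schedule could be declared like this with Config Connector; names, zone and retention are placeholders, and the attachment of the policy to the disk is omitted:

```yaml
# Hypothetical 1 GB disk, below the 10 GB minimum enforced by the web console.
apiVersion: compute.cnrm.cloud.google.com/v1beta1
kind: ComputeDisk
metadata:
  name: tiny-disk                 # placeholder name
spec:
  zone: europe-west3-a            # placeholder zone
  size: 1                         # size in GB
---
# Hypothetical snapshot schedule: hourly snapshots, retained for 14 days.
apiVersion: compute.cnrm.cloud.google.com/v1beta1
kind: ComputeResourcePolicy
metadata:
  name: hourly-snapshots          # placeholder name
spec:
  region: europe-west3            # placeholder region
  snapshotSchedulePolicy:
    schedule:
      hourlySchedule:
        hoursInCycle: 1
        startTime: "00:00"
    retentionPolicy:
      maxRetentionDays: 14
```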
Shared services
Finally, we try to use single instances and share resources where possible. An example is our Postgres database: we run only one instance per project, and databases and users are automatically added or deleted as necessary. This also has the advantage that all services profit simultaneously from a potential switch to a higher tier. Again, it comes with a slight drawback, in this case backup and restore: to restore a single Postgres database, the whole Cloud SQL instance needs to be restored.
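A sketch of how one shared instance plus per-service databases and users can be described with Config Connector; all names, the region and the tier are placeholders:

```yaml
# Hypothetical shared Cloud SQL instance for one project.
apiVersion: sql.cnrm.cloud.google.com/v1beta1
kind: SQLInstance
metadata:
  name: shared-postgres           # placeholder name
spec:
  databaseVersion: POSTGRES_14
  region: europe-west3            # placeholder region
  settings:
    tier: db-custom-1-3840        # one tier shared by all services
---
# One database per service, created on the shared instance.
apiVersion: sql.cnrm.cloud.google.com/v1beta1
kind: SQLDatabase
metadata:
  name: orders-service            # placeholder database name
spec:
  instanceRef:
    name: shared-postgres
---
# One user per service, with the password taken from a Kubernetes secret.
apiVersion: sql.cnrm.cloud.google.com/v1beta1
kind: SQLUser
metadata:
  name: orders-service-user       # placeholder user name
spec:
  instanceRef:
    name: shared-postgres
  password:
    valueFrom:
      secretKeyRef:
        name: orders-service-db   # placeholder secret
        key: password
```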
Future Ideas
To satisfy our inner nickel nurser, we have also come up with some ideas on how to decrease costs even further.
Since all of our workloads are containerized, we thought about migrating them to Cloud Run. For those unfamiliar, Cloud Run is a Google-provided service for hosting containers. The major difference to Kubernetes is that these containers are only billed when used, more specifically when they handle incoming requests, and most of our containers are sitting idle anyway. Recently, Google introduced the possibility to attach such a service to a shared VPC, which is a hard requirement for our setup.
However, to successfully use this approach, we would first need to figure out some caveats such as persistence, ingress and service discovery. Still, this would be an intriguing way to run an environment.
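To make the idea more concrete, a Cloud Run service declared as a Knative serving manifest could look roughly like this; the service name, image and VPC connector are placeholders:

```yaml
# Illustrative Cloud Run service (Knative serving manifest).
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: idle-api                                      # placeholder service name
spec:
  template:
    metadata:
      annotations:
        # Route traffic into the shared VPC via a Serverless VPC Access connector.
        run.googleapis.com/vpc-access-connector: shared-vpc-connector  # placeholder connector
        autoscaling.knative.dev/maxScale: "3"          # cap scale-out
    spec:
      containers:
        - image: registry.example.com/idle-api:latest  # placeholder image
```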
Conclusion
By implementing the optimizations described above, we were able to cut our cloud costs by around 60% in total. The most important factors are the Kubernetes optimizations, free-tier usage and shared cloud resources. We achieved this with minimal inconvenience. Another positive side effect is the reduced carbon footprint that comes along with the reduction in wasted resources.