Optimizing our Build Times with GitLab on AWS

Quick build times are essential for fast feedback. Whether you want to deploy your version to a test environment, fix a bug (again) or just close your ticket: until your code has gone through your CI/CD pipeline, it’s not done. Tests have not been run, your metrics have not been pushed to SonarQube, and so on. You also do not want to wait too long, because you stick to your principles of small changes and frequent deployments. There are various ways of decreasing your build times, especially if you use Maven and Docker. This blog post focuses on which EC2 instance type to choose if you want to improve your build times.

We compared the build time of one of our projects on these AWS EC2 instance types:

Instance type	vCPUs	RAM (GB)	Network (Gbit/s)
m4.2xlarge	8	32	n/a
m5.large	2	8	10
m5d.2xlarge	8	32	10
m5d.24xlarge	96	384	25
t2.medium	2	4	n/a
t2.2xlarge	8	32	n/a
t3.medium	2	4	5
t2.large	2	8	n/a
t3.2xlarge	8	32	5

Please keep in mind that different instance families use different CPU types (e.g. Intel, AMD), which may impact performance, and that for some instances values such as network bandwidth are not available. Our build is not optimized to run in parallel or to make use of a huge amount of resources. We therefore assume that our build times will not decrease much with additional CPU power. Nevertheless, other steps, such as dependency downloads, may benefit from factors like higher network bandwidth. But we are engineers, so let’s test it in practice.

We have chosen different instance families (e.g. fixed-performance M instances vs. burstable T instances), and within these families we have chosen two performance levels: a low-end one for our builds and the largest one that is still reasonable. Just for fun, we added an m5d.24xlarge to see what’s possible.

Everything else was kept the same. The project was built several times on each instance type to rule out fluctuations caused by e.g. internet traffic peaks or data center load peaks. In practice, the build times did not vary by more than a couple of seconds between builds on the same instance type, so the results should be fairly reliable.

We currently run about 50 builds per day with only one build per server; if needed, we just spin up additional instances for building. The test results may be very different for your usage, e.g. with more builds or concurrent builds per server.

Our GitLab server is connected to one GitLab Runner, which runs on a dedicated t2.nano instance. Its only purpose is to spin up worker nodes on which our builds and deployments take place.

Shown below is our .gitlab-ci.yml. It is not an example; it is shown as we use it in production, minus our credentials. We mainly build Java multi-module projects with Maven that are deployed as Docker containers, and static code analysis with SonarQube is integrated into every build. Finally, the resulting artifact is pushed to our Docker registry and deployed on our AWS ECS cluster.

image: docker:latest

cache:
    paths:
        - maven.repository/

variables:
    DOCKER_DRIVER: overlay2
    AWS_DEFAULT_REGION: eu-central-1
    MAVEN_OPTS: -Dmaven.repo.local=maven.repository -Dsonar.branch.name=$CI_COMMIT_REF_NAME
    ARTIFACT_TYPE: docker
    ARTIFACT_DELIVERY_BUCKET: <>
    DOCKER_REGISTRY: <>

stages:
    - build
    - sonar
    - package
    - apidoc

.extract-task-param: &extract-task-param |
    export TASK_PARAM="$(jq -r 'if (.taskDefinition.executionRoleArn | length) > 0 then "--execution-role-arn \(.taskDefinition.executionRoleArn)" else "" end +
        if (.taskDefinition.networkMode | length) > 0 then " --network-mode \(.taskDefinition.networkMode)" else "" end +
        if (.taskDefinition.volumes | length) > 0 then " --volumes " + ("\(.taskDefinition.volumes)" | "'"'"'\(.)'"'"'" ) else "" end +
        if (.taskDefinition.placementConstraints | length) > 0 then " --placement-constraints \(.taskDefinition.placementConstraints)" else "" end +
        if (.taskDefinition.cpu | length) > 0 then " --cpu \(.taskDefinition.cpu)" else "" end +
        if (.taskDefinition.memory | length) > 0 then " --memory \(.taskDefinition.memory)" else "" end +
        if (.taskDefinition.requiresCompatibilities | length) > 0 then " --requires-compatibilities \(.taskDefinition.requiresCompatibilities)" else "" end' < current.json)"

mvn-build:
    stage: build
    tags: [java-test]
    image: openjdk:stretch
    script:
        - adduser --disabled-password --gecos "" user1
        - chown -R user1:user1 $(pwd)
        - su user1 -c './mvnw install'
    artifacts:
        paths:
            - target
            - '*/target'

sonar:
    stage: sonar
    tags: [java-test]
    dependencies:
        - mvn-build
    image: openjdk:stretch
    script:
        - './mvnw sonar:sonar'

build-docker-image:
    stage: package
    image: docker:latest
    tags: [java-test]
    dependencies:
        - mvn-build
    script:
        - apk update
        - apk add jq python3
        - pip3 install awscli
        - $(aws ecr get-login --no-include-email --region $AWS_DEFAULT_REGION)
        - cp starter/target/*.jar target/
        - cd target/aws
        - cp ../*.jar .
        - VERSION=$(cat version)
        - ARTIFACT_ID=$(cat artifactId)
        - docker build -t $ARTIFACT_ID:latest .
        - docker tag $ARTIFACT_ID:latest <>/$ARTIFACT_ID:$CI_PIPELINE_IID-$CI_COMMIT_SHA
        - docker tag $ARTIFACT_ID:latest <>/$ARTIFACT_ID:latest
        - docker push <>/$ARTIFACT_ID:$CI_PIPELINE_IID-$CI_COMMIT_SHA
        - docker push <>/$ARTIFACT_ID:latest
        - aws ecs describe-task-definition --task-definition $ARTIFACT_ID  > current.json
        - jq '.taskDefinition.containerDefinitions' current.json > containerDef.json
        - jq -s '[.[0][0] * .[1][0]]' containerDef.json imagedefinitions.json > mergedContainerDef.json
        - *extract-task-param
        - eval "aws ecs register-task-definition --family ${ARTIFACT_ID} ${TASK_PARAM} --container-definitions file://mergedContainerDef.json"
        - aws ecs update-service --cluster skynet --service $ARTIFACT_ID --force-new-deployment --task-definition $ARTIFACT_ID
    artifacts:
        paths:
            - target/aws/
    only:
        - master
        - development

upload-swagger-doc:
    stage: apidoc
    image: docker:latest
    tags: [java-test]
    dependencies:
        - mvn-build
    script:
        - apk update
        - apk add python3
        - pip3 install awscli
        - ARTIFACT_ID=$(cat target/aws/artifactId)
        - "./aws/export_swagger.sh ${ARTIFACT_ID}"
    only:
        - development

The Results

The figure below shows the build times normalized to the fastest instance, which was the m5d.2xlarge. A value of e.g. 118% means that the instance was 18% slower than the fastest one.
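The normalization itself is simple arithmetic. The following Python sketch shows one way to compute it; the instance names and build times in it are made-up placeholders for illustration, not our measurements.

```python
def normalize_to_fastest(build_seconds):
    """Express every build time as a percentage of the fastest one."""
    fastest = min(build_seconds.values())
    return {name: round(100.0 * seconds / fastest, 1)
            for name, seconds in build_seconds.items()}


# Hypothetical build times in seconds (placeholders, NOT measured values).
example_times = {"instance-a": 300.0, "instance-b": 354.0, "instance-c": 330.0}

normalized = normalize_to_fastest(example_times)
for name, percent in normalized.items():
    print(f"{name}: {percent}%")
```

With these placeholder numbers, the fastest instance ends up at 100% and the 354-second build at 118%, i.e. 18% slower.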

The results are closer than we expected, with a maximum decrease in build time of around 20%. At first, this seems like a nice performance gain, but let’s compare it to the costs. The table below shows the costs per hour normalized to the fastest instance, the m5d.2xlarge. The costs increase by a huge margin and are, of course, not linearly related to the reduction in build times.

Instance type	Costs normalized to m5d.2xlarge
m4.2xlarge	86.64 %
m5.large	20.76 %
m5d.2xlarge	100.00 %
m5d.24xlarge	1,178.34 %
t2.medium	9.68 %
t2.2xlarge	77.40 %
t3.medium	8.66 %
t2.large	19.35 %
t3.2xlarge	69.31 %
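To make the spread easier to read, the same figures can be re-expressed relative to the cheapest instance instead of the fastest one. The Python sketch below does this with the percentages from the table; the variable names are our own choice for illustration.

```python
# Hourly costs in percent of m5d.2xlarge, copied from the table above.
cost_pct = {
    "m4.2xlarge": 86.64,
    "m5.large": 20.76,
    "m5d.2xlarge": 100.00,
    "m5d.24xlarge": 1178.34,
    "t2.medium": 9.68,
    "t2.2xlarge": 77.40,
    "t3.medium": 8.66,
    "t2.large": 19.35,
    "t3.2xlarge": 69.31,
}

# Pick the cheapest instance and express every other price as a multiple of it.
cheapest = min(cost_pct, key=cost_pct.get)
cost_factor = {name: round(pct / cost_pct[cheapest], 1)
               for name, pct in cost_pct.items()}

for name, factor in sorted(cost_factor.items(), key=lambda item: item[1]):
    print(f"{name:13s} {factor:6.1f}x the hourly cost of {cheapest}")
```

Read this way, the fastest instance (m5d.2xlarge) costs roughly 11.5 times as much per hour as the cheapest one (t3.medium), and the m5d.24xlarge roughly 136 times as much.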

As shown, a performance improvement of less than 20% is not very satisfying compared to the costs. In this setup, a one percent decrease in build time would cost us five times the money. Let’s try something different:

Enable caching

GitLab supports caching between builds and even between runners (see https://docs.gitlab.com/runner/configuration/advanced-configuration.html). We enabled the shared cache on an S3 bucket. This takes only a couple of lines in the config.toml of your runner:

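# Type selects the S3 backend; Shared = true lets several runners use the same cache
[runners.cache]
      Type = "s3"
      Shared = true
[runners.cache.s3]
      ServerAddress = "s3.amazonaws.com"
      AccessKey = ""
      SecretKey = ""
      BucketName = "your-bucket-name"
      BucketLocation = "eu-central-1"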

After restarting our runner to apply the new config, we ran a couple of tests; the results are shown in the figure below.

There are significant performance improvements, with up to 75% reduction in build time for certain use cases, e.g. rebuilding/redeploying a project. The more interesting use case, a small code change, also shows a reduction of about 25%. Compared to a larger instance, the costs of caching a couple of hundred MB of data in S3 are negligible.

Please keep in mind that the actual build times may vary depending on what you have changed in your commit and what has to be built. In our example, we changed a Java source file to trigger a new build.
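To get a feeling for what these reductions could mean over a day, here is a small Python sketch. The 75% and 25% reductions and the 50 builds per day are the figures quoted in this post; the 10-minute baseline build time is purely an assumption for illustration, and the 75% case is an upper bound rather than a typical value.

```python
def cached_build_time(baseline_seconds, reduction_pct):
    """Build time after applying a percentage reduction to the baseline."""
    return baseline_seconds * (1 - reduction_pct / 100.0)


BASELINE_SECONDS = 600.0  # ASSUMPTION: hypothetical uncached build time
BUILDS_PER_DAY = 50       # from this post: about 50 builds per day

for label, reduction in (("rebuild/redeploy (upper bound)", 75),
                         ("small code change", 25)):
    per_build = cached_build_time(BASELINE_SECONDS, reduction)
    saved_minutes = (BASELINE_SECONDS - per_build) * BUILDS_PER_DAY / 60
    print(f"{label}: {per_build:.0f} s per build, "
          f"{saved_minutes:.0f} min saved per day")
```

Under these assumptions, even the more modest 25% case saves about two hours of build time per day if every build benefits from the cache.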

Summary

Building and deploying projects depends on different resources such as I/O, CPU and network bandwidth, as our results show. For our use case, increasing these resources only improves build times by up to 20%, with the huge downside that costs differ by a factor of more than 100 between the cheapest and the most expensive instance. Our current setup just does not take advantage of the additional resources. There are features like Maven parallel builds that may use this capacity more efficiently, but that’s out of scope for this post.

Focusing on other steps of the build, e.g. enabling caching or providing pre-built Docker images, has far more impact on build times. We will stick with the cheapest possible instance, as it has the best price/performance ratio and, combined with these other measures, is sufficient to get reasonable build and deployment times. During working hours, our instances keep running to avoid cold starts and are automatically scaled out (horizontally); after working hours they are shut down. This gives us the best result in terms of performance and costs so far. #elasticInfrastructureFTW