One of the convenient features of Docker's image builder is that it caches each build step of an image. It then uses this cache each time you rebuild the image unless something has changed in this step or the previous steps.
So if you are, let's say, building a Python project using this Dockerfile:
```dockerfile
FROM python:3.8-alpine

WORKDIR /app

# Install native dependencies
RUN apk add --no-cache build-base linux-headers libffi-dev

# Install python dependencies
COPY ./requirements.txt .
RUN pip install -r requirements.txt

# Copy the rest of the project
COPY . .
```
You won't have to reinstall native dependencies each time you build an update, and you won't have to download and compile Python packages each time you change a file in your project. These steps will only rerun when you change the list of dependencies you use.
This optimization can be very helpful because some of the dependencies can take a long time to compile. If you build images frequently, caching significantly speeds up your building process.
...that is, unless you build in an ephemeral environment like Gitlab CI.
Docker's build cache resides on the disk of the system it's running on. Each GitLab CI job gets a fresh, clean environment that does not preserve anything written to its disk, which pretty much defeats the purpose of the build cache: each job starts building your images from scratch, with no access to the cache of previous jobs.
Fortunately, there is a workaround.
Adding cache support to your Gitlab CI jobs
Since Docker 1.13 there has been an option to use another image as a cache source. This image is specified with the --cache-from flag of the docker build command.
For example, if you're building an image tagged myimage:v2, you can use the previous version of the image as a cache:
```shell
docker build . --cache-from myimage:v1 -t myimage:v2
```
You can list several images, too, and you can even reference an image with the same tag as the one you're building. Better still, if one of these images does not exist, the build will not fail, which is very useful when you're building an image for the first time.
```shell
docker build . \
  --cache-from myimage:v1 \
  --cache-from myimage:v2 \
  --cache-from myimage:latest \
  -t myimage:v2
```
To make this feature work in GitLab CI, you have to pull one of the latest versions of the image you're building. Here is an example of a job that uses the GitLab Container Registry for storing the images (I used myimage as the image name for brevity):
```yaml
build:
  stage: build
  before_script:
    - docker info
    # Log in to GitLab's registry
    - docker login -u gitlab-ci-token -p $CI_JOB_TOKEN registry.gitlab.com
  script:
    # The "|| true" part makes the shell ignore the error if the pulled image does not exist yet.
    - docker pull myimage:latest || true
    # It's handy to use the pipeline's ID as an image version.
    - docker build . -t myimage:pipeline$CI_PIPELINE_ID -t myimage:latest --cache-from myimage:latest
    - docker push myimage:pipeline$CI_PIPELINE_ID
    - docker push myimage:latest
```
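A quick note on the `|| true` part: GitLab CI treats any script line that exits with a non-zero status as a job failure, so a `docker pull` of a not-yet-existing image would abort the whole job. Appending `|| true` turns that failure into a success. A minimal sketch of the idiom, using `false` to stand in for the failing pull:

```shell
set -e                 # fail-fast mode, similar to how CI runners treat each script line
false || true          # "false" stands in for a failing "docker pull";
                       # "|| true" swallows the failure
STATUS=$?              # exit status of the whole "false || true" command is 0
echo "job continues, status=$STATUS"
```

Without the `|| true`, the `false` line would have terminated the script immediately under `set -e`.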
This little feature can hopefully speed up your CI jobs.
Note that sometimes Docker will rebuild your image from scratch even though you have a cache. This happens when the base image has been updated. It shouldn't be a problem; this is just a heads-up in case you're wondering why the cache is ignored from time to time.
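Another reason the cache can appear to be ignored: if your runner builds with BuildKit (enabled via `DOCKER_BUILDKIT=1`, and the default in recent Docker releases), images don't embed cache metadata by default, so --cache-from silently finds nothing to reuse. The usual fix is to build with the `BUILDKIT_INLINE_CACHE=1` build argument. A hedged sketch of the adjusted script (same job as above, only the build line changes):

```yaml
script:
  - docker pull myimage:latest || true
  # BUILDKIT_INLINE_CACHE=1 writes cache metadata into the image so that
  # later BuildKit builds can consume it via --cache-from.
  - docker build . --build-arg BUILDKIT_INLINE_CACHE=1 -t myimage:latest --cache-from myimage:latest
  - docker push myimage:latest
```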
Caching multi-stage builds
If you're going to use this feature for multi-stage builds, you will have to slightly modify the building process.
An image built through multiple stages can only serve as a cache for the last stage of the build. And even then, as of the time of this post, Docker won't actually use it unless it also has access to the caches of the previous stages. Without them, it rebuilds the earlier stages, which makes the last stage's cache unusable (because the preceding build steps have changed).
The trick is to build each stage separately, store it in your registry, and pass the image of every stage as a cache source.
Let's say you're building a python project which compiles some static files using Node.js in a separate stage:
```dockerfile
# The first stage builds static files into the /app/static/dist directory.
FROM node:8 AS node
WORKDIR /app
COPY ./package.json .
COPY ./package-lock.json .
RUN npm ci
COPY ./static ./static
RUN npm run build

# The second stage builds the python project and copies the static files
# from the previous stage.
FROM python:3.8-alpine
WORKDIR /app
COPY . .
COPY --from=node /app/static/dist ./static/dist
```
The first stage is called "node". You can build this stage on its own using the --target flag:
```yaml
script:
  # Pull the older version of the "node" stage if it's available
  - docker pull myimage-node:latest || true
  # Build only the "node" stage
  - docker build . --target node -t myimage-node:pipeline$CI_PIPELINE_ID -t myimage-node:latest --cache-from myimage-node:latest
  - docker push myimage-node:pipeline$CI_PIPELINE_ID
  - docker push myimage-node:latest
```
Then you build the final stage, using both images as a cache:
```yaml
script:
  # ...
  # Pull the older version of the complete image (i.e. the final stage)
  - docker pull myimage:latest || true
  - docker build . -t myimage:pipeline$CI_PIPELINE_ID -t myimage:latest --cache-from myimage-node:latest --cache-from myimage:latest
  - docker push myimage:pipeline$CI_PIPELINE_ID
  - docker push myimage:latest
```
The complete CI job script looks like this:
```yaml
script:
  # Build the first stage
  - docker pull myimage-node:latest || true
  - docker build . --target node -t myimage-node:pipeline$CI_PIPELINE_ID -t myimage-node:latest --cache-from myimage-node:latest
  - docker push myimage-node:pipeline$CI_PIPELINE_ID
  - docker push myimage-node:latest
  # Build the final stage
  - docker pull myimage:latest || true
  - docker build . -t myimage:pipeline$CI_PIPELINE_ID -t myimage:latest --cache-from myimage-node:latest --cache-from myimage:latest
  - docker push myimage:pipeline$CI_PIPELINE_ID
  - docker push myimage:latest
```
This setup makes Docker cache each stage of the multi-stage build.
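If your Dockerfile grows more named stages, the same pull/build/push pattern can be generated with a small loop. Below is a dry-run sketch (my own illustration, not from a real pipeline) that only prints the docker commands it would run, assuming the intermediate stage names are listed in build order and `myimage` is the final image:

```shell
#!/bin/sh
# Dry-run sketch: prints the docker commands instead of executing them,
# so the pattern can be inspected without a Docker daemon.
STAGES="node"                # intermediate stage names, in build order
CACHE_ARGS=""                # accumulates --cache-from flags for later stages
for STAGE in $STAGES; do
  echo "docker pull myimage-$STAGE:latest || true"
  CACHE_ARGS="$CACHE_ARGS --cache-from myimage-$STAGE:latest"
  echo "docker build . --target $STAGE -t myimage-$STAGE:latest$CACHE_ARGS"
  echo "docker push myimage-$STAGE:latest"
done
# The final (unnamed) stage uses every earlier stage plus its own last build as cache.
FINAL="docker build . -t myimage:latest$CACHE_ARGS --cache-from myimage:latest"
echo "docker pull myimage:latest || true"
echo "$FINAL"
echo "docker push myimage:latest"
```

Replacing the `echo`s with the actual commands (and adding the pipeline-ID tags from the job above) yields the same script as the two-step version, but it scales to any number of stages.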
When I was setting up GitLab CI for one of my Django projects, I had to figure out how to solve these problems too, and I couldn't find a single guide covering every step of the process. So I've collected what I learned in one post; hopefully this little note can be of use to somebody else troubled by the same problem.
The cover photo is by Simon Berger on Unsplash.