I wrote this post while learning how Docker works under the hood. My learning goal was to run a Docker image without Docker.

tl;dr: Surprisingly, Docker is not magic. Docker uses Linux cgroups, namespaces, overlayfs and other Linux mechanisms. Below I try to use those mechanisms by hand.

To reproduce the learning steps, clone the no-docker git repo, follow the post, and run the scripts as you go. I used Debian running in VirtualBox. Start by running 00-prepare.sh to install all the dependencies and build a small tool in Go that we will use for experimenting.

00-prepare.sh

#!/bin/bash
set -eux
sudo apt-get install -y git golang jq curl psmisc
curl -O https://raw.githubusercontent.com/moby/moby/master/contrib/download-frozen-image-v2.sh
chmod a+x download-frozen-image-v2.sh
go build -o tool tool.go

Docker image

Let’s download and un-archive the busybox image by running 10-busybox-image.sh. You can see that a Docker image is just a set of nested tar archives:

10-busybox-image.sh

#!/bin/bash

set -eux
set -o pipefail

./download-frozen-image-v2.sh ./image-busybox/ busybox:latest
mkdir -p image-busybox-layer
find image-busybox -name layer.tar | xargs -n1 tar -C image-busybox-layer -xf

The downloaded image directory looks like this:

$ tree image-busybox
image-busybox
|-- a01835d83d8f65e3722493f08f053490451c39bf69ab477b50777b059579198f.json
|-- b906f5815465b0f9bf3760245ce063df516c5e8c99cdd9fdc4ee981a06842872
|   |-- json
|   |-- layer.tar
|   `-- VERSION
|-- manifest.json
`-- repositories
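manifest.json ties these files together. Since jq was installed by 00-prepare.sh, you can peek at it (the digests on your machine will differ from the ones above):

```shell
# manifest.json is an array with one entry per image; each entry names
# the config json, the repo tags, and the layer tarballs in order.
jq '.[0] | {Config, RepoTags, Layers}' image-busybox/manifest.json
```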

layer.tar is a file tree with busybox tooling:

image-busybox-layer/
|-- bin
(...)
|   |-- less
|   |-- link
|   |-- linux32
|   |-- linux64
|   |-- linuxrc
|   |-- ln
(...)
|-- etc
|   |-- group
(...)

namespace magic

Linux namespaces create a separate “view” of Linux resources, so that one process can see the resources differently than other processes. The resources can be PIDs, file system mount points, the network stack, and others. You can see all the current namespaces with lsns. Let’s see how isolating and nesting PIDs looks in practice with the PID namespace.
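Each namespace a process belongs to is exposed as a symlink under /proc/&lt;pid&gt;/ns/; two processes are in the same namespace exactly when the link targets (the inode numbers in brackets) match. A quick check that needs no root:

```shell
# List the namespaces of the current shell; each entry is a "type:[inode]" link.
ls -l /proc/self/ns/
# Compare a single namespace between two processes:
readlink /proc/self/ns/pid   # e.g. pid:[4026531836]
readlink /proc/$$/ns/pid     # the shell itself: same inode, same PID namespace
```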

The unshare system call (and the command-line tool of the same name) allows running a process in separate namespaces. Run 20-unshare.sh to fork a shell from busybox with a separate PID namespace and a separate file system root.

20-unshare.sh

#!/bin/bash

set -eux

cd image-busybox-layer
mkdir -p proc

sudo unshare --mount-proc \
    --fork \
    --pid \
    --cgroup \
    --root=$PWD \
    bin/sh

Have a look around. You will see that the root directory of the forked process is restricted (“jailed”) to the directory we specified when forking the shell. Now run the tool and see how the same process looks from the “inside” and “outside” of the forked shell. First copy the tool to image-busybox-layer/, then run the tool from the forked shell:

# Run from the forked shell.  It does nothing but sleep.

./tool -hang hello &

Restricting the directory tree of a process to a subdirectory is done with chroot. You can check the actual root directory of a process by inspecting /proc/*/root:

# Run this from the parent (outside) shell

dev@debian:~/no-docker$ find  /proc/$(pidof tool) -name root -type l 2>/dev/null | sudo xargs -n1 ls -l
lrwxrwxrwx 1 root root 0 Aug 27 22:03 /proc/1985/task/1985/root -> /home/dev/no-docker/image-busybox-layer
(...)
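For comparison, a process that is not jailed points at the real root; you can check this for your own shell without root:

```shell
# An un-chrooted process reports / as its root.
readlink /proc/self/root    # prints /
```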

You can also see how the PID namespaces work. The tool in the parent shell and in the forked shell have separate PID numbers. Also, the parent shell sees the processes run in the forked shell, but not vice-versa.

# from the forked shell
/ # ps aux | grep '[t]ool'
    7 root      0:00 ./tool -hang hello
# from the parent shell
dev@debian:~$ ps aux | grep '[t]ool'
root       464  0.0  0.2 795136  2724 pts/1    Sl   10:16   0:00 ./tool -hang hello

cgroups, limiting resources

While namespaces isolate resources, cgroups (control groups) put limits on those resources. You can find the control group of our hanging tool with the following, run from the parent shell:

dev@debian:~$ cat /proc/$(pidof tool)/cgroup
0::/user.slice/user-1000.slice/session-92.scope

Let’s now use cgroups to see how we can cap memory of the forked shell.

First, run the tool with the -mb option to make it allocate n MB of memory:

# kill the previous tool if it still runs
killall -9 tool
./tool -mb 200

Find the file controlling the maximum memory of the tool process:

find /sys/fs/cgroup/ | grep $( cat /proc/$(pidof tool)/cgroup | cut -d/ -f 2-) | grep memory.max
/sys/fs/cgroup/user.slice/user-1000.slice/session-92.scope/memory.max

“/sys/fs/cgroup” is the mount point of the cgroup file system:

mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)

“memory.max” is the memory hard limit of the memory controller. Exceeding the hard limit triggers the OOM killer when memory usage cannot be reduced (more about that in a while).
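Instead of reusing the session scope, you can also create a dedicated control group just for the experiment. A sketch, assuming cgroup v2 mounted at /sys/fs/cgroup and root privileges (the group name no-docker-demo is arbitrary):

```shell
# Create a fresh cgroup, cap its memory, and move the current shell into it.
sudo mkdir /sys/fs/cgroup/no-docker-demo
echo 100m | sudo tee /sys/fs/cgroup/no-docker-demo/memory.max
echo $$   | sudo tee /sys/fs/cgroup/no-docker-demo/cgroup.procs
```

Processes started from that shell inherit the group, so everything it forks shares the 100MB cap.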

Let’s set a 100MB limit:

sudo sh -c 'echo 100m > /sys/fs/cgroup/user.slice/user-1000.slice/session-92.scope/memory.max'

You will notice that the tool process… was not killed. How come? If you inspect the memory.events file, you will see that the “max” entry increments.

cat /sys/fs/cgroup/user.slice/user-1000.slice/session-92.scope/memory.events

low 0
high 0
max 3534 << this changes when you run over the max limit
oom 0
oom_kill 0

The process was not killed because the OS swapped out the excess memory. Check cat /proc/swaps; print it several times to see how it changes:

dev@debian:~/no-docker$ while [ 1 ]; do cat /proc/swaps; sleep 2; done
Filename				Type		Size		Used		Priority
/dev/sda5                               partition	998396		2372		-2
Filename				Type		Size		Used		Priority
/dev/sda5                               partition	998396		2372		-2

# here I run the tool, you can see how the memory is swapped

Filename				Type		Size		Used		Priority
/dev/sda5                               partition	998396		103860		-2
Filename				Type		Size		Used		Priority
/dev/sda5                               partition	998396		121540		-2
Filename				Type		Size		Used		Priority
/dev/sda5                               partition	998396		116604		-2

If you turn swapping off with swapoff, the tool will be OOM-killed.

sudo swapoff -a
2022/09/10 06:32:38 heap 0 mb, sys 218 mb
2022/09/10 06:32:39 allocate 200MB of memory
Killed
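swapoff -a disables swap for the whole machine. cgroup v2 also offers a per-group knob, memory.swap.max: setting it to 0 forbids swapping for that group only, so the allocation runs into memory.max directly (the session path is the one found earlier; yours will differ):

```shell
# Forbid swap just for this cgroup; the rest of the system keeps using swap.
sudo sh -c 'echo 0 > /sys/fs/cgroup/user.slice/user-1000.slice/session-92.scope/memory.swap.max'
```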

overlayfs

The last thing I looked at is the overlay file system, which underlies image layers in Docker. Overlayfs logically merges several directories into a single mount point: read-only lower layers with a writable upper layer on top. You can try overlayfs with the following:

40-overlayfs.sh

#!/bin/bash

set -eux  

sudo mkdir -p /upper /lower /work /merged
sudo chmod 777 /upper /lower /work /merged
echo 'upper foo' > /upper/foo
echo 'upper bar' > /upper/bar
echo 'lower bar' > /lower/bar
echo 'lower quux' > /lower/quux
sudo mount -t overlay overlay -olowerdir=/lower,upperdir=/upper,workdir=/work /merged 

See how the /merged directory holds the content of both the upper and lower directories, where the upper “wins” when both contain a file with the same name:

dev@debian:~/no-docker$ tail -n+1 /merged/*
==> /merged/bar <==
upper bar

==> /merged/foo <==
upper foo

==> /merged/quux <==
lower quux

Worth noting that workdir is an internal directory used by overlayfs to prepare files before moving them into the upper layer in a single atomic operation.
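Overlayfs is also copy-on-write: writing through /merged to a file that exists only in the lower layer first copies it up into /upper. A follow-up experiment, run after 40-overlayfs.sh while the mount is still in place:

```shell
echo 'changed quux' > /merged/quux
cat /upper/quux    # the file was copied up into the upper layer
cat /lower/quux    # the lower layer still holds the original content
```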

Conclusion

Docker itself is not magic; the mechanisms of the kernel are the magic, and you can easily explore those mechanisms yourself. One important part I didn’t cover here is the network namespace.
