Under the hood of Docker
We have completed the initial warm-up level and have gained some familiarity with Docker at a basic level. Now, it's time to take a deeper dive into Docker and explore the underlying technologies that power it.
Recap
Let's take a look at a commonly used diagram that illustrates the differences between virtualization technology and Docker.
Building upon our previous hands-on experience, we observed that container functionality bears a striking resemblance to that of a virtual machine (VM). Try to deploy an Ubuntu container again:
docker run -it ubuntu
The deployment and execution of Ubuntu were remarkably fast!
Run uname -a locally and in the container. Compare the outputs!
We can observe that the shell prompt has changed to root@<id>, indicating that it is no longer the local prompt of your system. We can check all the running processes. There is only one running process, which is the current bash with PID=1.
ps aux
Run the same command locally. Which command is running with PID=1?
Check the network interfaces:
ip a
The ip command is not installed by default, so run apt update and then apt install iproute2.
There are only two network interfaces: lo (loopback) and eth#. You can see that the eth# IP range is different from your local host's IP range!
As a final observation, let's install the Nginx web server and run it as a Linux service:
apt update
apt install nginx -y
service nginx start
and check the service status:
service nginx status
Did you notice that we did not use the systemctl command to run the Nginx service?
If we deploy an Ubuntu VM on VirtualBox (or any other virtualization platform), we would have almost the same functionality as the above Ubuntu container. At a VERY high level, containers can be considered lightweight VMs (Docker experts are shouting now).
Low-level approach
If you check the hints in the above tests, you can see that a Docker container uses the host kernel and doesn't need an init process as PID 1. So, although a Docker container's functionality is similar to a VM's, a container is just a process on its host.
Run docker ps and find the Ubuntu container ID. Now, on your local machine, run ps aux | grep <container_id>.
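Another way to see the same thing from the host side is with Docker's own tooling; docker top and the State.Pid field of docker inspect are standard features:
# Show the container's processes as they appear on the host
docker top <container_id>
# Print the host PID of the container's main process
docker inspect -f '{{.State.Pid}}' <container_id>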
Docker utilizes several Linux functionalities under the hood. In the following section, we will briefly walk through some of those.
Capabilities
For the purpose of performing permission checks, traditional UNIX implementations distinguish two categories of processes: privileged processes (whose effective user ID is 0, referred to as superuser or root), and unprivileged processes (whose effective UID is nonzero). Privileged processes bypass all kernel permission checks, while unprivileged processes are subject to full permission checking based on the process's credentials (usually: effective UID, effective GID, and supplementary group list).
Starting with kernel 2.2, Linux divides the privileges traditionally associated with superuser into distinct units, known as capabilities, which can be independently enabled and disabled. Capabilities are a per-thread attribute. (Ref: Linux man page)
Capabilities apply to both files and threads. File capabilities allow users to execute programs with higher privileges. This is similar to the way the setuid bit works. Thread capabilities keep track of the current state of capabilities in running programs.
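For example, on the host, file capabilities can be inspected and granted with getcap and setcap from the libcap tools; the binary path below is just a hypothetical illustration:
# Allow a (hypothetical) web server binary to bind to ports below 1024 without running as root
sudo setcap cap_net_bind_service=+ep /usr/local/bin/mywebserver
# Show the file capabilities now attached to the binary
getcap /usr/local/bin/mywebserver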
In an environment without file based capabilities, it’s not possible for applications to escalate their privileges beyond the bounding set (a set beyond which capabilities cannot grow). Docker sets the bounding set before starting a container. You can use Docker commands to add or remove capabilities to or from the bounding set. By default, Docker drops all capabilities except [those needed] using a whitelist approach. Learn more about Docker and Linux capabilities here.
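As a small sketch of that whitelist approach (the exact capability bitmaps you see will vary with your Docker version), you can compare the default capability set of a container's main process with one where a capability has been dropped; --cap-drop and --cap-add are standard docker run flags:
# Default capability set of PID 1 inside a container
docker run --rm ubuntu grep Cap /proc/1/status
# The same check with NET_RAW dropped: the CapEff/CapBnd masks shrink
docker run --rm --cap-drop NET_RAW ubuntu grep Cap /proc/1/status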
Cgroups
Control groups, usually referred to as cgroups, are a Linux kernel feature which allow processes to be organized into hierarchical groups whose usage of various types of resources can then be limited and monitored. The kernel's cgroup interface is provided through a pseudo-filesystem called cgroupfs. Grouping is implemented in the core cgroup kernel code, while resource tracking and limits are implemented in a set of per-resource-type subsystems (memory, CPU, and so on).
A cgroup is a collection of processes that are bound to a set of limits or parameters defined via the cgroup filesystem. A subsystem is a kernel component that modifies the behavior of the processes in a cgroup. Various subsystems have been implemented, making it possible to do things such as limiting the amount of CPU time and memory available to a cgroup, accounting for the CPU time used by a cgroup, and freezing and resuming execution of the processes in a cgroup. Subsystems are sometimes also known as resource controllers (or simply, controllers).
The cgroups for a controller are arranged in a hierarchy. This hierarchy is defined by creating, removing, and renaming subdirectories within the cgroup filesystem. At each level of the hierarchy, attributes (e.g., limits) can be defined. The limits, control, and accounting provided by cgroups generally have effect throughout the subhierarchy underneath the cgroup where the attributes are defined. Thus, for example, the limits placed on a cgroup at a higher level in the hierarchy cannot be exceeded by descendant cgroups. (Ref: Linux man page)
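Here is a hedged sketch of how Docker's resource flags map onto those cgroup limits; --memory is a standard docker run flag, while the /sys/fs/cgroup path below assumes a cgroups v2 host with the systemd cgroup driver, so adjust it for your setup:
# Start a container with a 256 MiB memory limit enforced by cgroups
docker run -d --name limited --memory 256m nginx
# On a typical cgroups v2 + systemd host, the limit shows up here (path is an assumption)
cat /sys/fs/cgroup/system.slice/docker-$(docker inspect -f '{{.Id}}' limited).scope/memory.max
You can also find a running container's cgroup from the process side: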
docker ps
ps aux | grep <container-id>
cat /proc/[pid]/cgroup
For each cgroup hierarchy of which the process is a member, there is one entry containing three colon-separated fields:
hierarchy-ID:controller-list:cgroup-path
- hierarchy-ID: For the cgroups version 2 hierarchy, this field contains the value 0.
- controller-list: For the cgroups version 2 hierarchy, this field is empty.
- cgroup-path: This field contains the pathname of the control group in the hierarchy to which the process belongs. This pathname is relative to the mount point of the hierarchy.
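For example, on a cgroups v2 host with the systemd cgroup driver, the entry for a containerized process often looks something like the following (the exact path is an assumption and depends on your distribution and cgroup driver):
0::/system.slice/docker-<container-id>.scope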
docker info | grep containerd
For more details about Linux cgroups, try:
systemd-cgls
systemd-cgtop
docker stats
Namespaces
Namespaces are a feature of the Linux kernel that partitions kernel resources such that one set of processes sees one set of resources while another set of processes sees a different set of resources.
As of kernel version 5.6, there are eight different types of namespaces. Regardless of type, the functionality of namespaces remains consistent: each process is linked to a specific namespace, restricting its access to only the resources associated with that particular namespace and any relevant descendant namespaces. As a result, every process or group of processes can have a distinct view of the available resources. The specific resource that is isolated depends on the type of namespace created for a particular process group. The list below shows the different types of namespaces (a quick way to inspect them in practice follows the list):
- Mount (mnt)
- Process ID (pid)
- Network (net)
- Inter-process Communication (ipc)
- UNIX Time-Sharing (UTS)
- User ID (user)
- Control group (cgroup)
- Time (time)
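A quick way to inspect namespaces in practice (a small sketch; lsns and the /proc/<pid>/ns symlinks are standard Linux facilities, and the ubuntu image is just an example workload):
# List the namespaces that exist on the host and the processes using them
lsns
# Every process has namespace handles under /proc; compare your shell's with a container's
ls -l /proc/$$/ns
docker run --rm ubuntu ls -l /proc/1/ns
The inode numbers in the symlink targets differ, which shows that the container's process lives in its own set of namespaces.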
To see a namespace example, clone the NextGenBTS project (https://github.com/meraj-kashi/NextGenBTS) and navigate to dev/namespace/network. Build and run the project:
gcc -o network_namespace network_namespace.c
./network_namespace
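If you just want a quick look at network-namespace isolation without building the C program, a rough shell equivalent (run as root) looks like this:
# Start a shell in a brand-new network namespace
unshare --net bash
# Inside, only a loopback interface (in DOWN state) exists; no host interfaces are visible
ip a
exit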
Seccomp
Seccomp, short for Secure Computing Mode, is a Linux kernel feature that restricts the system calls a process is allowed to make, providing a more secure environment for running Docker. More information here.
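Docker applies a default seccomp profile to containers and lets you supply your own via --security-opt; here is a small sketch, where the profile path is hypothetical:
# Seccomp: 2 in the output means the process is running under a seccomp filter
docker run --rm ubuntu grep Seccomp /proc/1/status
# Run a container with a custom seccomp profile (path is just an example)
docker run --rm --security-opt seccomp=/path/to/profile.json ubuntu true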
AppArmor
AppArmor is a security module integrated into the Linux kernel, which enables system administrators to impose restrictions on the capabilities of programs through per-program profiles. These profiles define what actions a program is allowed to perform, such as granting network access, raw socket access, and read, write, or execute permissions for specific file paths. Check this blog for more information.
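As a brief sketch (this assumes an AppArmor-enabled host such as Ubuntu, where Docker loads the docker-default profile):
# Confirm that AppArmor is active and the docker-default profile is loaded
sudo aa-status | grep docker
# Run a container explicitly under the default profile, or unconfined for comparison
docker run --rm --security-opt apparmor=docker-default ubuntu true
docker run --rm --security-opt apparmor=unconfined ubuntu true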
Demo
# First, format and mount a btrfs filesystem:
mkfs.btrfs /dev/sdb1
mkdir /container
mount /dev/sdb1 /container
mount | grep btrfs
# Make mount propagation private so mounts made later (in the new mount namespace) don't leak back to the host:
mount --make-rprivate /
# Work inside the btrfs mount and prepare directories for images and containers:
cd /container
mkdir -p images containers
# Create a btrfs subvolume for the base image and unpack Alpine into it:
btrfs subvol create images/alpine
wget https://dl-cdn.alpinelinux.org/alpine/v3.18/releases/x86_64/alpine-minirootfs-3.18.0-x86_64.tar.gz
tar -C images/alpine/ -xf alpine-minirootfs-3.18.0-x86_64.tar.gz
# Snapshot the image as a writable "container" filesystem and chroot into it:
btrfs subvol snapshot images/alpine/ containers/nextgenbts
chroot containers/nextgenbts/ sh
exit
# Create new mount, UTS, IPC, network, and PID namespaces and start a shell in them:
unshare --mount --uts --ipc --net --pid --fork bash
# The UTS namespace lets us set a hostname without affecting the host:
hostname nextgenbts
exec bash
# ps still reads the host's /proc, so host processes are visible:
ps
# Try to signal the unshare process by its (host) PID:
pidof unshare
kill <pid>
# Mount a fresh proc for the new PID namespace; now only our own processes show up:
mount -t proc none /proc
ps aux
# The new network namespace has its own (empty) set of interfaces:
ip a