DIY Containers: Building a Lightweight Container System from Scratch on Linux

In today’s cloud-native world, containers have revolutionized how we deploy and manage applications. While Docker and Kubernetes dominate the landscape, understanding what’s happening under the hood can be valuable. This blog post will guide you through building a minimalist container system from scratch using Linux’s namespaces and other kernel features.

What Are Containers, Really?

At their core, containers are just processes running with isolation features provided by the Linux kernel. Unlike virtual machines, containers don’t need a separate OS kernel - they share the host’s kernel while maintaining isolation through several key technologies:

Namespaces: Provide isolation for system resources
Control Groups (cgroups): Limit and account for resource usage
Chroot: Change the root directory for a process
Mount Points: Control what filesystems are visible
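To make the idea of a namespace more concrete, here is a minimal sketch that reads the namespace identifiers of the calling process from /proc/self/ns. The helper names read_namespace_id and print_namespaces are made up for this illustration. Two processes share a namespace when the identifiers printed for them match.

```c
#include <stdio.h>
#include <unistd.h>

/* Read the identifier of one of the calling process's namespaces.
 * ns_name is an entry under /proc/self/ns such as "uts", "pid", "mnt" or "net".
 * On success, buf holds a string like "uts:[4026531838]" and 0 is returned. */
int read_namespace_id(const char *ns_name, char *buf, size_t buf_len) {
    char path[64];
    int n = snprintf(path, sizeof(path), "/proc/self/ns/%s", ns_name);
    if (n < 0 || (size_t)n >= sizeof(path) || buf_len == 0)
        return -1;

    /* Each entry is a symbolic link whose target names the namespace. */
    ssize_t len = readlink(path, buf, buf_len - 1);
    if (len < 0)
        return -1;
    buf[len] = '\0'; /* readlink() does not NUL-terminate the result */
    return 0;
}

/* Print the namespaces that correspond to the isolation used in this post. */
void print_namespaces(void) {
    const char *names[] = { "uts", "pid", "mnt", "net" };
    char id[64];
    for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
        if (read_namespace_id(names[i], id, sizeof(id)) == 0)
            printf("%s\n", id);
    }
}
```

Calling print_namespaces() once on the host and once from a process inside a container should print different identifiers for every namespace the container does not share with the host.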

Before we begin

Before jumping to the code, you need a filesystem for your container. You can create one from a base distribution:

# Create a directory for your rootfs
mkdir -p mycontainer/rootfs

# Use debootstrap to create a minimal Debian system
sudo debootstrap --variant=minbase bullseye mycontainer/rootfs

# OR Download the Arch Linux rootfs
curl -O https://mirrors.edge.kernel.org/archlinux/iso/latest/archlinux-bootstrap-x86_64.tar.zst
sudo tar xf archlinux-bootstrap-x86_64.tar.zst -C mycontainer/rootfs --strip-components=1 

# OR Alpine (Lightweight)
wget https://dl-cdn.alpinelinux.org/alpine/v3.14/releases/x86_64/alpine-minirootfs-3.14.0-x86_64.tar.gz
tar -xzf alpine-minirootfs-3.14.0-x86_64.tar.gz -C mycontainer/rootfs

The rootfs is a directory containing the root filesystem of our container. It contains the necessary files and directories to run a basic system.

Let’s begin

We’ll start with the necessary includes and definitions. And yes, we are doing it in C :)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mount.h>
#include <sys/wait.h>
#include <string.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/stat.h>

#define STACK_SIZE (1024 * 1024)
static char child_stack[STACK_SIZE];

char *child_args[] = { "/bin/sh", NULL };

These includes provide access to the system calls we need to create our container. We also define a stack size for the container process and set the default command to run inside it (/bin/sh). If you know Docker, that last line should look familiar.

Defining the Container Process

Next, we need to define what happens inside our container. The child_main function will be executed inside the container namespace:

int child_main(void *arg) {
    // Set container hostname
    sethostname("minicontainer", strlen("minicontainer"));
    
    // ... more container setup to come
    
    return 1;
}

Setting the Hostname

The first step in our container setup is to set a hostname. This is important because it helps identify the container environment:

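sethostname("minicontainer", strlen("minicontainer"));

This sets the hostname to “minicontainer”. The second argument is the length of the name in bytes, so it has to match the string: passing 10 would truncate the 13-character name to “minicontai”. The hostname is isolated because we’re using the CLONE_NEWUTS namespace flag when creating the container.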
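Since the length argument has to match the string, a small helper that derives it with strlen() avoids truncated names. The following is a sketch under a few assumptions: set_container_hostname is a hypothetical name, and the fallback value of 64 for HOST_NAME_MAX is the Linux limit, used only if the header does not define it.

```c
#include <errno.h>
#include <limits.h>
#include <string.h>
#include <unistd.h>

#ifndef HOST_NAME_MAX
#define HOST_NAME_MAX 64 /* Linux limit; assumed here if limits.h lacks it */
#endif

/* Set the hostname, deriving the length from the string itself.
 * Returns 0 on success and -1 with errno set on failure. */
int set_container_hostname(const char *name) {
    size_t len = strlen(name);
    if (len == 0) {
        errno = EINVAL;
        return -1;
    }
    if (len > HOST_NAME_MAX) {
        errno = ENAMETOOLONG;
        return -1;
    }
    /* Needs CAP_SYS_ADMIN in the current UTS namespace, hence sudo. */
    return sethostname(name, len);
}
```

Inside child_main this would be called as set_container_hostname("minicontainer"), with the return value checked like the other setup steps.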

Changing Root Directory

Now, we need to isolate the filesystem by changing the root directory using chroot():

if (chroot("/path/to/mycontainer/rootfs") != 0 || chdir("/") != 0) {
    perror("chroot/chdir");
    return 1;
}

This changes the root directory to “/path/to/mycontainer/rootfs”, so every absolute path the process resolves from now on starts there. We then call chdir("/") to move the current directory into the new root; otherwise the working directory would still point outside it. If you are still wondering what the rootfs directory is, you can think of it as an OS image.
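If the rootfs path is wrong or the archive was extracted somewhere else, the failure only shows up later as an exec error. A quick check before chroot() can catch that earlier. The sketch below is only an illustration: rootfs_has_shell is a hypothetical helper, and it uses lstat() rather than stat() because some root filesystems ship /bin/sh as a symlink whose target only resolves correctly once we are inside the new root.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Return 1 if <rootfs>/bin/sh exists as a regular file or a symlink,
 * 0 otherwise. Intended to run on the host, before chroot(). */
int rootfs_has_shell(const char *rootfs) {
    char path[4096];
    int n = snprintf(path, sizeof(path), "%s/bin/sh", rootfs);
    if (n < 0 || (size_t)n >= sizeof(path))
        return 0;

    struct stat st;
    /* lstat() does not follow a symlink in the final path component. */
    if (lstat(path, &st) != 0)
        return 0;
    return S_ISREG(st.st_mode) || S_ISLNK(st.st_mode);
}
```

A caller could print a clear message and return early when rootfs_has_shell("/path/to/mycontainer/rootfs") is 0, instead of continuing to chroot().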

Setting Up /proc Filesystem

To give our container access to process information (which tools such as ps read), we need to mount a /proc filesystem:

// Make sure /proc exists
mkdir("/proc", 0555);

// Mount /proc
if (mount("proc", "/proc", "proc", 0, "") != 0) {
    perror("mount /proc");
    return 1;
}

First, we create the /proc directory with appropriate permissions (if it already exists, mkdir() simply fails and we ignore that). Then we mount the proc filesystem. Because the process runs in its own PID namespace (CLONE_NEWPID), this /proc instance only lists processes inside the container, which is why ps in the container cannot see host processes.
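One caveat is worth a sketch. On hosts where / is a shared mount (the default under systemd), a mount made in a new mount namespace can propagate back to the host, so the container's /proc may also appear under the rootfs directory on the host. A common way to avoid this is to mark all mounts private before doing anything else. The helper names below (ensure_dir, make_mounts_private, setup_proc) are hypothetical and this is not the only way to arrange the steps.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Create a directory, treating "already exists" as success. */
int ensure_dir(const char *path, mode_t mode) {
    if (mkdir(path, mode) == 0 || errno == EEXIST)
        return 0;
    return -1;
}

/* Mark every mount in the current mount namespace as private so that
 * later mounts do not propagate to the host. Call this in the child
 * BEFORE chroot(), while "/" is still a real mount point. */
int make_mounts_private(void) {
    return mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL);
}

/* Create and mount /proc. Call this in the child AFTER chroot(). */
int setup_proc(void) {
    if (ensure_dir("/proc", 0555) != 0)
        return -1;
    return mount("proc", "/proc", "proc", 0, NULL);
}
```

With these helpers, child_main would call make_mounts_private(), then chroot() and chdir("/"), then setup_proc(), checking each return value.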

Executing the Container Command

Finally, we execute the command that will run inside our container:

// Execute shell
execv(child_args[0], child_args);
perror("exec");
return 1;

This replaces the current process with the specified command (by default, /bin/sh). If the execv call fails, we print an error and return.
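The command does not have to be fixed at compile time. As a sketch, the program's own arguments could select it, falling back to /bin/sh when none are given. Both select_command and exec_command are hypothetical names for this illustration; the chosen vector would travel to the child through the last argument of clone(), which arrives as the void *arg parameter.

```c
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/* Default command, matching the one used in the post. */
static char *default_args[] = { "/bin/sh", NULL };

/* Pick the argument vector for the container: everything after the
 * program name if present, otherwise the default shell. */
char **select_command(int argc, char **argv) {
    if (argc > 1)
        return argv + 1; /* argv is NULL-terminated, so argv + 1 is too */
    return default_args;
}

/* What the end of the child function could look like: arg is the
 * vector passed as the last argument of clone(). */
int exec_command(void *arg) {
    char **args = arg;
    /* execv() does not search PATH, so args[0] must be a full path
     * that exists inside the new root. */
    execv(args[0], args);
    perror("exec"); /* only reached if execv() failed */
    return 1;
}
```

With this in place, something like sudo ./container /bin/ls -l would run ls inside the container instead of a shell.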

Setting Up the Main Function

Now let’s look at the main function that will create our container:

int main(int argc, char *argv[]) {
    
    int flags = CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET | SIGCHLD;
    printf("Launching container...\n");

    // ... container creation code to come
    
    return 0;
}

Defining Namespace Flags

Let's take a closer look at the flags line from main, since it determines which isolation features our container gets:

int flags = CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET | SIGCHLD;

Each flag provides different isolation:

CLONE_NEWUTS: Isolates hostname and domain name
CLONE_NEWPID: Gives the container its own process ID namespace
CLONE_NEWNS: Creates a new mount namespace
CLONE_NEWNET: Isolates the network stack
SIGCHLD: Not a namespace flag; it is the signal sent to the parent when the child terminates, which is what lets a plain waitpid() wait for it

Creating the Container Process

Finally, we create the container process using the clone() system call:

pid_t pid = clone(child_main, child_stack + STACK_SIZE, flags, NULL);
if (pid < 0) {
    perror("clone");
    return 1;
}

waitpid(pid, NULL, 0);

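The clone() system call creates a new process that runs the child_main function with the specified namespace flags. We pass child_stack + STACK_SIZE rather than child_stack because the stack grows downward on almost all architectures, so clone() expects a pointer to the top of the stack. Then we wait for the container process to finish with waitpid().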
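Passing NULL to waitpid() discards the container's exit status. If you want the launcher to report how the container ended, the status can be decoded with the standard wait macros. This is a sketch, not part of the original program: wait_for_exit_code and demo_exit_code are hypothetical names, the 128 + signal convention is borrowed from shells, and the demo uses fork() as a stand-in for clone() so that it runs without root.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Wait for pid and turn its wait status into a shell-style exit code.
 * Returns -1 if waiting fails or the status is neither exit nor signal. */
int wait_for_exit_code(pid_t pid) {
    int status;
    /* Retry if a signal interrupts the wait. */
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    if (WIFEXITED(status))
        return WEXITSTATUS(status);    /* normal exit */
    if (WIFSIGNALED(status))
        return 128 + WTERMSIG(status); /* killed by a signal */
    return -1;
}

/* Demo: the child exits with the given code and the parent reports it.
 * fork() stands in for clone() here. */
int demo_exit_code(int code) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code); /* child */
    return wait_for_exit_code(pid);
}
```

In main, the last two lines could then become a single return wait_for_exit_code(pid); so that the launcher exits with the same code as the shell inside the container.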

Now our container is ready to go!

The complete program, container.c:

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)
static char child_stack[STACK_SIZE];

char *child_args[] = {"/bin/sh", NULL};

int child_main(void *arg) {
  // Set container hostname (the length must match the string)
  sethostname("minicontainer", strlen("minicontainer"));

  // Replace this with the absolute path to your own rootfs
  if (chroot("/path/to/mycontainer/rootfs") != 0 || chdir("/") != 0) {
    perror("chroot/chdir");
    return 1;
  }

  // Make sure /proc exists (failing because it already exists is fine)
  mkdir("/proc", 0555);

  // Mount /proc
  if (mount("proc", "/proc", "proc", 0, "") != 0) {
    perror("mount /proc");
    return 1;
  }

  // Execute shell; execv only returns if it fails
  execv(child_args[0], child_args);
  perror("exec");
  return 1;
}

int main(int argc, char *argv[]) {

  int flags = CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET | SIGCHLD;
  printf("Launching container...\n");
  pid_t pid = clone(child_main, child_stack + STACK_SIZE, flags, NULL);
  if (pid < 0) {
    perror("clone");
    return 1;
  }

  waitpid(pid, NULL, 0);
  return 0;
}

Building and Running the Container

Compile the code:

gcc -o container container.c

Run it:

sudo ./container

You should now be in a shell inside your container with its own isolated environment!

❯ sudo ./container
Launching container...
sh-5.2#
sh-5.2# ls
bin   dev  home  lib64	opt   root  sbin  sys  usr  version
boot  etc  lib	 mnt	proc  run   srv   tmp  var
sh-5.2#

What Does This Container Provide?

Our minimal container provides:

Process Isolation: Processes inside the container can’t see processes outside
Filesystem Isolation: The container has its own root filesystem
Hostname Isolation: The container has its own hostname
Network Isolation: The container has its own network namespace

This is a minimalist implementation. Production-grade container systems also include:

cgroup support to limit CPU and memory usage
Implementation of user namespace isolation (CLONE_NEWUSER)
Bridge network interface for container connectivity
Support for mounting volumes from the host
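To give a flavor of the first item, here is a rough sketch of how a memory limit could be applied through the cgroup v2 filesystem. It rests on several assumptions that the post does not cover: cgroup v2 is mounted at /sys/fs/cgroup, the memory controller is enabled for the parent cgroup, the program runs as root, and the directory name /sys/fs/cgroup/minicontainer is made up. The function names are hypothetical as well.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Write a single value into a control file inside a cgroup directory. */
static int cgroup_write(const char *dir, const char *file, const char *value) {
    char path[4096];
    int n = snprintf(path, sizeof(path), "%s/%s", dir, file);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    int ok = fputs(value, f) >= 0;
    /* fclose() flushes the buffer, which is when the write reaches the kernel. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Create the cgroup (a plain mkdir on the cgroup filesystem), cap its
 * memory, and move the given process into it. */
int cgroup_limit_memory(const char *dir, pid_t pid, long max_bytes) {
    char value[32];

    if (mkdir(dir, 0755) != 0 && errno != EEXIST)
        return -1;

    snprintf(value, sizeof(value), "%ld", max_bytes);
    if (cgroup_write(dir, "memory.max", value) != 0)
        return -1;

    snprintf(value, sizeof(value), "%ld", (long)pid);
    return cgroup_write(dir, "cgroup.procs", value);
}
```

The parent could call cgroup_limit_memory("/sys/fs/cgroup/minicontainer", pid, 64L * 1024 * 1024) right after clone() returns, which would cap the container at 64 MiB under the assumptions above.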

Conclusion

Building a container system from scratch helps in understanding what’s going on under the hood. Though our implementation is basic, it reflects the core mechanisms that underlie all container systems, including Docker and containerd.

The Linux kernel provides all the building blocks we need (namespaces, chroot, and mount points) to create isolated environments for running applications.

So that’s it.