Namespaces: The Seven Kingdoms
Linux Namespaces are the fundamental isolation primitive of containers. They wrap a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource.
While there are several namespaces, seven are crucial for containers.
The Big 7
| Namespace | Flag | Isolates | Purpose in Containers |
|---|---|---|---|
| User | CLONE_NEWUSER | User & Group IDs | Allows a process to run as root inside while being a normal user outside. |
| Mount | CLONE_NEWNS | Mount points | Provides a separate filesystem view (the rootfs). |
| PID | CLONE_NEWPID | Process IDs | Ensures the container’s init process sees itself as PID 1. |
| Network | CLONE_NEWNET | Network stack | Gives the container its own IP, localhost, and routing table. |
| UTS | CLONE_NEWUTS | Hostname | Allows the container to have its own hostname. |
| IPC | CLONE_NEWIPC | IPC resources | Prevents shared memory attacks between host and container. |
| Cgroup | CLONE_NEWCGROUP | Cgroup root | Isolates the view of cgroup hierarchy (less commonly used directly). |
The User Namespace: The Key to Rootless
The User Namespace is special. It is the only namespace that an unprivileged user can create without sudo.
When creating a new User Namespace:
- I become UID 0 (root) inside that namespace.
- I gain a full set of Capabilities (like
CAP_SYS_ADMIN,CAP_NET_ADMIN) but only within that namespace.
This is the magic trick:
“To perform privileged operations (like mounting filesystems or configuring networks), I don’t need to be real root. I just need to be root inside a User Namespace.”
Example: Becoming Root (Fake)
Try this in your terminal:
# -U: User Namespace
# -r: Map current user to root inside
unshare -U -r whoami
Output:
root
You are now root! (Well, technically, you are root inside that ephemeral namespace).
Manipulating Namespaces
I use two primary syscalls (wrapped by command-line tools) to manage namespaces.
1. unshare (Create)
The unshare command creates new namespaces and then executes a program inside them.
In my rootless script:
unshare --user --fork --map-root-user bash
This creates a new user namespace and runs bash inside it.
2. nsenter (Join)
The nsenter command allows an existing process to “enter” the namespaces of another process. This is how docker exec works.
Every process has a directory in /proc/[pid]/ns/ containing links to its namespaces:
ls -l /proc/$$/ns/
# lrwxrwxrwx 1 vagmi vagmi 0 Jan 12 10:00 net -> 'net:[4026531992]'
# lrwxrwxrwx 1 vagmi vagmi 0 Jan 12 10:00 user -> 'user:[4026531837]'
To join a container, I target its PID:
nsenter --target $CONTAINER_PID --mount --net --pid /bin/sh
The Order of Operations
In my rootless implementation, the order matters immensely:
- Create User Namespace: I must do this first to gain the privileges needed to create the others.
- Become Root (Inside): I map my host UID to root.
- Create Other Namespaces: Now that I am “root”, I can create Mount, Network, and PID namespaces.
- Mount Filesystems: I mount
/proc,/sys, and myoverlayfsroot.
If I tried to create a Mount namespace (unshare -m) before the User namespace, the kernel would deny me because I am not real root.