Docker SECCOMP prevents system calls issued by the OCI runtime

I want to apply SECCOMP only to the commands in the Docker image.
However, it seems to be applied to the OCI runtime as well, and the container fails to start.

# docker run -it --security-opt seccomp:seccomp.json stali-rootfs-o0 /bin/sh
docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "lstat /proc/self/fd/0: operation not permitted": unknown.

Can I exempt the OCI runtime from SECCOMP?


Kuniyasu Suzaki

The error is a bit confusing, but it’s actually the container’s process failing to start:

starting container process caused “lstat /proc/self/fd/0: operation not permitted”

When I run the docker command without "--security-opt seccomp:seccomp.json", it works.
I forgot to mention that this is Docker on arm64.
The seccomp.json is as follows.

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {
      "names": [
        "setxattr",
        "lsetxattr",
        "fsetxattr",
        "getxattr",
        ....
        "prlimit64",
        "setns",
        "getrandom"
      ],
      "action": "SCMP_ACT_ALLOW",
      "args": []
    }
  ]
}

I only want to apply SECCOMP to the commands in the docker image.

The way Linux containers work is a bit like how programs such as nginx or Apache drop privileges when they start up.

The process that wants to be containerized drops its own privileges, enters new mount/network/user/pid/etc namespaces, applies seccomp profiles, etc…

In the case of runc/docker style containers, we force a process to be containerized rather than giving it a choice. To do this, runc (or any container runtime) starts up and applies the specified security policies, namespaces, etc. to itself, then execs the requested binary, so the new process simply runs in those namespaces with those policies already applied.

So there is no real way to apply the profile only to the end result. To get to that point, the runtime has to apply it to itself first.

I am afraid that this reply is out of the scope of this forum, but …

Should I use rkt to limit system calls only for the applications in a container image?
Its home page says:


Recommendations

  1. Only allow syscalls needed by an application, according to its typical usage

No. You cannot do what you want to do with Linux. It doesn’t matter which container runtime you use.

Thank you for your explanation.
Could you tell me how Docker enforces SECCOMP? I want to understand the implementation.

I run a Docker container as follows.

$ sudo docker run -it  --security-opt seccomp:seccomp.json ubuntu:16.04 bash
root@5fe17519d95c:/# bash
root@5fe17519d95c:/# ps
  PID TTY          TIME CMD
    1 ?        00:00:00 bash
   11 ?        00:00:00 bash
   16 ?        00:00:00 ps
root@5fe17519d95c:/# grep -i seccomp /proc/1/status
Seccomp:	2
root@5fe17519d95c:/# grep -i seccomp /proc/11/status
Seccomp:	2

Two bash processes run in the container, and SECCOMP is enforced on both, as expected.
I also checked the processes from the host with the pstree command.

systemd(1)-+-ModemManager(1262)-+-{gdbus}(1291)
           |                    `-{gmain}(1288)
           |-NetworkManager(1271)-+-{gdbus}(1381)
(omit)
           |                    
           |-dockerd(1384)-+-containerd(1509)-+-containerd-shim(532)-+-bash(549)---bash(575)
           |               |                  |                      |-{containerd-shim}(533)
           |               |                  |                      |-{containerd-shim}(534)
           |               |                  |                      |-{containerd-shim}(535)
           |               |                  |                      |-{containerd-shim}(536)
           |               |                  |                      |-{containerd-shim}(537)
           |               |                  |                      |-{containerd-shim}(538)
           |               |                  |                      |-{containerd-shim}(540)
           |               |                  |                      `-{containerd-shim}(541)

$ grep -i seccomp /proc/532/status
Seccomp:	0
$ grep -i seccomp /proc/549/status
Seccomp:	2
$ grep -i seccomp /proc/575/status
Seccomp:	2
$ grep -i seccomp /proc/533/status
Seccomp:	0
$ grep -i seccomp /proc/534/status
Seccomp:	0
$ grep -i seccomp /proc/535/status
Seccomp:	0
$ grep -i seccomp /proc/536/status
Seccomp:	0
$ grep -i seccomp /proc/537/status
Seccomp:	0
$ grep -i seccomp /proc/538/status
Seccomp:	0
$ grep -i seccomp /proc/540/status
Seccomp:	0
$ grep -i seccomp /proc/541/status
Seccomp:	0

containerd-shim(532), the parent process of bash(549) (= bash(1) in the container), is exempt from SECCOMP.
bash(549) and bash(575) on the host are bash(1) and bash(11) in the container, and SECCOMP is enforced on them.
The containerd-shim threads (533)-(541) are also exempt from SECCOMP.
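
For reference, the Seccomp field in /proc/<pid>/status encodes the mode: 0 means seccomp is disabled, 1 is strict mode, and 2 is filter mode (the mode a JSON profile results in). A minimal Python sketch for decoding it; the helper names seccomp_mode and seccomp_mode_of are made up for illustration:

```python
import re

# Values of the "Seccomp:" field in /proc/<pid>/status.
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text):
    """Decode the Seccomp field from the text of a /proc/<pid>/status file.

    Returns None when the field is absent (e.g. a kernel built without seccomp).
    """
    match = re.search(r"^Seccomp:\s*(\d+)\s*$", status_text, re.MULTILINE)
    if match is None:
        return None
    return SECCOMP_MODES.get(int(match.group(1)), "unknown")


def seccomp_mode_of(pid):
    """Read /proc/<pid>/status on a Linux host and decode its Seccomp field."""
    with open(f"/proc/{pid}/status") as status_file:
        return seccomp_mode(status_file.read())
```

With the PIDs from the listing above, seccomp_mode_of(549) would be expected to report "filter" and seccomp_mode_of(532) "disabled".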

When is SECCOMP enforced, and how?
I expected SECCOMP to be enforced on containerd-shim(532) when docker run is launched.
Why are containerd-shim(533)-(541) exempt from SECCOMP?

seccomp is enforced in runc.

The call chain looks like this: dockerd -> containerd -> containerd-shim -> runc

You can’t apply a seccomp profile that would prevent runc from being able to run your process.
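
As a rough pre-flight check, the Python sketch below reports which of a given set of syscalls a profile in the JSON format shown earlier would not allow. The function name and the list of runtime syscalls are illustrative assumptions, not an authoritative list of what runc needs. One hedged observation: arm64 has no lstat syscall, so lstat() there is normally issued as newfstatat, which may be why the lstat on /proc/self/fd/0 fails when newfstatat is missing from the allow list.

```python
import json

ALLOW = "SCMP_ACT_ALLOW"


def denied_syscalls(profile, needed):
    """Return the names in `needed` that `profile` does not allow outright.

    Simplifications: a rule with argument filters ("args") is treated as not
    allowing the syscall, and if a name appears in several rules only the
    first occurrence is considered.
    """
    default_allows = profile.get("defaultAction") == ALLOW
    verdict = {}
    for rule in profile.get("syscalls", []):
        # Older profiles use a single "name"; newer ones use a "names" list.
        names = rule.get("names") or [rule.get("name")]
        allows = rule.get("action") == ALLOW and not rule.get("args")
        for name in names:
            if name:
                verdict.setdefault(name, allows)
    return sorted(n for n in needed if not verdict.get(n, default_allows))


profile = json.loads("""
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {"names": ["getxattr", "prlimit64", "setns", "getrandom"],
     "action": "SCMP_ACT_ALLOW", "args": []}
  ]
}
""")

# Hypothetical subset of syscalls a runtime might issue before exec on arm64.
RUNTIME_SYSCALLS = ["newfstatat", "execve", "setns", "prlimit64"]

print(denied_syscalls(profile, RUNTIME_SYSCALLS))  # ['execve', 'newfstatat']
```

Any name this prints would have to be added to the profile before the runtime could get as far as exec'ing the container's command.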

Is my understanding correct?

SECCOMP is inherited from the “runc” process because Linux’s “seccomp” system call restricts system calls only for the process that issues it. Linux has no mechanism to enforce SECCOMP on an arbitrary other process. Therefore, “runc” applies SECCOMP to its own process and then issues “exec” for the application in the container.

The procedure is as follows.

dockerd -> containerd -> containerd-shim (fork and exec “runc” with SECCOMP)

dockerd -> containerd -> containerd-shim -> runc

dockerd -> containerd -> containerd-shim -> runc (“exec” bash in a container)

dockerd -> containerd -> containerd-shim -> bash (keeps SECCOMP which is set up by runc)

In general, the process executed in a container (e.g., “bash”) has no mechanism to set up SECCOMP.
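
The inheritance in the steps above can be observed without writing a filter: the kernel copies the seccomp state to children on fork and keeps it across exec, so a freshly started child reports the same Seccomp field as its parent. A minimal Python sketch, assuming a Linux host with /proc mounted; the function names are made up for illustration:

```python
import re
import subprocess
import sys

FIELD = r"^Seccomp:\s*(\d+)"

# Source for the child: the same lookup, run in a freshly exec'd interpreter.
CHILD_CODE = (
    "import re\n"
    "m = re.search(r'^Seccomp:\\s*(\\d+)', open('/proc/self/status').read(), re.M)\n"
    "print(m.group(1) if m else '')\n"
)


def own_seccomp_field():
    """Return this process's Seccomp field from /proc/self/status ('' if absent)."""
    with open("/proc/self/status") as status_file:
        match = re.search(FIELD, status_file.read(), re.MULTILINE)
    return match.group(1) if match else ""


def child_seccomp_field():
    """Fork and exec a new interpreter and return the Seccomp field it reports."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print("parent:", own_seccomp_field(), "child:", child_seccomp_field())
```

Run inside a container started with a profile, both values would be expected to be 2; on an unconfined host both are 0.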

That looks right.
Except that, as long as the active profile does not block the relevant syscalls, you should be able to apply a new profile from inside the container as well.

I tried to determine which system calls runc (docker-runc?) uses with ftrace.
Unfortunately, I could not find any system calls in the trace.

echo function > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/tracing_on
docker run -it ubuntu:14.04 bash
# ls
# exit
echo 0 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace_pipe > trace.log
grep runc trace.log

Is there a good way to determine which system calls runc uses?
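
One way to narrow such a log down is to keep only the syscall entry functions recorded for tasks whose name contains "runc". The Python sketch below does that under stated assumptions: the line layout follows the usual function-tracer format (task-pid, CPU, flags, timestamp, function), and the entry-function prefixes (__arm64_sys_, __x64_sys_, SyS_, sys_) vary with kernel version and architecture. The sample lines are illustrative, not real output.

```python
import re

# "<comm>-<pid>  [cpu] flags  timestamp: function <-caller"
LINE = re.compile(
    r"^\s*(?P<comm>.+?)-(?P<pid>\d+)\s+\[\d+\].*?:\s+(?P<func>\S+)\s+<-"
)
# Assumed naming schemes for syscall entry functions; adjust for the kernel in use.
SYSCALL = re.compile(r"^(?:__arm64_sys_|__x64_sys_|SyS_|sys_)(\w+)$")


def syscalls_in_trace(lines, comm_substring="runc"):
    """Return the sorted syscall names entered by tasks whose comm matches."""
    found = set()
    for line in lines:
        parsed = LINE.match(line)
        if parsed is None or comm_substring not in parsed.group("comm"):
            continue
        entry = SYSCALL.match(parsed.group("func"))
        if entry:
            found.add(entry.group(1))
    return sorted(found)


# Illustrative lines only; a real log would be read from trace.log.
sample = [
    "            runc-1234  [001] ....   100.000001: __arm64_sys_openat <-invoke_syscall",
    "            runc-1234  [001] ....   100.000002: do_sys_openat2 <-__arm64_sys_openat",
    "   runc:[2:INIT]-1240  [000] ....   100.000003: __arm64_sys_newfstatat <-invoke_syscall",
    "            bash-1300  [000] ....   100.000004: __arm64_sys_read <-invoke_syscall",
]

print(syscalls_in_trace(sample))  # ['newfstatat', 'openat']
```

If the function tracer produces too much data for the ring buffer, the raw_syscalls/sys_enter trace event under the same tracing directory may be a lower-volume source, although it records syscall numbers rather than names.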