Docker SECCOMP prevents system calls issued by OCI runtime

kunisuzaki · April 28, 2018, 9:52am

I want to apply a SECCOMP profile only to the commands in the Docker image.
However, it seems to be applied to the OCI runtime as well, and the container fails to start.

# docker run -it --security-opt seccomp:seccomp.json stali-rootfs-o0 /bin/sh
docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "lstat /proc/self/fd/0: operation not permitted": unknown.

Can I exempt the OCI runtime from the SECCOMP profile?

Kuniyasu Suzaki

thaJeztah · April 28, 2018, 10:03am

The error is a bit confusing, but it’s actually the container’s process failing to start:

starting container process caused “lstat /proc/self/fd/0: operation not permitted”

kunisuzaki · April 28, 2018, 11:52am

When I run the docker command without "--security-opt seccomp:seccomp.json", it works.
I forgot to mention that this is Docker running on arm64.
The seccomp.json is as follows.

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {
      "names": [
        "setxattr",
        "lsetxattr",
        "fsetxattr",
        "getxattr",
        ....
        "prlimit64",
        "setns",
        "getrandom"
      ],
      "action": "SCMP_ACT_ALLOW",
      "args": []
    }
  ]
}

I only want to apply SECCOMP to the commands in the docker image.
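
A quick way to see which syscalls a deny-by-default profile like this one leaves out is to compare its allow list against a list of calls you care about. The sketch below is only an illustration: the inline profile is a shortened stand-in for the elided seccomp.json, and the `required` list is hypothetical (which calls the runtime actually needs depends on its version and on the architecture).

```python
import json


def missing_syscalls(profile, required):
    """Return the names in `required` that the profile does not allow.

    Assumes a deny-by-default profile (defaultAction is not ALLOW) and
    only counts unconditional SCMP_ACT_ALLOW rules, i.e. rules whose
    "args" list is empty.
    """
    allowed = set()
    for rule in profile.get("syscalls", []):
        if rule.get("action") == "SCMP_ACT_ALLOW" and not rule.get("args"):
            allowed.update(rule.get("names", []))
    return sorted(set(required) - allowed)


# Shortened stand-in for the profile above (the real list is elided).
profile = json.loads("""
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {"names": ["setxattr", "getxattr", "setns"],
     "action": "SCMP_ACT_ALLOW",
     "args": []}
  ]
}
""")

# Hypothetical list of syscalls to check for.
required = ["getxattr", "execve", "newfstatat"]
print(missing_syscalls(profile, required))
```

Anything reported as missing fails with the profile's default action (here an errno) for every process the filter applies to.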

cpuguy83 · April 28, 2018, 12:23pm

The way Linux containers work is a bit like how things like nginx or Apache drop privileges when they start up.

The process that wants to be containerized drops its own privileges, enters new mount/network/user/pid/etc namespaces, applies seccomp profiles, etc…

In the case of runc/docker style containers, we force a process to be containerized rather than giving it a choice. To do this, runc (or any container runtime) starts up and starts applying the specified security policies, namespaces, etc., then execs the requested binary, so the new process is now in those namespaces and has those policies applied.

So there is no real way to apply the profile only to the end result. To get there, the runtime has to apply it to itself.
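
The "restrict yourself, then exec" pattern can be sketched without seccomp at all. The example below is an analogy, not what runc does: it uses a resource limit (RLIMIT_NOFILE) in place of a seccomp filter, because that can be set from the Python standard library. The function name `limit_then_exec` is made up for illustration. The point it shows is the same: the restriction is installed by the forked child on itself, and the program that is exec'd afterwards simply inherits it.

```python
import resource
import subprocess
import sys


def limit_then_exec(soft_limit):
    """Fork, lower RLIMIT_NOFILE in the child, exec a new program,
    and return the limit that the new program observes."""

    def restrict_self():
        # Runs in the forked child right before exec. This is the point
        # at which a container runtime would install its restrictions;
        # here a resource limit stands in for the seccomp filter.
        _, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))

    # The exec'd program does nothing special; it only reports its limit.
    child_code = (
        "import resource; "
        "print(resource.getrlimit(resource.RLIMIT_NOFILE)[0])"
    )
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        preexec_fn=restrict_self,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    print(limit_then_exec(64))
```

Everything the child does between installing the restriction and calling exec, including the exec itself, already runs under that restriction, which is why a seccomp profile has to allow whatever the runtime still needs at that stage.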

kunisuzaki · April 28, 2018, 3:09pm

I am afraid that this reply is out of the scope of this forum, but …

Should I use rkt to limit system calls only for the applications in a container image?
Its home page says:

Recommendations

Only allow syscalls needed by an application, according to its typical usage

cpuguy83 · April 28, 2018, 4:05pm

No. You cannot do what you want to do with Linux. It doesn’t matter which container runtime you use.

kunisuzaki · April 29, 2018, 11:03am

Thank you for your explanation.
Could you tell me how SECCOMP is enforced in Docker? I want to understand the implementation.

I ran a Docker container as follows.

$ sudo docker run -it  --security-opt seccomp:seccomp.json ubuntu:16.04 bash
root@5fe17519d95c:/# bash
root@5fe17519d95c:/# ps
  PID TTY          TIME CMD
    1 ?        00:00:00 bash
   11 ?        00:00:00 bash
   16 ?        00:00:00 ps
root@5fe17519d95c:/# grep -i seccomp /proc/1/status
Seccomp:	2
root@5fe17519d95c:/# grep -i seccomp /proc/11/status
Seccomp:	2

Two bash processes run in the container, and SECCOMP is enforced on both, as expected.
I also checked the processes on the host with the pstree command.

systemd(1)-+-ModemManager(1262)-+-{gdbus}(1291)
           |                    `-{gmain}(1288)
           |-NetworkManager(1271)-+-{gdbus}(1381)
(omit)
           |                    
           |-dockerd(1384)-+-containerd(1509)-+-containerd-shim(532)-+-bash(549)---bash(575)
           |               |                  |                      |-{containerd-shim}(533)
           |               |                  |                      |-{containerd-shim}(534)
           |               |                  |                      |-{containerd-shim}(535)
           |               |                  |                      |-{containerd-shim}(536)
           |               |                  |                      |-{containerd-shim}(537)
           |               |                  |                      |-{containerd-shim}(538)
           |               |                  |                      |-{containerd-shim}(540)
           |               |                  |                      `-{containerd-shim}(541)

$ grep -i seccomp /proc/532/status
Seccomp:	0
$ grep -i seccomp /proc/549/status
Seccomp:	2
$ grep -i seccomp /proc/575/status
Seccomp:	2
$ grep -i seccomp /proc/533/status
Seccomp:	0
$ grep -i seccomp /proc/534/status
Seccomp:	0
$ grep -i seccomp /proc/535/status
Seccomp:	0
$ grep -i seccomp /proc/536/status
Seccomp:	0
$ grep -i seccomp /proc/537/status
Seccomp:	0
$ grep -i seccomp /proc/538/status
Seccomp:	0
$ grep -i seccomp /proc/540/status
Seccomp:	0
$ grep -i seccomp /proc/541/status

containerd-shim(532), which is the parent process of bash(549) (= bash(1) in the container), is exempt from SECCOMP.
bash(549) and bash(575) on the host are bash(1) and bash(11) in the container, and SECCOMP is enforced on them.
containerd-shim(533)-(541) are also exempt from SECCOMP.

When is SECCOMP enforced, and how?
I had assumed that SECCOMP would be enforced on containerd-shim(532) when docker run is launched.
Why are containerd-shim(533)-(541) exempt from SECCOMP?

cpuguy83 · April 29, 2018, 11:47am

seccomp is enforced in runc.

The call chain looks like this: dockerd -> containerd -> containerd-shim -> runc

You can’t apply a seccomp profile that would prevent runc from being able to run your process.

kunisuzaki · April 29, 2018, 1:55pm

Is my understanding correct?

SECCOMP is inherited from the “runc” process, because Linux’s “seccomp” is a system call that restricts system calls only for the process that issues it. Linux has no mechanism to enforce SECCOMP on an arbitrary process. Therefore, “runc” sets up SECCOMP on its own process and then issues “exec” for the application in the container.

The procedure is as follows.

dockerd -> containerd -> containerd-shim (fork and exec “runc” with SECCOMP)

dockerd -> containerd -> containerd-shim -> runc

dockerd -> containerd -> containerd-shim -> runc (“exec” bash in a container)

dockerd -> containerd -> containerd-shim -> bash (keeps SECCOMP which is set up by runc)

In general, the process executed in a container (e.g., “bash”) has no mechanism to set up SECCOMP.

cpuguy83 · April 30, 2018, 2:28pm

That looks right.
Except that, unless the profile restricts the seccomp syscall itself, you should be able to apply a new profile inside the container as well.

kunisuzaki · May 3, 2018, 2:32pm

I have tried to determine the system calls used by runc (docker-runc?) using ftrace.
Unfortunately, I could not find any system calls.

echo function > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/tracing_on
docker run -it ubuntu:14.04 bash
# ls
# exit
echo 0 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace_pipe > trace.log
grep runc trace.log

Is there a good way to determine the system calls used by runc?
