  • Language
    Go
  • License
    Apache License 2.0

Linux Subsystem for FreeBSD (😈 on 🐧)

LSF: Linux Subsystem for FreeBSD

Emulates FreeBSD on Linux. Designed to be extensible to support other Unix-like OS personalities too.

Usage

Tested on Ubuntu 22.04 (kernel 5.15). Requires Linux kernel 5.6 or later.

With Docker (easy)

(linux)$ docker build -t lsf .

(linux)$ docker run -it --rm --security-opt seccomp=unconfined lsf
# file /bin/sh
/bin/sh: ELF 64-bit LSB pie executable, x86-64, version 1 (FreeBSD), dynamically linked, interpreter /libexec/ld-elf.so.1, for FreeBSD 13.1, FreeBSD-style, stripped
# uname -a
FreeBSD 177f2177ddab 13.1-RELEASE-p1 FreeBSD 13.1-RELEASE-p1 LSF  amd64

Without Docker (hard, dangerous)

⚠️ Running LSF outside a container is highly discouraged, and may break the host Linux system.

make
install _output/bin/lsf ~/bin/

mkdir -p ~/freebsd/rootfs
curl -SL http://ftp.freebsd.org/pub/FreeBSD/releases/amd64/13.1-RELEASE/base.txz | tar -xJ -f - -C ~/freebsd/rootfs

cd ~/freebsd/rootfs
export LD_LIBRARY_PATH=$(pwd)/lib

lsf -- libexec/ld-elf.so.1 usr/bin/uname -a

Status

Proof of concept (PoC).

  • Crashes very frequently.
  • Lots of syscalls are still unimplemented.
  • Only the x86_64 (amd64) architecture is supported.

Troubleshooting

  • Retry docker run several times if you see Error: input/output error.
  • Use docker run -e LSF_DEBUG=1 to enable debug output.
  • Use docker exec -it <CONTAINERID> /lsf -- /bin/sh to open another shell.

How it works

Executable pages

Surprisingly, the Linux kernel does not validate the OSABI of ELF binaries on execve(). So, LSF can "just" load ELFOSABI_FREEBSD binaries with execve(), without having to set up the PROT_EXEC pages by itself.

Syscalls

Syscalls are trapped using the plain old PTRACE_SYSCALL.

Unlike UML, LSF does not use PTRACE_SYSEMU, which reduces the ptrace overhead when the trapped syscall rarely needs to be executed by the host kernel. In the case of LSF, most syscalls can simply be passed through to the Linux kernel with different register values, such as the syscall number in the RAX register.

Syscall User Dispatch is not used either.
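
The pass-through case can be illustrated with a small lookup table. This is only a sketch, not LSF's actual code: the names `freebsdToLinux` and `translateSyscall` are hypothetical, and the table lists just a handful of amd64 syscall numbers whose arguments are compatible on both systems.

```go
package main

import "fmt"

// freebsdToLinux maps a few FreeBSD amd64 syscall numbers to their Linux
// x86_64 counterparts. It is illustrative and far from complete.
var freebsdToLinux = map[uint64]uint64{
	1:  60, // exit
	3:  0,  // read
	4:  1,  // write
	6:  3,  // close
	20: 39, // getpid
}

// translateSyscall returns the Linux syscall number that a tracer would
// write into RAX for a trapped FreeBSD syscall, and whether a simple
// pass-through mapping is known for it.
func translateSyscall(freebsdNum uint64) (linuxNum uint64, ok bool) {
	linuxNum, ok = freebsdToLinux[freebsdNum]
	return linuxNum, ok
}

func main() {
	for _, n := range []uint64{4, 20, 9999} {
		if l, ok := translateSyscall(n); ok {
			fmt.Printf("FreeBSD syscall %d -> Linux syscall %d\n", n, l)
		} else {
			fmt.Printf("FreeBSD syscall %d needs a dedicated handler\n", n)
		}
	}
}
```

In a real tracer the translated number would be written back with PTRACE_SETREGS at the syscall-entry stop, before the kernel executes the call.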

Syscall ABI

The syscall ABI is almost the same across Linux and FreeBSD: the syscall number is stored in the RAX register, and the syscall arguments are stored in the RDI, RSI, RDX, R10, R8, and R9 registers.

This is similar to the System V AMD64 ABI calling convention for the userspace (RDI, RSI, RDX, RCX, R8, R9). However, it should be noted that in the case of the syscalls, the fourth argument is stored in R10, not RCX, because the syscall instruction (0F 05) clobbers RCX.

The return value is stored back in the RAX register. On Linux, an errno is returned in the RAX register too, but as a negative value.

In addition, FreeBSD processes expect the CF flag of the RFLAGS register to be set on an error. LSF sets the CF flag using PTRACE_SETREGS.
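
A minimal sketch of the result conversion follows. It assumes the FreeBSD amd64 convention of a positive errno in RAX together with a set carry flag; the `regs` type and `toFreeBSDResult` are hypothetical stand-ins for the register set a tracer would read and write with PTRACE_GETREGS and PTRACE_SETREGS, and are not taken from LSF.

```go
package main

import "fmt"

const carryFlag = 1 << 0 // CF is bit 0 of RFLAGS

// regs is a hypothetical, pared-down register set. A real tracer would use
// the full structure filled in by PTRACE_GETREGS.
type regs struct {
	Rax    uint64
	Eflags uint64
}

// toFreeBSDResult rewrites a Linux syscall result in place. Linux reports
// errors as values in the range -4095..-1 in RAX. The sketch turns those
// into a positive errno with CF set (assumed FreeBSD convention) and clears
// CF on success. Errno numbers that differ between the two systems would
// also need translating; that step is omitted here.
func toFreeBSDResult(r *regs) {
	ret := int64(r.Rax)
	if ret < 0 && ret >= -4095 {
		r.Rax = uint64(-ret)
		r.Eflags |= carryFlag
		return
	}
	r.Eflags &^= carryFlag
}

func main() {
	enoent := int64(-2) // Linux reports ENOENT as -2 in RAX
	r := regs{Rax: uint64(enoent)}
	toFreeBSDResult(&r)
	fmt.Printf("RAX=%d CF=%d\n", r.Rax, r.Eflags&carryFlag)
}
```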

Syscall handlers

Some syscalls cannot be passed through just by changing the register values, either because the corresponding syscall is missing in Linux, or because the syscall takes an incompatible argument, such as a struct with different members:

int fstat(int fd, struct stat *buf);

In such a case, LSF rewrites the syscall number in the RAX register to a "NOP" syscall number (getpid()), and handles the original syscall arguments in userspace when the "NOP" syscall exits.

The userspace handler uses pidfd_getfd() to fetch the file descriptors, translates the struct definitions, and calls Linux syscalls to emulate the requested FreeBSD syscall.
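
The struct translation step for fstat() might look roughly like the sketch below. Everything here is an assumption for illustration: `linuxStat` is a pared-down stand-in rather than the real Linux layout, `freebsdStat` follows one reading of the FreeBSD 13 amd64 `struct stat` and should be checked against the FreeBSD headers, and `convertStat` and `encodeStat` are hypothetical helpers, not LSF functions.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

type timespec struct{ Sec, Nsec int64 }

// linuxStat holds the fields of a Linux fstat() result used by this sketch.
type linuxStat struct {
	Dev, Ino, Nlink  uint64
	Mode, Uid, Gid   uint32
	Rdev             uint64
	Size             int64
	Blksize, Blocks  int64
	Atim, Mtim, Ctim timespec
}

// freebsdStat is an assumed layout of FreeBSD 13's struct stat on amd64.
// Explicit padding fields keep the encoded size at 224 bytes.
type freebsdStat struct {
	Dev, Ino, Nlink            uint64
	Mode                       uint16
	Padding0                   int16
	Uid, Gid                   uint32
	Padding1                   int32
	Rdev                       uint64
	Atim, Mtim, Ctim, Birthtim timespec
	Size, Blocks               int64
	Blksize                    int32
	Flags                      uint32
	Gen                        uint64
	Spare                      [10]uint64
}

// convertStat copies the fields both systems share. Birthtim, Flags, and Gen
// are left zero, and device numbers are copied without re-encoding.
func convertStat(l linuxStat) freebsdStat {
	return freebsdStat{
		Dev: l.Dev, Ino: l.Ino, Nlink: l.Nlink,
		Mode: uint16(l.Mode),
		Uid:  l.Uid, Gid: l.Gid, Rdev: l.Rdev,
		Atim: l.Atim, Mtim: l.Mtim, Ctim: l.Ctim,
		Size: l.Size, Blocks: l.Blocks,
		Blksize: int32(l.Blksize),
	}
}

// encodeStat serializes the struct into the little-endian bytes that a
// tracer would then write into the tracee's buffer.
func encodeStat(s freebsdStat) ([]byte, error) {
	var buf bytes.Buffer
	if err := binary.Write(&buf, binary.LittleEndian, s); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	b, err := encodeStat(convertStat(linuxStat{Mode: 0o100644, Size: 42}))
	if err != nil {
		panic(err)
	}
	fmt.Printf("encoded %d bytes\n", len(b))
}
```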

The pidfd_getfd() syscall has been available since Linux kernel 5.6, but it is disabled in Docker's default seccomp profile. So, running LSF inside Docker needs --security-opt seccomp=unconfined, or at least a custom seccomp profile that allows pidfd_getfd(). Enabling pidfd_getfd() does NOT require acquiring the CAP_SYS_PTRACE capability.

Instead of using pidfd_getfd(), LSF could alternatively use the symlinks under /proc/<PID>/fd/ and the position information under /proc/<PID>/fdinfo/ to create another descriptor with similar internal state. However, this approach is not as robust as pidfd_getfd(), and is very unlikely to work with descriptors of non-regular files.

Thread-local storage

FreeBSD processes expect the TLS pointer (FSBASE) to be initialized by the kernel, while the Linux kernel does not provide it.

LSF uses PTRACE_PEEKTEXT and PTRACE_POKETEXT to save the original code and inject the syscall instruction (0F 05) into the code of the FreeBSD process, so that the TLS can be allocated with brk(). After single-stepping the syscall instruction, LSF restores the code and rewinds the instruction pointer to the original position.

The TLS is initialized with the .tdata and .tbss sections of the ELF. At the end of the TLS, there is the TLS pointer that points to itself. The FSBASE register is set to this pointer.
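
The TLS layout described above can be sketched as a pure function. This is an illustration under simplifying assumptions: `buildTLS` is a hypothetical helper, alignment requirements of the TLS segment are ignored, and `base` stands for the address in the tracee at which the block would be written.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// buildTLS returns the bytes of a TLS block to be placed at address base,
// plus the value to load into FSBASE. The block holds a copy of .tdata,
// then tbssSize zero bytes for .tbss, then an 8-byte pointer that points
// to itself. Alignment handling is omitted in this sketch.
func buildTLS(base uint64, tdata []byte, tbssSize int) (block []byte, fsbase uint64) {
	block = make([]byte, len(tdata)+tbssSize+8)
	copy(block, tdata) // the .tbss region stays zeroed by make
	off := len(tdata) + tbssSize
	fsbase = base + uint64(off)
	binary.LittleEndian.PutUint64(block[off:], fsbase)
	return block, fsbase
}

func main() {
	block, fsbase := buildTLS(0x1000, []byte{1, 2, 3, 4}, 12)
	fmt.Printf("block size=%d FSBASE=%#x\n", len(block), fsbase)
}
```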

Initial registers

The initial register values differ between Linux and FreeBSD, so LSF modifies them using PTRACE_SETREGS.

          RSP              RDI    FSBASE
Linux     stack            -      -
FreeBSD   stack (aligned)  stack  end of TLS

The stack layout is similar. The stack begins with argc, argv, envp, and auxv, but auxv is slightly incompatible across Linux and FreeBSD.

Auxv

FreeBSD processes expect the AT_BASE element in the auxv to be always provided with a non-zero value, but the Linux kernel sets AT_BASE to zero when the ELF interpreter (/libexec/ld-elf.so.1) is executed directly. In such a case, LSF modifies the AT_BASE value on the stack to be the base address parsed from /proc/<PID>/maps.

Also, some of the auxv elements are incompatible and nullified.
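
The AT_BASE fix-up can be sketched as follows. The sketch assumes the usual amd64 auxv encoding, an array of 64-bit (type, value) pairs terminated by AT_NULL, and operates on a copy of the stack bytes. `patchATBase` is a hypothetical helper; in a real tracer the base address would come from /proc/<PID>/maps and the patched bytes would be written back into the tracee.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	atNull = 0 // AT_NULL terminates the vector
	atBase = 7 // AT_BASE holds the interpreter's base address
)

// patchATBase scans auxv, a little-endian array of (type, value) uint64
// pairs, and replaces a zero AT_BASE value with base. It reports whether
// an entry was patched.
func patchATBase(auxv []byte, base uint64) bool {
	for off := 0; off+16 <= len(auxv); off += 16 {
		typ := binary.LittleEndian.Uint64(auxv[off:])
		if typ == atNull {
			return false
		}
		if typ == atBase && binary.LittleEndian.Uint64(auxv[off+8:]) == 0 {
			binary.LittleEndian.PutUint64(auxv[off+8:], base)
			return true
		}
	}
	return false
}

func main() {
	// Two entries: AT_BASE with a zero value, then the AT_NULL terminator.
	auxv := make([]byte, 32)
	binary.LittleEndian.PutUint64(auxv[0:], atBase)
	patched := patchATBase(auxv, 0x7f0000000000)
	fmt.Printf("patched=%v AT_BASE=%#x\n", patched, binary.LittleEndian.Uint64(auxv[8:]))
}
```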

Comparison with similar projects

Non-Linux on Linux

FreeBSD (and others) on Linux:

  • LilyVM was a project in 2003-2013 to run modified NetBSD/FreeBSD/Linux kernels using ptrace, while LSF only emulates syscalls without using the actual guest kernel code. LilyVM also supported NetBSD and FreeBSD hosts, while LSF only supports Linux hosts.

Darwin on Linux:

  • Darling (until 2022) depended on a Linux kernel module, while LSF does not.
  • Darling (since 2022) intercepts dylib calls, while LSF intercepts syscalls.
  • Limbo uses Syscall User Dispatch for trapping syscalls, while LSF uses ptrace.

SunOS and Solaris on Linux:

  • CONFIG_SUNOS_EMUL (for SunOS 4/Solaris 1) and CONFIG_SOLARIS_EMUL (for SunOS 5/Solaris 2) were natively present in the Linux kernel for the SPARC architecture until 2008 (Linux 2.6.26). These were built in the Linux kernel, while LSF works as a user mode process.

System V derivatives on Linux:

  • iBCS2 compatibility layer (c. 1994-1999?) was available for the Linux kernel (1.0-2.2) to support the Intel Binary Compatibility Standard 2 for running binaries of SCO UNIX and System V Release 4 derivatives. This was compiled in the Linux kernel, while LSF works as a user mode process.
  • The Linux A.B.I. patch (aka ibcs-3 later) (2001-2011) was the kernel 2.4/2.6 port of the iBCS2 compatibility layer. This was compiled in the Linux kernel, while LSF works as a user mode process.
  • iBCS64 (2014-2019) was a 64-bit fork of the Linux A.B.I. patch (ibcs-3). This was compiled as a Linux kernel module, while LSF works as a user mode process.
  • ibcs-us (2019-) is a userspace reimplementation of iBCS64. ibcs-us uses SIGSEGV handlers for trapping syscalls, while LSF uses ptrace. Also, ibcs-us needs CAP_SYS_RAWIO while LSF does not.

Windows on Linux:

  • Wine intercepts DLL calls, while LSF intercepts syscalls.

Linux on non-Linux

Linux on FreeBSD:

Linux on Darwin:

  • Noah was a project in 2016-2020 to use VMM (but without using the actual Linux kernel) for trapping syscalls, while LSF uses ptrace.
  • uKontainer uses frankenlibc to intercept libc calls, and uses LKL to execute the Linux kernel in userspace, while LSF uses ptrace to trap syscalls without using the actual guest kernel.
  • Lima runs the actual Linux kernel on VMM, while LSF only emulates syscalls.

Linux on Solaris:

  • Linux Branded Zone was built in the Solaris 10 kernel (removed in Solaris 11), while LSF works as a user mode process.

Linux on System V derivatives:

  • Lxrun (1997-2001) used SIGSEGV handlers for trapping Linux syscalls on SCO OpenServer, UnixWare, and Solaris, while LSF uses ptrace.

Linux on Windows:

  • LINE was a project in 2001 to emulate Linux by trapping syscalls using Win32 debug events or a Windows NT kernel driver int80.sys, while LSF uses ptrace for trapping syscalls. Non-NT mode of LINE was very similar to LSF, although the target operating systems were different.
  • Umlwin32 was a project in 2002 to run the modified Linux kernel using LINE's int80.sys for trapping syscalls, while LSF does not use the actual guest kernel code, and uses ptrace for trapping syscalls.
  • coLinux was a project in 2004-2017 to run the modified Linux kernel as a Windows NT kernel driver, while LSF only emulates syscalls and works as a user mode process.
  • WSL version 1 is built into the Windows kernel, while LSF works as a user mode process.
  • WSL version 2 runs the actual Linux kernel on VMM, while LSF only emulates syscalls.

Misc

  • BSD on Windows (1995-1996) was a product to run 4.4BSD-Lite binaries on Windows 3.1 and 95 (but not on NT). Not much is known about this product today.
