- .. SPDX-License-Identifier: GPL-2.0
- =====================
- Syscall User Dispatch
- =====================
- Background
- ----------
- Compatibility layers like Wine need a way to efficiently emulate the system
- calls of only one part of their process, the part that runs the
- incompatible code, while still executing native syscalls without
- a high performance penalty in the native part of the process. Seccomp
- falls short on this task, since it has limited support for efficiently
- filtering syscalls based on memory regions, and it doesn't support removing
- filters. Therefore a new mechanism is necessary.
- Syscall User Dispatch brings the filtering of the syscall dispatcher
- address back to userspace. The application is in control of a flip
- switch, indicating the current personality of the process. A
- multiple-personality application can then flip the switch without
- invoking the kernel, when crossing the compatibility layer API
- boundaries, to enable/disable the syscall redirection and execute
- syscalls directly (disabled) or send them to be emulated in userspace
- through a SIGSYS.
- The goal of this design is to provide very quick compatibility layer
- boundary crossings, which is achieved by not executing a syscall to change
- personality every time the compatibility layer executes. Instead, a
- userspace memory region exposed to the kernel indicates the current
- personality, and the application simply modifies that variable to
- configure the mechanism.
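
As a rough illustration of the paragraph above, the sketch below keeps the personality in a single byte and flips it with ordinary stores, so crossing the boundary costs no syscall. The helper names (``enter_foreign_code``, ``leave_foreign_code``, ``in_foreign_code``) and the variable ``sud_selector`` are hypothetical, and the fallback constants are assumed to match ``<linux/prctl.h>``. The byte has no effect until its address has been registered with the kernel through the prctl described under Interface.

```c
#include <sys/prctl.h>

/* Fallbacks for older headers; values assumed to match <linux/prctl.h>. */
#ifndef SYSCALL_DISPATCH_FILTER_ALLOW
#define SYSCALL_DISPATCH_FILTER_ALLOW 0
#define SYSCALL_DISPATCH_FILTER_BLOCK 1
#endif

/* Hypothetical per-thread selector byte; the kernel would read it on
 * every syscall once its address has been registered. */
static __thread volatile char sud_selector = SYSCALL_DISPATCH_FILTER_ALLOW;

/* Entering non-native code: syscalls from now on are redirected. */
static inline void enter_foreign_code(void)
{
    sud_selector = SYSCALL_DISPATCH_FILTER_BLOCK;
}

/* Back in native code: syscalls are executed directly again. */
static inline void leave_foreign_code(void)
{
    sud_selector = SYSCALL_DISPATCH_FILTER_ALLOW;
}

/* Reports the current personality: 1 while redirection is requested. */
static inline int in_foreign_code(void)
{
    return sud_selector == SYSCALL_DISPATCH_FILTER_BLOCK;
}
```

A compatibility layer would call ``enter_foreign_code()`` just before jumping into non-native code and ``leave_foreign_code()`` on the way back.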
- There is a relatively high cost associated with handling signals on most
- architectures, like x86, but at least for Wine, syscalls issued by
- native Windows code are currently not known to be a performance problem,
- since they are quite rare, at least for modern gaming applications.
- Since this mechanism is designed to capture syscalls issued by
- non-native applications, it must function on syscalls whose invocation
- ABI is completely unexpected to Linux. Syscall User Dispatch therefore
- doesn't rely on any part of the syscall ABI to do the filtering. It uses
- only the syscall dispatcher address and the userspace key.
- As the ABI of these intercepted syscalls is unknown to Linux, these
- syscalls are not instrumentable via ptrace or the syscall tracepoints.
- Interface
- ---------
- A thread can set up this mechanism on supported kernels by executing the
- following prctl:
- prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])
- <op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable or
- disable the mechanism globally for that thread. When
- PR_SYS_DISPATCH_OFF is used, the other arguments must be zero.
- [<offset>, <offset>+<length>) delimits a memory region
- from which syscalls are always executed directly, regardless of the
- userspace selector. This provides a fast path for the C library, which
- includes the most common syscall dispatchers in the native code
- applications, and also provides a way for the signal handler to return
- without triggering a nested SIGSYS on (rt\_)sigreturn. Users of this
- interface should make sure that at least the signal trampoline code is
- included in this region. In addition, on architectures that implement the
- signal trampoline in the vDSO, that trampoline is never intercepted.
- [selector] is a pointer to a char-sized variable in the process's memory
- that provides a quick way to enable or disable syscall redirection
- thread-wide, without the need to invoke the kernel directly. The selector
- can be set to SYSCALL_DISPATCH_FILTER_ALLOW or SYSCALL_DISPATCH_FILTER_BLOCK.
- Any other value terminates the program with a SIGSYS.
- Additionally, a task's syscall user dispatch configuration can be peeked
- and poked via the PTRACE_(GET|SET)_SYSCALL_USER_DISPATCH_CONFIG ptrace
- requests. This is useful for checkpoint/restart software.
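
A tracer could read a tracee's configuration roughly as sketched below. This is an assumption-laden example rather than a reference: the function ``peek_child_sud_mode`` and the local ``struct sud_config`` are hypothetical names, the structure layout is assumed to mirror ``struct ptrace_sud_config`` in ``<linux/ptrace.h>``, and the fallback request number is assumed to match that header. The structure size is passed in the ptrace ``addr`` argument and the buffer in ``data``.

```c
#include <signal.h>
#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fallback for older headers; value assumed to match <linux/ptrace.h>. */
#ifndef PTRACE_GET_SYSCALL_USER_DISPATCH_CONFIG
#define PTRACE_GET_SYSCALL_USER_DISPATCH_CONFIG 0x4210
#endif

/* Local mirror of struct ptrace_sud_config (assumed layout). */
struct sud_config {
    uint64_t mode;      /* PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF */
    uint64_t selector;  /* address of the selector byte */
    uint64_t offset;    /* start of the always-allowed region */
    uint64_t len;       /* length of the always-allowed region */
};

/* Forks a child, stops it under ptrace and reads its dispatch mode.
 * Returns the mode, or -1 if tracing or the request is unavailable. */
long peek_child_sud_mode(void)
{
    struct sud_config cfg = {0};
    long mode = -1;
    int status;

    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        /* Child: ask to be traced, then stop until the parent is done. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) != 0)
            _exit(1);
        raise(SIGSTOP);
        _exit(0);
    }

    if (waitpid(child, &status, 0) != child)
        return -1;
    if (!WIFSTOPPED(status))
        return -1;              /* child exited early and was reaped */

    /* addr carries the structure size, data the destination buffer. */
    if (ptrace(PTRACE_GET_SYSCALL_USER_DISPATCH_CONFIG, child,
               (void *)sizeof(cfg), &cfg) == 0)
        mode = (long)cfg.mode;

    kill(child, SIGKILL);
    waitpid(child, &status, 0);
    return mode;
}
```

Since the child never enables the mechanism itself, a kernel that implements the request would be expected to report PR_SYS_DISPATCH_OFF (0) for it.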
- Security Notes
- --------------
- Syscall User Dispatch provides functionality for compatibility layers to
- quickly capture system calls issued by a non-native part of the
- application, while not impacting the Linux native regions of the
- process. It is not a mechanism for sandboxing system calls, and it
- should not be seen as a security mechanism, since it is trivial for a
- malicious application to subvert the mechanism by jumping to an allowed
- dispatcher region prior to executing the syscall, or to discover the
- address and modify the selector value. If the use case requires any
- kind of security sandboxing, Seccomp should be used instead.
- Any fork or exec of the existing process resets the mechanism to
- PR_SYS_DISPATCH_OFF.