| 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188 |
- .. SPDX-License-Identifier: GPL-2.0
- =====================
- Introduction of mseal
- =====================
- :Author: Jeff Xu <jeffxu@chromium.org>
- Modern CPUs support memory permissions such as RW and NX bits. The memory
- permission feature improves security stance on memory corruption bugs, i.e.
- the attacker can’t just write to arbitrary memory and point the code to it,
- the memory has to be marked with X bit, or else an exception will happen.
- Memory sealing additionally protects the mapping itself against
- modifications. This is useful to mitigate memory corruption issues where a
- corrupted pointer is passed to a memory management system. For example,
- such an attacker primitive can break control-flow integrity guarantees
- since read-only memory that is supposed to be trusted can become writable
- or .text pages can get remapped. Memory sealing can automatically be
- applied by the runtime loader to seal .text and .rodata pages and
- applications can additionally seal security critical data at runtime.
- A similar feature already exists in the XNU kernel with the
- VM_FLAGS_PERMANENT flag [1] and on OpenBSD with the mimmutable syscall [2].
- SYSCALL
- =======
- mseal syscall signature
- -----------------------
- ``int mseal(void \* addr, size_t len, unsigned long flags)``
- **addr**/**len**: virtual memory address range.
- The address range set by **addr**/**len** must meet:
- - The start address must be in an allocated VMA.
- - The start address must be page aligned.
- - The end address (**addr** + **len**) must be in an allocated VMA.
- - no gap (unallocated memory) between start and end address.
- The ``len`` will be paged aligned implicitly by the kernel.
- **flags**: reserved for future use.
- **Return values**:
- - **0**: Success.
- - **-EINVAL**:
- * Invalid input ``flags``.
- * The start address (``addr``) is not page aligned.
- * Address range (``addr`` + ``len``) overflow.
- - **-ENOMEM**:
- * The start address (``addr``) is not allocated.
- * The end address (``addr`` + ``len``) is not allocated.
- * A gap (unallocated memory) between start and end address.
- - **-EPERM**:
- * sealing is supported only on 64-bit CPUs, 32-bit is not supported.
- **Note about error return**:
- - For above error cases, users can expect the given memory range is
- unmodified, i.e. no partial update.
- - There might be other internal errors/cases not listed here, e.g.
- error during merging/splitting VMAs, or the process reaching the maximum
- number of supported VMAs. In those cases, partial updates to the given
- memory range could happen. However, those cases should be rare.
- **Architecture support**:
- mseal only works on 64-bit CPUs, not 32-bit CPUs.
- **Idempotent**:
- users can call mseal multiple times. mseal on an already sealed memory
- is a no-action (not error).
- **no munseal**
- Once mapping is sealed, it can't be unsealed. The kernel should never
- have munseal, this is consistent with other sealing feature, e.g.
- F_SEAL_SEAL for file.
- Blocked mm syscall for sealed mapping
- -------------------------------------
- It might be important to note: **once the mapping is sealed, it will
- stay in the process's memory until the process terminates**.
- Example::
- *ptr = mmap(0, 4096, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
- rc = mseal(ptr, 4096, 0);
- /* munmap will fail */
- rc = munmap(ptr, 4096);
- assert(rc < 0);
- Blocked mm syscall:
- - munmap
- - mmap
- - mremap
- - mprotect and pkey_mprotect
- - some destructive madvise behaviors: MADV_DONTNEED, MADV_FREE,
- MADV_DONTNEED_LOCKED, MADV_FREE, MADV_DONTFORK, MADV_WIPEONFORK
- The first set of syscalls to block is munmap, mremap, mmap. They can
- either leave an empty space in the address space, therefore allowing
- replacement with a new mapping with new set of attributes, or can
- overwrite the existing mapping with another mapping.
- mprotect and pkey_mprotect are blocked because they changes the
- protection bits (RWX) of the mapping.
- Certain destructive madvise behaviors, specifically MADV_DONTNEED,
- MADV_FREE, MADV_DONTNEED_LOCKED, and MADV_WIPEONFORK, can introduce
- risks when applied to anonymous memory by threads lacking write
- permissions. Consequently, these operations are prohibited under such
- conditions. The aforementioned behaviors have the potential to modify
- region contents by discarding pages, effectively performing a memset(0)
- operation on the anonymous memory.
- Kernel will return -EPERM for blocked syscalls.
- When blocked syscall return -EPERM due to sealing, the memory regions may
- or may not be changed, depends on the syscall being blocked:
- - munmap: munmap is atomic. If one of VMAs in the given range is
- sealed, none of VMAs are updated.
- - mprotect, pkey_mprotect, madvise: partial update might happen, e.g.
- when mprotect over multiple VMAs, mprotect might update the beginning
- VMAs before reaching the sealed VMA and return -EPERM.
- - mmap and mremap: undefined behavior.
- Use cases
- =========
- - glibc:
- The dynamic linker, during loading ELF executables, can apply sealing to
- mapping segments.
- - Chrome browser: protect some security sensitive data structures.
- When not to use mseal
- =====================
- Applications can apply sealing to any virtual memory region from userspace,
- but it is *crucial to thoroughly analyze the mapping's lifetime* prior to
- apply the sealing. This is because the sealed mapping *won’t be unmapped*
- until the process terminates or the exec system call is invoked.
- For example:
- - aio/shm
- aio/shm can call mmap and munmap on behalf of userspace, e.g.
- ksys_shmdt() in shm.c. The lifetimes of those mapping are not tied to
- the lifetime of the process. If those memories are sealed from userspace,
- then munmap will fail, causing leaks in VMA address space during the
- lifetime of the process.
- - ptr allocated by malloc (heap)
- Don't use mseal on the memory ptr return from malloc().
- malloc() is implemented by allocator, e.g. by glibc. Heap manager might
- allocate a ptr from brk or mapping created by mmap.
- If an app calls mseal on a ptr returned from malloc(), this can affect
- the heap manager's ability to manage the mappings; the outcome is
- non-deterministic.
- Example::
- ptr = malloc(size);
- /* don't call mseal on ptr return from malloc. */
- mseal(ptr, size);
- /* free will success, allocator can't shrink heap lower than ptr */
- free(ptr);
- mseal doesn't block
- ===================
- In a nutshell, mseal blocks certain mm syscall from modifying some of VMA's
- attributes, such as protection bits (RWX). Sealed mappings doesn't mean the
- memory is immutable.
- As Jann Horn pointed out in [3], there are still a few ways to write
- to RO memory, which is, in a way, by design. And those could be blocked
- by different security measures.
- Those cases are:
- - Write to read-only memory through /proc/self/mem interface (FOLL_FORCE).
- - Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
- - userfaultfd.
- The idea that inspired this patch comes from Stephen Röttger’s work in V8
- CFI [4]. Chrome browser in ChromeOS will be the first user of this API.
- Reference
- =========
- - [1] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
- - [2] https://man.openbsd.org/mimmutable.2
- - [3] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com
- - [4] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc
|