- .. _userfaultfd:
- ===========
- Userfaultfd
- ===========
- Objective
- =========
- Userfaults allow the implementation of on-demand paging from userland
- and more generally they allow userland to take control of various
- memory page faults, something otherwise only the kernel code could do.
- For example, userfaults allow a proper and more efficient implementation
- of the PROT_NONE+SIGSEGV trick.
- Design
- ======
- Userfaults are delivered and resolved through the userfaultfd syscall.
- The userfaultfd (aside from registering and unregistering virtual
- memory ranges) provides two primary functionalities:
- 1) read/POLLIN protocol to notify a userland thread of the faults
- happening
- 2) various UFFDIO_* ioctls that manage the virtual memory regions
- registered in the userfaultfd and that allow userland to efficiently
- resolve the userfaults it receives via 1) or to manage the virtual
- memory in the background
- The real advantage of userfaults compared to regular virtual memory
- management with mremap/mprotect is that userfaults never involve
- heavyweight structures like vmas in any of their operations (in fact the
- userfaultfd runtime load never takes the mmap_sem for writing).
- Vmas are not suitable for page- (or hugepage) granular fault tracking
- when dealing with virtual address spaces that could span
- Terabytes. Too many vmas would be needed for that.
- The userfaultfd, once opened by invoking the syscall, can also be
- passed using unix domain sockets to a manager process, so the same
- manager process could handle the userfaults of a multitude of
- different processes without them being aware of what is going on
- (unless, of course, they later try to use the userfaultfd
- themselves on the same region the manager is already tracking, which
- is a corner case that would currently return -EBUSY).
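Handing a descriptor to a manager over a unix domain socket is normally done with an SCM_RIGHTS control message. The following is a minimal sketch of that mechanism, not code taken from any existing manager; the helper names send_fd() and recv_fd() are hypothetical, and nothing in it is specific to userfaultfd (any descriptor can be passed this way).

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_ctrl {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send one descriptor over a connected AF_UNIX socket (hypothetical helper).
 * Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
    char dummy = 'F';                 /* one byte of ordinary payload */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor (hypothetical helper).
 * Returns the new descriptor in the receiving process, or -1. */
int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The monitored process would call send_fd() with its userfaultfd once, after which the manager can poll and resolve faults through the descriptor returned by recv_fd().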
- API
- ===
- When first opened the userfaultfd must be enabled invoking the
- UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or
- a later API version) which will specify the read/POLLIN protocol
- userland intends to speak on the UFFD and the uffdio_api.features
- userland requires. If successful (i.e. if the requested uffdio_api.api
- is also spoken by the running kernel and the requested features can be
- enabled), the UFFDIO_API ioctl returns two 64-bit bitmasks in
- uffdio_api.features and uffdio_api.ioctls: respectively, all the
- available features of the read(2) protocol and the generic ioctls
- available.
- The uffdio_api.features bitmask returned by the UFFDIO_API ioctl
- defines what memory types are supported by the userfaultfd and what
- events, except page fault notifications, may be generated.
- If the kernel supports registering userfaultfd ranges on hugetlbfs
- virtual memory areas, UFFD_FEATURE_MISSING_HUGETLBFS will be set in
- uffdio_api.features. Similarly, UFFD_FEATURE_MISSING_SHMEM will be
- set if the kernel supports registering userfaultfd ranges on shared
- memory (covering all shmem APIs, i.e. tmpfs, IPCSHM, /dev/zero
- MAP_SHARED, memfd_create, etc).
- A userland application that wants to use userfaultfd with hugetlbfs
- or shared memory needs to set the corresponding flag in
- uffdio_api.features to enable those features.
- If userland wants to receive notifications for events other than
- page faults, it has to verify that uffdio_api.features has the appropriate
- UFFD_FEATURE_EVENT_* bits set. These events are described in more
- detail below in the "Non-cooperative userfaultfd" section.
- Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
- be invoked (if present in the returned uffdio_api.ioctls bitmask) to
- register a memory range in the userfaultfd by setting the
- uffdio_register structure accordingly. The uffdio_register.mode
- bitmask will specify to the kernel which kind of faults to track for
- the range (UFFDIO_REGISTER_MODE_MISSING would track missing
- pages). The UFFDIO_REGISTER ioctl will return the
- uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
- userfaults on the range registered. Not all ioctls will necessarily be
- supported for all memory types depending on the underlying virtual
- memory backend (anonymous memory vs tmpfs vs real filebacked
- mappings).
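Registering a range for missing-page tracking might look like the sketch below. The helper names are hypothetical; addr and len are assumed to be page aligned and to lie inside an existing mapping, which the helper does not verify.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Register [addr, addr + len) for missing-page faults (hypothetical helper).
 * On success stores the per-range ioctl bitmask in *ioctls and returns 0;
 * returns -1 with errno set otherwise. */
int uffd_register_missing(int uffd, void *addr, size_t len, uint64_t *ioctls)
{
    struct uffdio_register reg = {
        .range = { .start = (uint64_t)(uintptr_t)addr, .len = len },
        .mode = UFFDIO_REGISTER_MODE_MISSING,
    };

    if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0)
        return -1;

    *ioctls = reg.ioctls;
    return 0;
}

/* Does the bitmask returned for the range allow UFFDIO_COPY? */
int uffd_supports_copy(uint64_t ioctls)
{
    return (int)((ioctls >> _UFFDIO_COPY) & 1);
}
```

Checking the returned bitmask before relying on a particular ioctl matters because, as noted above, the set of usable ioctls depends on the memory type backing the range.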
- Userland can use the uffdio_register.ioctls to manage the virtual
- address space in the background (to add or potentially also remove
- memory from the userfaultfd registered range). This means a userfault
- could trigger just before userland maps the faulted page in the
- background.
- The primary ioctl to resolve userfaults is UFFDIO_COPY. It
- atomically copies a page into the userfault registered range and wakes
- up the blocked userfaults (unless uffdio_copy.mode &
- UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctls work similarly to
- UFFDIO_COPY. They're atomic in the sense that nothing can ever see a
- half-copied page, since the range keeps userfaulting until the copy has
- finished.
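A minimal sketch of resolving one fault with UFFDIO_COPY follows. The helper names are hypothetical, page_size is assumed to be a power of two matching the page size of the registered range, and src is assumed to point to a full page of data prepared by the caller.

```c
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Round an address down to the start of its page.
 * page_size must be a power of two. */
uint64_t page_base(uint64_t addr, uint64_t page_size)
{
    return addr & ~(page_size - 1);
}

/* Copy one page from 'src' into the page containing 'fault_addr'
 * (hypothetical helper). If 'wake' is zero the blocked faulting threads
 * are left asleep (UFFDIO_COPY_MODE_DONTWAKE) and must be woken by the
 * caller later. Returns 0 on success, -1 with errno set otherwise. */
int uffd_resolve(int uffd, uint64_t fault_addr, const void *src,
                 uint64_t page_size, int wake)
{
    struct uffdio_copy copy = {
        .dst = page_base(fault_addr, page_size),
        .src = (uint64_t)(uintptr_t)src,
        .len = page_size,
        .mode = wake ? 0 : UFFDIO_COPY_MODE_DONTWAKE,
    };

    return ioctl(uffd, UFFDIO_COPY, &copy) < 0 ? -1 : 0;
}
```

The destination is rounded down because the fault address reported to userland can point anywhere inside the page, while the copy operates on whole pages.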
- QEMU/KVM
- ========
- QEMU/KVM is using the userfaultfd syscall to implement postcopy live
- migration. Postcopy live migration is one form of memory
- externalization consisting of a virtual machine running with part or
- all of its memory residing on a different node in the cloud. The
- userfaultfd abstraction is generic enough that not a single line of
- KVM kernel code had to be modified in order to add postcopy live
- migration to QEMU.
- Guest async page faults, FOLL_NOWAIT and all other GUP features work
- just fine in combination with userfaults. Userfaults trigger async
- page faults in the guest scheduler so those guest processes that
- aren't waiting for userfaults (e.g. network-bound ones) can keep
- running in the guest vcpus.
- It is generally beneficial to run one pass of precopy live migration
- just before starting postcopy live migration, in order to avoid
- generating userfaults for readonly guest regions.
- The implementation of postcopy live migration currently uses one
- single bidirectional socket but in the future two different sockets
- will be used (to reduce the latency of the userfaults to the minimum
- possible without having to decrease /proc/sys/net/ipv4/tcp_wmem).
- The QEMU in the source node writes all pages that it knows are missing
- in the destination node, into the socket, and the migration thread of
- the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE
- ioctls on the userfaultfd in order to map the received pages into the
- guest (UFFDIO_ZEROPAGE is used if the source page was a zero page).
- A different postcopy thread in the destination node listens with
- poll() to the userfaultfd in parallel. When a POLLIN event is
- generated after a userfault triggers, the postcopy thread calls read() on
- the userfaultfd and receives the fault address (or -EAGAIN in case the
- userfault was already resolved and woken by a UFFDIO_COPY|ZEROPAGE run
- by the parallel QEMU migration thread).
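The poll()/read() step of such a thread can be sketched as below. This is not QEMU's code; uffd_next_fault() is a hypothetical helper that assumes the descriptor was opened with O_NONBLOCK, so that read() reports EAGAIN instead of blocking when the fault has already been resolved.

```c
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Wait up to timeout_ms for a page-fault message (hypothetical helper).
 * Returns 1 and stores the faulting address in *addr when a fault was read,
 * 0 when there is nothing to handle (timeout, EAGAIN, or a non-fault event),
 * and -1 on error. */
int uffd_next_fault(int uffd, int timeout_ms, uint64_t *addr)
{
    struct pollfd pfd = { .fd = uffd, .events = POLLIN };
    struct uffd_msg msg;
    ssize_t n;
    int ready = poll(&pfd, 1, timeout_ms);

    if (ready < 0)
        return -1;
    if (ready == 0)
        return 0;                       /* timed out, no fault pending */
    if (!(pfd.revents & POLLIN))
        return -1;                      /* POLLERR, POLLHUP or similar */

    n = read(uffd, &msg, sizeof(msg));
    if (n < 0)
        return errno == EAGAIN ? 0 : -1;  /* fault was already resolved */
    if (n != (ssize_t)sizeof(msg))
        return -1;                      /* short read, unexpected */
    if (msg.event != UFFD_EVENT_PAGEFAULT)
        return 0;                       /* other events are not handled here */

    *addr = msg.arg.pagefault.address;
    return 1;
}
```

In a postcopy-style design, a return value of 1 is the point at which the thread would forward the address to the source side as an urgent page request.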
- After the QEMU postcopy thread (running in the destination node) gets
- the userfault address it writes the information about the missing page
- into the socket. The QEMU source node receives the information and
- roughly "seeks" to that page address and continues sending all
- remaining missing pages from that new page offset. Soon after that
- (just the time to flush the tcp_wmem queue through the network) the
- migration thread in the QEMU running in the destination node will
- receive the page that triggered the userfault and it'll map it as
- usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it
- was spontaneously sent by the source or if it was an urgent page
- requested through a userfault).
- By the time the userfaults start, the QEMU in the destination node
- doesn't need to keep any per-page state bitmap relative to the live
- migration around and a single per-page bitmap has to be maintained in
- the QEMU running in the source node to know which pages are still
- missing in the destination node. The bitmap in the source node is
- checked to find which missing pages to send in round robin and we seek
- over it when receiving incoming userfaults. After sending each page of
- course the bitmap is updated accordingly. It's also useful to avoid
- sending the same page twice (in case the userfault is read by the
- postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration
- thread).
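The source-side bookkeeping described above (a round-robin scan over the missing pages, with a "seek" when a userfault arrives) can be illustrated with a small self-contained model. It is only a sketch of the idea: struct send_state, seek_to() and next_page() are hypothetical names, and QEMU's real data structures are different.

```c
#include <stdbool.h>
#include <stddef.h>

/* Per-page state kept on the source node (hypothetical model). */
struct send_state {
    bool *missing;   /* missing[i] is true while page i is still unsent */
    size_t npages;
    size_t cursor;   /* where the background scan resumes */
};

/* An incoming userfault asks for 'page': continue sending from there. */
void seek_to(struct send_state *s, size_t page)
{
    if (page < s->npages)
        s->cursor = page;
}

/* Pick the next missing page at or after the cursor, wrapping around.
 * The page is marked as sent so it is never transmitted twice.
 * Returns false once no page is missing any more. */
bool next_page(struct send_state *s, size_t *page)
{
    for (size_t i = 0; i < s->npages; i++) {
        size_t p = (s->cursor + i) % s->npages;

        if (s->missing[p]) {
            s->missing[p] = false;
            s->cursor = (p + 1) % s->npages;
            *page = p;
            return true;
        }
    }
    return false;
}
```

With four missing pages, sending page 0, then seeking to page 2 because of a userfault, yields the order 0, 2, 3, 1. A later duplicate request for page 2 finds its bit already cleared, so nothing is resent.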
- Non-cooperative userfaultfd
- ===========================
- When the userfaultfd is monitored by an external manager, the manager
- must be able to track changes in the process virtual memory
- layout. Userfaultfd can notify the manager about such changes using
- the same read(2) protocol as for the page fault notifications. The
- manager has to explicitly enable these events by setting appropriate
- bits in uffdio_api.features passed to UFFDIO_API ioctl:
- UFFD_FEATURE_EVENT_FORK
- enable userfaultfd hooks for fork(). When this feature is
- enabled, the userfaultfd context of the parent process is
- duplicated into the newly created process. The manager
- receives UFFD_EVENT_FORK with file descriptor of the new
- userfaultfd context in the uffd_msg.fork.
- UFFD_FEATURE_EVENT_REMAP
- enable notifications about mremap() calls. When the
- non-cooperative process moves a virtual memory area to a
- different location, the manager will receive
- UFFD_EVENT_REMAP. The uffd_msg.remap will contain the old and
- new addresses of the area and its original length.
- UFFD_FEATURE_EVENT_REMOVE
- enable notifications about madvise(MADV_REMOVE) and
- madvise(MADV_DONTNEED) calls. The event UFFD_EVENT_REMOVE will
- be generated upon these calls to madvise. The uffd_msg.remove
- will contain start and end addresses of the removed area.
- UFFD_FEATURE_EVENT_UNMAP
- enable notifications about memory unmapping. The manager will
- get UFFD_EVENT_UNMAP with uffd_msg.remove containing start and
- end addresses of the unmapped area.
- Although UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP
- are quite similar, they differ considerably in the action expected from
- the userfaultfd manager. In the former case, the virtual memory is
- removed but the area is not: the area remains monitored by the
- userfaultfd, and if a page fault occurs in that area it will be
- delivered to the manager. The proper resolution for such a page fault is
- to zeromap the faulting address. In the latter case, however, when an
- area is unmapped, either explicitly (with the munmap() system call) or
- implicitly (e.g. during mremap()), the area is removed and in turn the
- userfaultfd context for that area disappears too, so the manager will
- not get further userland page faults from the removed area. Still, the
- notification is required in order to prevent the manager from using
- UFFDIO_COPY on the unmapped area.
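How a manager might map each message type to an action can be sketched as a simple dispatch function. The enum and its names are hypothetical and only summarise the behaviour described above; a real manager would also read the event payload mentioned in the comments.

```c
#include <linux/userfaultfd.h>

/* What the manager should do next (hypothetical classification). */
enum manager_action {
    ACT_RESOLVE_FAULT,   /* page fault: supply the page */
    ACT_TRACK_CHILD,     /* fork: start monitoring msg->arg.fork.ufd */
    ACT_MOVE_RANGE,      /* mremap: arg.remap.from/to/len describe the move */
    ACT_ZERO_ON_FAULT,   /* remove: range stays monitored, later faults
                          * in arg.remove.start..end get zero pages */
    ACT_FORGET_RANGE,    /* unmap: stop tracking arg.remove.start..end and
                          * never UFFDIO_COPY into it again */
    ACT_IGNORE,          /* unknown event */
};

enum manager_action classify_event(const struct uffd_msg *msg)
{
    switch (msg->event) {
    case UFFD_EVENT_PAGEFAULT:
        return ACT_RESOLVE_FAULT;
    case UFFD_EVENT_FORK:
        return ACT_TRACK_CHILD;
    case UFFD_EVENT_REMAP:
        return ACT_MOVE_RANGE;
    case UFFD_EVENT_REMOVE:
        return ACT_ZERO_ON_FAULT;
    case UFFD_EVENT_UNMAP:
        return ACT_FORGET_RANGE;
    default:
        return ACT_IGNORE;
    }
}
```

Keeping REMOVE and UNMAP as separate actions mirrors the distinction drawn above: after REMOVE the range is still registered, after UNMAP it is gone.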
- Unlike userland page faults, which have to be synchronous and require
- an explicit or implicit wakeup, all the events are delivered
- asynchronously and the non-cooperative process resumes execution as
- soon as the manager executes read(). The userfaultfd manager should
- carefully synchronize calls to UFFDIO_COPY with the event
- processing. To aid the synchronization, the UFFDIO_COPY ioctl will
- return -ENOSPC when the monitored process exits at the time of
- UFFDIO_COPY, and -ENOENT when the non-cooperative process has changed
- its virtual memory layout simultaneously with an outstanding UFFDIO_COPY
- operation.
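A manager could fold these two error codes into its retry logic roughly as follows. This is a sketch with hypothetical names; it assumes the usual ioctl() convention in which the call returns -1 and the code listed above arrives in errno, and it takes the two error values directly from the paragraph above.

```c
#include <errno.h>

/* Outcome of a UFFDIO_COPY attempt (hypothetical classification). */
enum copy_outcome {
    COPY_OK,              /* page installed */
    COPY_TARGET_EXITED,   /* monitored process is gone: drop its state */
    COPY_LAYOUT_CHANGED,  /* layout changed: process pending events, retry */
    COPY_FAILED,          /* any other error */
};

/* 'ret' is the ioctl() return value, 'err' the errno observed after it. */
enum copy_outcome classify_copy(int ret, int err)
{
    if (ret == 0)
        return COPY_OK;

    switch (err) {
    case ENOSPC:
        return COPY_TARGET_EXITED;
    case ENOENT:
        return COPY_LAYOUT_CHANGED;
    default:
        return COPY_FAILED;
    }
}
```

On COPY_LAYOUT_CHANGED a single-threaded manager would typically drain the queued events first, update its view of the address space, and only then decide whether the copy is still needed.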
- The current asynchronous model of event delivery is optimal for
- single-threaded non-cooperative userfaultfd manager implementations. A
- synchronous event delivery model can be added later as a new
- userfaultfd feature to facilitate multithreading enhancements of the
- non-cooperative manager, for example to allow UFFDIO_COPY ioctls to
- run in parallel with event reception. Single-threaded
- implementations should continue to use the current async event
- delivery model instead.
|