- =================
- KVM VCPU Requests
- =================
- Overview
- ========
- KVM supports an internal API enabling threads to request a VCPU thread to
- perform some activity. For example, a thread may request a VCPU to flush
- its TLB with a VCPU request. The API consists of the following functions::
- /* Check if any requests are pending for VCPU @vcpu. */
- bool kvm_request_pending(struct kvm_vcpu *vcpu);
- /* Check if VCPU @vcpu has request @req pending. */
- bool kvm_test_request(int req, struct kvm_vcpu *vcpu);
- /* Clear request @req for VCPU @vcpu. */
- void kvm_clear_request(int req, struct kvm_vcpu *vcpu);
- /*
- * Check if VCPU @vcpu has request @req pending. When the request is
- * pending it will be cleared and a memory barrier, which pairs with
- * another in kvm_make_request(), will be issued.
- */
- bool kvm_check_request(int req, struct kvm_vcpu *vcpu);
- /*
- * Make request @req of VCPU @vcpu. Issues a memory barrier, which pairs
- * with another in kvm_check_request(), prior to setting the request.
- */
- void kvm_make_request(int req, struct kvm_vcpu *vcpu);
- /* Make request @req of all VCPUs of the VM with struct kvm @kvm. */
- bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req);
- Typically a requester wants the VCPU to perform the activity as soon
- as possible after making the request. This means most requests
- (kvm_make_request() calls) are followed by a call to kvm_vcpu_kick(),
- and kvm_make_all_cpus_request() has the kicking of all VCPUs built
- into it.
- VCPU Kicks
- ----------
- The goal of a VCPU kick is to bring a VCPU thread out of guest mode in
- order to perform some KVM maintenance. To do so, an IPI is sent, forcing
- a guest mode exit. However, a VCPU thread may not be in guest mode at the
- time of the kick. Therefore, depending on the mode and state of the VCPU
- thread, there are two other actions a kick may take. All three actions
- are listed below:
- 1) Send an IPI. This forces a guest mode exit.
- 2) Wake a sleeping VCPU. Sleeping VCPUs are VCPU threads outside guest
- mode that wait on waitqueues. Waking them removes the threads from
- the waitqueues, allowing the threads to run again. This behavior
- may be suppressed; see KVM_REQUEST_NO_WAKEUP below.
- 3) Nothing. When the VCPU is not in guest mode and the VCPU thread is not
- sleeping, then there is nothing to do.
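The three outcomes above can be modeled as a small decision function. This is a userspace sketch only: the ``kick_model_*`` names and the boolean inputs are hypothetical and do not correspond to KVM's actual kick implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the three kick outcomes listed above. */
enum kick_model_result {
	KICK_MODEL_NOTHING,	/* 3) nothing to do */
	KICK_MODEL_WAKE,	/* 2) wake a sleeping VCPU */
	KICK_MODEL_IPI,		/* 1) send an IPI */
};

/*
 * Pick the action a kick would take. @no_wakeup models a request
 * carrying KVM_REQUEST_NO_WAKEUP, which suppresses the wakeup.
 */
static enum kick_model_result
kick_model_action(bool in_guest_mode, bool sleeping, bool no_wakeup)
{
	/* A VCPU thread cannot be in guest mode and asleep at once. */
	assert(!(in_guest_mode && sleeping));

	if (in_guest_mode)
		return KICK_MODEL_IPI;
	if (sleeping && !no_wakeup)
		return KICK_MODEL_WAKE;
	return KICK_MODEL_NOTHING;
}
```

In this model a sleeping VCPU that is targeted by a no-wakeup request falls through to "nothing", matching the description of KVM_REQUEST_NO_WAKEUP.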
- VCPU Mode
- ---------
- VCPUs have a mode state, ``vcpu->mode``, that is used to track whether the
- VCPU is running in guest mode or not, as well as some specific
- outside guest mode states. The architecture may use ``vcpu->mode`` to
- ensure VCPU requests are seen by VCPUs (see "Ensuring Requests Are Seen"),
- as well as to avoid sending unnecessary IPIs (see "IPI Reduction"), and
- even to ensure IPI acknowledgements are waited upon (see "Waiting for
- Acknowledgements"). The following modes are defined:
- OUTSIDE_GUEST_MODE
- The VCPU thread is outside guest mode.
- IN_GUEST_MODE
- The VCPU thread is in guest mode.
- EXITING_GUEST_MODE
- The VCPU thread is transitioning from IN_GUEST_MODE to
- OUTSIDE_GUEST_MODE.
- READING_SHADOW_PAGE_TABLES
- The VCPU thread is outside guest mode, but it wants the sender of
- certain VCPU requests, namely KVM_REQ_TLB_FLUSH, to wait until the VCPU
- thread is done reading the page tables.
- VCPU Request Internals
- ======================
- VCPU requests are simply bit indices of the ``vcpu->requests`` bitmap.
- This means general bitops, like those documented in [atomic-ops]_, could
- also be used, e.g. ::
- clear_bit(KVM_REQ_UNHALT & KVM_REQUEST_MASK, &vcpu->requests);
- However, VCPU request users should refrain from doing so, as it would
- break the abstraction. The first 8 bits are reserved for architecture
- independent requests; all additional bits are available for architecture
- dependent requests.
- Architecture Independent Requests
- ---------------------------------
- KVM_REQ_TLB_FLUSH
- KVM's common MMU notifier may need to flush all of a guest's TLB
- entries, calling kvm_flush_remote_tlbs() to do so. Architectures that
- choose to use the common kvm_flush_remote_tlbs() implementation will
- need to handle this VCPU request.
- KVM_REQ_MMU_RELOAD
- When shadow page tables are used and memory slots are removed it's
- necessary to inform each VCPU to completely refresh the tables. This
- request is used for that.
- KVM_REQ_PENDING_TIMER
- This request may be made from a timer handler run on the host on behalf
- of a VCPU. It informs the VCPU thread to inject a timer interrupt.
- KVM_REQ_UNHALT
- This request may be made from the KVM common function kvm_vcpu_block(),
- which is used to emulate an instruction that causes a CPU to halt until
- one of an architecture-specific set of events and/or interrupts is
- received (determined by checking kvm_arch_vcpu_runnable()). When that
- event or interrupt arrives kvm_vcpu_block() makes the request. This is
- in contrast to when kvm_vcpu_block() returns due to any other reason,
- such as a pending signal, which does not indicate the VCPU's halt
- emulation should stop, and therefore does not make the request.
- KVM_REQUEST_MASK
- ----------------
- VCPU requests should be masked by KVM_REQUEST_MASK before using them with
- bitops. This is because only the lower 8 bits are used to represent the
- request's number. The upper bits are used as flags. Currently only two
- flags are defined.
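As a rough illustration of the masking described above, the following userspace sketch splits a request value into its number and its flags. The ``MODEL_*`` names and the flag bit positions are assumptions made for the example; consult the KVM headers for the real definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed layout for this sketch: the low 8 bits hold the request number. */
#define MODEL_REQUEST_MASK	0xffu
#define MODEL_REQUEST_NO_WAKEUP	(1u << 8)	/* hypothetical flag position */
#define MODEL_REQUEST_WAIT	(1u << 9)	/* hypothetical flag position */

/* Combine a request number with zero or more flags. */
static inline unsigned int model_request(unsigned int nr, unsigned int flags)
{
	assert(nr <= MODEL_REQUEST_MASK);		/* number must fit in 8 bits */
	assert((flags & MODEL_REQUEST_MASK) == 0);	/* flags live above the mask */
	return nr | flags;
}

/* Strip the flags, leaving the bit index to use with bitops. */
static inline unsigned int model_request_nr(unsigned int req)
{
	return req & MODEL_REQUEST_MASK;
}

/* Return true when every bit of @flag is set in @req. */
static inline bool model_request_has_flag(unsigned int req, unsigned int flag)
{
	return (req & flag) == flag;
}
```

Passing the unmasked value to a bitop would index a bit far outside the intended range, which is why the mask is applied first.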
- VCPU Request Flags
- ------------------
- KVM_REQUEST_NO_WAKEUP
- This flag is applied to requests that only need immediate attention
- from VCPUs running in guest mode. That is, sleeping VCPUs do not need
- to be awakened for these requests. Sleeping VCPUs will handle the
- requests when they are awakened later for some other reason.
- KVM_REQUEST_WAIT
- When requests with this flag are made with kvm_make_all_cpus_request(),
- then the caller will wait for each VCPU to acknowledge its IPI before
- proceeding. This flag only applies to VCPUs that would receive IPIs.
- If, for example, the VCPU is sleeping, so no IPI is necessary, then
- the requesting thread does not wait. This means that this flag may be
- safely combined with KVM_REQUEST_NO_WAKEUP. See "Waiting for
- Acknowledgements" for more information about requests with
- KVM_REQUEST_WAIT.
- VCPU Requests with Associated State
- ===================================
- Requesters that want the receiving VCPU to handle new state need to ensure
- the newly written state is observable to the receiving VCPU thread's CPU
- by the time it observes the request. This means a write memory barrier
- must be inserted after writing the new state and before setting the VCPU
- request bit. Additionally, on the receiving VCPU thread's side, a
- corresponding read barrier must be inserted after reading the request bit
- and before proceeding to read the new state associated with it. See
- scenario 3, Message and Flag, of [lwn-mb]_ and the kernel documentation
- [memory-barriers]_.
- The pair of functions, kvm_check_request() and kvm_make_request(), provide
- the memory barriers, allowing this requirement to be handled internally by
- the API.
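The message-and-flag ordering described above can be sketched in userspace with C11 atomics. This is a model under stated assumptions, not the kernel implementation: ``struct msg_vcpu`` and the ``msg_*`` functions are hypothetical, and the release and acquire fences stand in for the kernel's write and read barriers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a VCPU: a request bitmap plus some payload. */
struct msg_vcpu {
	atomic_ulong requests;
	int new_state;		/* state associated with a request */
};

/* Requester side: the caller has already written vcpu->new_state. */
static void msg_make_request(unsigned int nr, struct msg_vcpu *vcpu)
{
	assert(nr < 8 * sizeof(unsigned long));
	/* Order the payload write before the request bit becomes visible. */
	atomic_thread_fence(memory_order_release);
	atomic_fetch_or_explicit(&vcpu->requests, 1UL << nr,
				 memory_order_relaxed);
}

/* Receiver side: test and clear the bit, then fence before the payload. */
static bool msg_check_request(unsigned int nr, struct msg_vcpu *vcpu)
{
	unsigned long bit = 1UL << nr;

	if (!(atomic_load_explicit(&vcpu->requests, memory_order_relaxed) & bit))
		return false;
	atomic_fetch_and_explicit(&vcpu->requests, ~bit, memory_order_relaxed);
	/* Order the request-bit read before any later payload read. */
	atomic_thread_fence(memory_order_acquire);
	return true;
}
```

A receiver that sees ``msg_check_request()`` return true may then read ``new_state`` and expect to observe the value written before the matching ``msg_make_request()`` call.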
- Ensuring Requests Are Seen
- ==========================
- When making requests to VCPUs, we want to avoid the receiving VCPU
- executing in guest mode for an arbitrarily long time without handling the
- request. We can be sure this won't happen as long as we ensure the VCPU
- thread checks kvm_request_pending() before entering guest mode and that a
- kick will send an IPI to force an exit from guest mode when necessary.
- Extra care must be taken to cover the period after the VCPU thread's last
- kvm_request_pending() check and before it has entered guest mode, as kick
- IPIs will only trigger guest mode exits for VCPU threads that are in guest
- mode or at least have already disabled interrupts in order to prepare to
- enter guest mode. This means that an optimized implementation (see "IPI
- Reduction") must be certain when it's safe to not send the IPI. One
- solution, which all architectures except s390 apply, is to:
- - set ``vcpu->mode`` to IN_GUEST_MODE between disabling the interrupts and
- the last kvm_request_pending() check;
- - enable interrupts atomically when entering the guest.
- This solution also requires memory barriers to be placed carefully in both
- the requesting thread and the receiving VCPU. With the memory barriers we
- can exclude the possibility of a VCPU thread observing
- !kvm_request_pending() on its last check and then not receiving an IPI for
- the next request made of it, even if the request is made immediately after
- the check. This is done by way of the Dekker memory barrier pattern
- (scenario 10 of [lwn-mb]_). As the Dekker pattern requires two variables,
- this solution pairs ``vcpu->mode`` with ``vcpu->requests``. Substituting
- them into the pattern gives::
- CPU1                                    CPU2
- =================                       =================
- local_irq_disable();
- WRITE_ONCE(vcpu->mode, IN_GUEST_MODE);  kvm_make_request(REQ, vcpu);
- smp_mb();                               smp_mb();
- if (kvm_request_pending(vcpu)) {        if (READ_ONCE(vcpu->mode) ==
-                                             IN_GUEST_MODE) {
-     ...abort guest entry...                 ...send IPI...
- }                                       }
- As stated above, the IPI is only useful for VCPU threads in guest mode or
- that have already disabled interrupts. This is why this specific case of
- the Dekker pattern has been extended to disable interrupts before setting
- ``vcpu->mode`` to IN_GUEST_MODE. WRITE_ONCE() and READ_ONCE() are used to
- pedantically implement the memory barrier pattern, guaranteeing the
- compiler doesn't interfere with ``vcpu->mode``'s carefully planned
- accesses.
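The two sides of the pattern can also be written as a pair of userspace functions using C11 atomics. This is a sketch under assumptions: the ``dekker_*`` names are hypothetical, a sequentially consistent fence stands in for smp_mb(), and interrupt disabling is only noted in a comment because it has no userspace equivalent.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum { DEKKER_OUTSIDE_GUEST_MODE, DEKKER_IN_GUEST_MODE };

/* Hypothetical stand-in for the two variables the pattern pairs. */
struct dekker_vcpu {
	atomic_int mode;
	atomic_ulong requests;
};

/* VCPU side (CPU1). Returns true if guest entry may proceed. */
static bool dekker_vcpu_try_enter(struct dekker_vcpu *vcpu)
{
	/* The kernel disables interrupts before this point. */
	atomic_store_explicit(&vcpu->mode, DEKKER_IN_GUEST_MODE,
			      memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* models smp_mb() */
	if (atomic_load_explicit(&vcpu->requests, memory_order_relaxed)) {
		/* A request is pending: abort the guest entry. */
		atomic_store_explicit(&vcpu->mode, DEKKER_OUTSIDE_GUEST_MODE,
				      memory_order_relaxed);
		return false;
	}
	return true;
}

/* Requester side (CPU2). Returns true if an IPI has to be sent. */
static bool dekker_request_and_kick(struct dekker_vcpu *vcpu, unsigned int nr)
{
	assert(nr < 8 * sizeof(unsigned long));
	atomic_fetch_or_explicit(&vcpu->requests, 1UL << nr,
				 memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* models smp_mb() */
	return atomic_load_explicit(&vcpu->mode, memory_order_relaxed) ==
	       DEKKER_IN_GUEST_MODE;
}
```

With the two fences in place, at least one side observes the other's write: either the VCPU sees the pending request and aborts its entry, or the requester sees IN_GUEST_MODE and sends the IPI.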
- IPI Reduction
- -------------
- As only one IPI is needed to get a VCPU to check for any/all requests,
- IPIs may be coalesced. This is easily done by having the first
- IPI-sending kick also change the VCPU mode to something other than
- IN_GUEST_MODE. The transitional state, EXITING_GUEST_MODE, is used for
- this purpose.
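One way to picture the coalescing is an atomic compare-and-exchange on the mode, so that only the kicker that actually moves the VCPU out of IN_GUEST_MODE sends an IPI. The sketch below is a userspace model with hypothetical ``coal_*`` names; it is not a copy of the kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum {
	COAL_OUTSIDE_GUEST_MODE,
	COAL_IN_GUEST_MODE,
	COAL_EXITING_GUEST_MODE,
};

/*
 * Try to move @mode from IN_GUEST_MODE to EXITING_GUEST_MODE.
 * Returns true only for the caller that performed the transition,
 * i.e. the one kicker that needs to send the IPI.
 */
static bool coal_kick_needs_ipi(atomic_int *mode)
{
	int expected = COAL_IN_GUEST_MODE;

	assert(mode != NULL);
	return atomic_compare_exchange_strong(mode, &expected,
					      COAL_EXITING_GUEST_MODE);
}
```

A second kick arriving while the mode is already EXITING_GUEST_MODE fails the exchange and therefore sends nothing.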
- Waiting for Acknowledgements
- ----------------------------
- Some requests, those with the KVM_REQUEST_WAIT flag set, require IPIs to
- be sent, and the acknowledgements to be waited upon, even when the target
- VCPU threads are in modes other than IN_GUEST_MODE. For example, one case
- is when a target VCPU thread is in READING_SHADOW_PAGE_TABLES mode, which
- is set after disabling interrupts. To support these cases, the
- KVM_REQUEST_WAIT flag changes the condition for sending an IPI from
- checking that the VCPU is IN_GUEST_MODE to checking that it is not
- OUTSIDE_GUEST_MODE.
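The change in the IPI condition can be expressed as a small predicate. This is an illustrative userspace sketch; the ``wait_model_*`` names are hypothetical, and the enum merely mirrors the modes listed in "VCPU Mode".

```c
#include <assert.h>
#include <stdbool.h>

enum wait_model_mode {
	WAIT_MODEL_OUTSIDE_GUEST_MODE,
	WAIT_MODEL_IN_GUEST_MODE,
	WAIT_MODEL_EXITING_GUEST_MODE,
	WAIT_MODEL_READING_SHADOW_PAGE_TABLES,
};

/*
 * Decide whether a kick should send an IPI. @wait models a request
 * that carries the KVM_REQUEST_WAIT flag.
 */
static bool wait_model_should_send_ipi(enum wait_model_mode mode, bool wait)
{
	assert(mode <= WAIT_MODEL_READING_SHADOW_PAGE_TABLES);

	if (wait)
		return mode != WAIT_MODEL_OUTSIDE_GUEST_MODE;
	return mode == WAIT_MODEL_IN_GUEST_MODE;
}
```

In this model a VCPU in READING_SHADOW_PAGE_TABLES mode receives an IPI only when the request carries the wait flag.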
- Request-less VCPU Kicks
- -----------------------
- Because the decision whether to send an IPI depends on the two-variable
- Dekker memory barrier pattern, request-less VCPU kicks are almost never
- correct. A kick that generates no IPI is only useful if the receiving
- VCPU is still guaranteed to act, as the final kvm_request_pending() check
- guarantees for request-accompanying kicks; without that assurance the
- kick may not do anything useful at all. If, for instance, a request-less
- kick was made to a VCPU that was just about to set its mode to
- IN_GUEST_MODE, meaning no IPI is sent, then the VCPU thread may continue
- its entry without actually having done whatever it was the kick was meant
- to initiate.
- One exception is x86's posted interrupt mechanism. In this case, however,
- even the request-less VCPU kick is coupled with the same
- local_irq_disable() + smp_mb() pattern described above; the ON bit
- (Outstanding Notification) in the posted interrupt descriptor takes the
- role of ``vcpu->requests``. When sending a posted interrupt, PIR.ON is
- set before reading ``vcpu->mode``; dually, in the VCPU thread,
- vmx_sync_pir_to_irr() reads PIR after setting ``vcpu->mode`` to
- IN_GUEST_MODE.
- Additional Considerations
- =========================
- Sleeping VCPUs
- --------------
- VCPU threads may need to consider requests before and/or after calling
- functions that may put them to sleep, e.g. kvm_vcpu_block(). Whether they
- do or not, and, if they do, which requests need consideration, is
- architecture dependent. kvm_vcpu_block() calls kvm_arch_vcpu_runnable()
- to check if it should awaken. One reason to do so is to provide
- architectures a function where requests may be checked if necessary.
- Clearing Requests
- -----------------
- Generally it only makes sense for the receiving VCPU thread to clear a
- request. However, when the requesting thread and the receiving VCPU
- thread are executed serially, either because they are the same thread or
- because they use some form of concurrency control to temporarily execute
- synchronously, it's possible to know that the request may be cleared
- immediately, rather than waiting for the receiving VCPU thread to handle
- the request in VCPU RUN. The only current
- examples of this are kvm_vcpu_block() calls made by VCPUs to block
- themselves. A possible side-effect of that call is to make the
- KVM_REQ_UNHALT request, which may then be cleared immediately when the
- VCPU returns from the call.
- References
- ==========
- .. [atomic-ops] Documentation/core-api/atomic_ops.rst
- .. [memory-barriers] Documentation/memory-barriers.txt
- .. [lwn-mb] https://lwn.net/Articles/573436/